cBioPortal / cbioportal

cBioPortal for Cancer Genomics
https://cbioportal.org
GNU Affero General Public License v3.0
633 stars 487 forks source link

Not all CNA data gets imported #844

Closed pieterlukasse closed 8 years ago

pieterlukasse commented 8 years ago

When running the importer, I get the following messages

Reading data from:  /media/hyve/ssd/datasets/brca/tcga/data_CNA.txt
Recaching...
Finished recaching...
--> profile id:  157
--> profile name:  Putative copy-number alterations from GISTIC
--> genetic alteration type:  COPY_NUMBER_ALTERATION
--> total number of lines:  23310
Import tab delimited data for 1080 samples.
23218 records inserted into genetic_alteration
648723 records inserted into sample_cna_event
Error: only 648723 of the 651121 records were inserted.

Notice the last line: Error: only 648723 of the 651121 records were inserted. This is resulting from the check in MySQLbulkLoader.loadDataFromTempFileIntoDBMS function:

if (nLines!=updateCount && !processingClinicalData()) {
             System.err.println("Error: 

Question is: how can we check what is the problem? Why are some rows not added? Has this been troubleshooted before?

pieterlukasse commented 8 years ago

@n1zea144 : can you maybe help with this?

zheins commented 8 years ago

@pieterlukasse Good catch noticing that. In that cna file there are rows per gene with a different hugo_symbol (a different open reading frame) but the same entrez_gene_id (eg. Hugo:CXorf22 Entrez:286464; Hugo:CHDC2 Entrez:286464, the open reading frame maps to the same entrez gene ID). When these get processed they end up being duplicates in the temp loading file and only one is inserted in the table.

These rows with different ORFs also do not have the same events per sample - meaning some events are being lost or changed.

@jjgao @n1zea144 How do we want to proceed? Do we need to set up a way to handle genes that map to the same entrez id?

jjgao commented 8 years ago

@zheins The data does not make sense. CXorf22 and CHDC2 are both aliases of CFAP47 (entrez: 286464). But the same gene, there should be only one copy number value per sample. I am wondering if there is any bug in the GISTIC code. Could you download the Firehose GISTIC result and check if the error is there?

pieterlukasse commented 8 years ago

@jj, @zheins : any updates? I discussed this with @priti88 today and I think it makes sense to throw an exception in this case. I am in the process of letting the java layer abort with non zero exit status in case of errors (as part of #848) and this would be such an error. It currently occurs when loading the staging files from BRCA, which means that it is a real problem for #839, especially if we start throwing exception and aborting the import process (which again, I think we should in this case).

jjgao commented 8 years ago

@pieterlukasse: If you throw an exception, would that break the whole importing?

pieterlukasse commented 8 years ago

@jjgao seems to be an issue only for this specific CNA file from the BRCA TCGA study. @priti88 tested for her own study and did not face this issue for her CNA file. To answer you question: yes, throwing an exception stops the import process. But this is exactly what we want in case of unexpected issues.

pieterlukasse commented 8 years ago

@jjgao as discussed today, the validator and loader correctly deal with any duplications now by skipping duplicated entries. With this fix in the loader (which will be part of #1020), the error above did not occur any more. Therefore I am closing this issue. The fix also includes a change that lets the MySQLbulkLoader throw an error if this situation is encountered again (which will always be because of programming errors in some other layer of the loader).

priti88 commented 8 years ago

I am getting an error while importing the test study (study_es_0) for the first time. It works on second attempt. Error:

Finished recaching...
--> profile id:  3
--> profile name:  Putative copy-number alterations from GISTIC
--> genetic alteration type:  COPY_NUMBER_ALTERATION
--> total number of samples: 778
--> total number of data lines:  7
--> records inserted into `genetic_alteration` table: 7
--> records inserted into `sample_cna_event` table: 21

ABORTED! Error:  DB Error: only 21 of the 63 records were inserted in `sample_cna_event`.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Error occurred during data loading step. Please fix the problem and run this again to make sure study is completely loaded
ecerami commented 8 years ago

re-assigning to @pieterlukasse.

@pieterlukasse: this is high priority to fix. can you please look into this? :-)

morungos commented 8 years ago

I've come across some weird issues here, and I think they're originating in 05adf92606c107fe8dde328974864cd09951925e. Basically, the tests are weird due to closing, and I think the change in this commit makes it more common in tests for loadDataFromTempFileIntoDBMS to throw. In normal code, this would be good, but it actually means that flushAll skips both clearing the current file handle map and restoring foreign_key_checks, which is really bad.

The short version of this is: I don't think the exception handling in flushAll and loadDataFromTempFileIntoDBMS are valid yet, and they are creating a file state issue in the bulkloading logic which breaks a whole bunch of tests in ImportTabDelimData.

morungos commented 8 years ago

There's actually a good argument for some better testing of MySQLbulkLoader. My gut feel here is that validation errors here should may not throw an exception directly, maybe even at all, but set a list of errors which can be picked up at the end of flushAll and will then throw a DaoException.

There are other solutions, but generally, an error in loadDataFromTempFileIntoDBMS should not leave broken state in a calling flushAll.

pieterlukasse commented 8 years ago

hi @morungos good catch in flushAll, I missed that. Maybe we can do the clearing of the current file handle map and restoring of the foreign_key_checks in a finally block in flushAll?

Regarding ImportTabDelimData, I think other things are wrong there. I started looking at it.

Regarding "clearing current file handle map": do you mean tempFileHandle.delete()? I think it is skipped on purpose to keep the file (for troubleshooting purposes) in case of an error.

morungos commented 8 years ago

@pieterlukasse I'd actually meant the mySQLbulkLoaders.clear(); -- this leaves bulkloaders hanging around in the mySQLbulkLoaders table. I guess they should really be all flushed no matter what, but failing that, at least clearing the table would allow them to continue. Again, I think this is an area where our use of statics is problematic, as it leaves the JVM state around which means a bad test can break following tests. Converting MySQLbulkLoader to a Spring-injected component would be a huge improvement, as then we could have two classes, one for bulkloading, and one for direct, and the Spring setup could select between them rather than having a variable define the behaviour within a single class. It'd also be easier to test.

morungos commented 8 years ago

And @pieterlukasse -- yes, I'd agree there are other issues in ImportTabDelimData -- I'm trying to get the tests passing again here, and I'm still blocked on the mRna and RPPA data loading

morungos commented 8 years ago

Here's where ImportTabDelimData is still an issue, and I think the issues with bulkloading were masking issues by throwing DB exceptions before we got to these. These look like true loading logic issues to me.

Both of these happen with bulkloading off, so it is possible it's an ordering issue, with genes being added later and referential integrity not being right at a first stage. This is the misery of MySQL.

First, note that parseRPPAGenes doesn't handle | in a gene entry, and it looks from this test data that it should.

java.lang.RuntimeException: Aborted: Error found for row starting with ARAF|A-Raf_pS299: Cannot add or update a child row: a foreign key constraint fails (`cgds_test`.`genetic_alteration`, CONSTRAINT `genetic_alteration_ibfk_1` FOREIGN KEY (`ENTREZ_GENE_ID`) REFERENCES `gene` (`ENTREZ_GENE_ID`))
    at org.mskcc.cbio.portal.scripts.ImportTabDelimData.storeGeneticAlterations(ImportTabDelimData.java:462)
    at org.mskcc.cbio.portal.scripts.ImportTabDelimData.parseLine(ImportTabDelimData.java:388)
    at org.mskcc.cbio.portal.scripts.ImportTabDelimData.importData(ImportTabDelimData.java:197)
    at org.mskcc.cbio.portal.scripts.TestImportTabDelimData.importRppaData(TestImportTabDelimData.java:419)
    at org.mskcc.cbio.portal.scripts.TestImportTabDelimData.testImportRppaDataBulkloadOff(TestImportTabDelimData.java:369)

Second:

java.lang.RuntimeException: Aborted: Error found for row starting with NA: Cannot add or update a child row: a foreign key constraint fails (`cgds_test`.`genetic_alteration`, CONSTRAINT `genetic_alteration_ibfk_1` FOREIGN KEY (`ENTREZ_GENE_ID`) REFERENCES `gene` (`ENTREZ_GENE_ID`))
    at org.mskcc.cbio.portal.scripts.ImportTabDelimData.storeGeneticAlterations(ImportTabDelimData.java:462)
    at org.mskcc.cbio.portal.scripts.ImportTabDelimData.parseLine(ImportTabDelimData.java:388)
    at org.mskcc.cbio.portal.scripts.ImportTabDelimData.importData(ImportTabDelimData.java:197)
    at org.mskcc.cbio.portal.scripts.TestImportTabDelimData.importmRnaData2(TestImportTabDelimData.java:344)
    at org.mskcc.cbio.portal.scripts.TestImportTabDelimData.testImportmRnaData2BulkloadOff(TestImportTabDelimData.java:303)
morungos commented 8 years ago

OK, debugging. importmRnaData2 tests are failing because NA is logging Gene NA will be interpreted as 'Not Available' in this case. Record will be skipped for this gene.; 1x -- the test expression data uses NA everywhere and I don't understand this stuff too well, but it's on both the symbol and the gene id, and I think the symbol one is triggering missing all of the NAs. If this is about testing the aliasing, NA is not a good test.

pieterlukasse commented 8 years ago

@morungos : the tests on NA are on purpose. Since NA is a valid alias for a gene, we want to make sure row 9 of the test file gets imported correctly. Other scenarios in this test file are added to check if the loader will skip duplicate records correctly (although these assertions are not automated in the unit test yet).

Closing this ticket as the issue for which it is logged is now solved.

morungos commented 8 years ago

"NA" is handled specially by ImportTabDelimData, which is marked as a workaround because of bug in firehose. See https://github.com/cBioPortal/cbioportal/issues/839#issuecomment-203523078.

Testing an alias is right, but all I was saying is that "NA" isn't the best test for a gene alias, because the Java code handles this string specially. This was the rationale for my comment. I might have been wrong, but it was based on code inspection.

pieterlukasse commented 8 years ago

@morungos : thanks for being diligent in this. Here are some more details: the comment you saw only applies to RPPA data and the logic is implemented only for this type of profile in ImportTabDelimData.parseRPPAGenes. Personally, I would advocate for having a separate parser for RPPA data since the format is different. But this is actually (as I understand it) something that has recently been reverted (see #730). For all other data, the principle that NA is simply a valid symbol should apply.