Closed by pieterlukasse 8 years ago
@n1zea144 : can you maybe help with this?
@pieterlukasse Good catch noticing that. In that CNA file there are rows per gene with a different `hugo_symbol` (a different open reading frame) but the same `entrez_gene_id` (e.g. Hugo: CXorf22, Entrez: 286464; Hugo: CHDC2, Entrez: 286464; both open reading frames map to the same Entrez gene ID). When these get processed they end up being duplicates in the temp loading file, and only one is inserted in the table.
These rows with different ORFs also do not have the same events per sample - meaning some events are being lost or changed.
@jjgao @n1zea144 How do we want to proceed? Do we need to set up a way to handle genes that map to the same entrez id?
@zheins The data does not make sense. `CXorf22` and `CHDC2` are both aliases of `CFAP47` (Entrez: 286464). Since they are the same gene, there should be only one copy number value per sample. I am wondering if there is a bug in the GISTIC code. Could you download the Firehose GISTIC result and check if the error is there?
@jj, @zheins: any updates? I discussed this with @priti88 today and I think it makes sense to throw an exception in this case. I am in the process of letting the Java layer abort with a non-zero exit status in case of errors (as part of #848), and this would be such an error. It currently occurs when loading the staging files from BRCA, which means that it is a real problem for #839, especially if we start throwing exceptions and aborting the import process (which, again, I think we should in this case).
@pieterlukasse: If you throw an exception, would that break the whole importing?
@jjgao seems to be an issue only for this specific CNA file from the BRCA TCGA study. @priti88 tested for her own study and did not face this issue for her CNA file. To answer your question: yes, throwing an exception stops the import process. But this is exactly what we want in case of unexpected issues.
@jjgao as discussed today, the validator and loader correctly deal with any duplications now by skipping duplicated entries. With this fix in the loader (which will be part of #1020), the error above did not occur any more. Therefore I am closing this issue. The fix also includes a change that lets the `MySQLbulkLoader` throw an error if this situation is encountered again (which will always be because of programming errors in some other layer of the loader).
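The duplicate-skipping behaviour described above could look roughly like this. This is a hypothetical sketch, not the actual cBioPortal code; the class and method names are made up for illustration:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: keep only the first row per Entrez gene ID, so that e.g.
// CXorf22 and CHDC2 (both Entrez 286464) yield a single record instead
// of a duplicate in the temp loading file.
public class DuplicateSkipSketch {

    public static List<String> keepFirstPerEntrezId(List<String[]> rows) {
        Set<Long> seenEntrezIds = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (String[] row : rows) {
            String hugoSymbol = row[0];
            long entrezId = Long.parseLong(row[1]);
            if (seenEntrezIds.add(entrezId)) {  // add() returns false for duplicates
                kept.add(hugoSymbol);
            } else {
                System.err.println("Skipping duplicate entrez id " + entrezId
                        + " for symbol " + hugoSymbol);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String[]> rows = List.of(
                new String[]{"CXorf22", "286464"},
                new String[]{"CHDC2", "286464"},   // same entrez id, skipped
                new String[]{"BRCA1", "672"});
        System.out.println(keepFirstPerEntrezId(rows)); // [CXorf22, BRCA1]
    }
}
```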
I am getting an error while importing the test study (study_es_0) for the first time. It works on second attempt. Error:
```
Finished recaching...
--> profile id: 3
--> profile name: Putative copy-number alterations from GISTIC
--> genetic alteration type: COPY_NUMBER_ALTERATION
--> total number of samples: 778
--> total number of data lines: 7
--> records inserted into `genetic_alteration` table: 7
--> records inserted into `sample_cna_event` table: 21
ABORTED! Error: DB Error: only 21 of the 63 records were inserted in `sample_cna_event`.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Error occurred during data loading step. Please fix the problem and run this again to make sure study is completely loaded
```
re-assigning to @pieterlukasse.
@pieterlukasse: this is high priority to fix. can you please look into this? :-)
I've come across some weird issues here, and I think they're originating in 05adf92606c107fe8dde328974864cd09951925e. Basically, the tests are weird due to closing, and I think the change in this commit makes it more common in tests for `loadDataFromTempFileIntoDBMS` to throw. In normal code, this would be good, but it actually means that `flushAll` skips both clearing the current file handle map and restoring `foreign_key_checks`, which is really bad.
The short version of this is: I don't think the exception handling in `flushAll` and `loadDataFromTempFileIntoDBMS` is valid yet, and it is creating a file state issue in the bulkloading logic which breaks a whole bunch of tests in `ImportTabDelimData`.
There's actually a good argument for some better testing of `MySQLbulkLoader`. My gut feeling here is that validation errors should probably not throw an exception directly, maybe not even at all, but instead set a list of errors which can be picked up at the end of `flushAll` and will then throw a `DaoException`.
There are other solutions, but generally, an error in `loadDataFromTempFileIntoDBMS` should not leave broken state in a calling `flushAll`.
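The deferred-error pattern being suggested here (collect problems as they occur, raise a single exception at the end of `flushAll`) might look like the following sketch. `DaoException` is the real cBioPortal exception class named in the thread, but it is re-declared here as a stand-in so the snippet is self-contained; everything else is hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of deferred error reporting: individual loads record their errors
// in a list, and flushAll() raises once at the end if any were recorded.
public class DeferredErrorSketch {

    // stand-in for org.mskcc.cbio.portal.dao.DaoException
    static class DaoException extends Exception {
        DaoException(String message) { super(message); }
    }

    private final List<String> errors = new ArrayList<>();

    // hypothetical per-loader step: records a problem instead of throwing,
    // so one bad loader cannot leave the others half-flushed
    void loadOne(String name, boolean ok) {
        if (!ok) {
            errors.add("loader '" + name + "' failed");
        }
    }

    // flushAll always finishes its cleanup, then reports accumulated errors once
    void flushAll() throws DaoException {
        try {
            // ... flush remaining loaders here ...
        } finally {
            // cleanup that must always run (e.g. clearing the loader map)
        }
        if (!errors.isEmpty()) {
            throw new DaoException(errors.size() + " error(s): "
                    + String.join("; ", errors));
        }
    }

    public static void main(String[] args) {
        DeferredErrorSketch sketch = new DeferredErrorSketch();
        sketch.loadOne("cna", true);
        sketch.loadOne("rppa", false);
        try {
            sketch.flushAll();
        } catch (DaoException e) {
            System.out.println(e.getMessage()); // 1 error(s): loader 'rppa' failed
        }
    }
}
```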
Hi @morungos, good catch in `flushAll`, I missed that. Maybe we can do the clearing of the current file handle map and the restoring of `foreign_key_checks` in a finally block in `flushAll`?
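A finally block along those lines could guarantee the cleanup runs even when a load throws. This is an illustrative sketch only; the field and method names are made up, and the real `MySQLbulkLoader` talks to MySQL rather than to plain fields:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: run the per-file loads inside try, and put the cleanup that must
// never be skipped (clearing the handle map, re-enabling FK checks) in finally.
public class FlushAllFinallySketch {

    final Map<String, String> fileHandles = new HashMap<>();
    boolean foreignKeyChecks = true;

    void loadDataFromTempFile(String file) {
        if (file.endsWith(".bad")) {
            throw new RuntimeException("load failed for " + file);
        }
    }

    void flushAll() {
        foreignKeyChecks = false; // SET FOREIGN_KEY_CHECKS=0 in the real loader
        try {
            for (String file : fileHandles.keySet()) {
                loadDataFromTempFile(file);
            }
        } finally {
            fileHandles.clear();      // always forget the temp file handles
            foreignKeyChecks = true;  // always restore SET FOREIGN_KEY_CHECKS=1
        }
    }

    public static void main(String[] args) {
        FlushAllFinallySketch loader = new FlushAllFinallySketch();
        loader.fileHandles.put("genes.bad", "handle");
        try {
            loader.flushAll();
        } catch (RuntimeException e) {
            // even though the load failed, the cleanup still ran
            System.out.println(loader.fileHandles.isEmpty() + " "
                    + loader.foreignKeyChecks); // true true
        }
    }
}
```

The exception still propagates to the caller, so the import can abort with a non-zero exit status, but the static loader state is no longer poisoned for the next test or run.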
Regarding `ImportTabDelimData`, I think other things are wrong there. I started looking at it.
Regarding "clearing current file handle map": do you mean `tempFileHandle.delete()`? I think it is skipped on purpose to keep the file (for troubleshooting purposes) in case of an error.
@pieterlukasse I'd actually meant the `mySQLbulkLoaders.clear();` call: skipping it leaves bulkloaders hanging around in the `mySQLbulkLoaders` table. I guess they should really all be flushed no matter what, but failing that, at least clearing the table would allow them to continue. Again, I think this is an area where our use of statics is problematic, as it leaves the JVM state around, which means a bad test can break following tests. Converting `MySQLbulkLoader` to a Spring-injected component would be a huge improvement, as then we could have two classes, one for bulkloading and one for direct inserts, and the Spring setup could select between them rather than having a variable define the behaviour within a single class. It'd also be easier to test.
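The two-class idea is essentially the strategy pattern. A minimal sketch, with all names invented here and the Spring wiring only indicated in a comment:

```java
// Sketch of the suggested refactoring: one interface, two implementations,
// selected by configuration instead of a boolean flag inside a single class.
public class LoaderStrategySketch {

    interface RecordLoader {
        void insert(String[] record);
    }

    // would buffer records and bulk-load them via a temp file
    static class BulkRecordLoader implements RecordLoader {
        int buffered = 0;
        public void insert(String[] record) { buffered++; }
    }

    // would run an immediate INSERT per record
    static class DirectRecordLoader implements RecordLoader {
        int inserted = 0;
        public void insert(String[] record) { inserted++; }
    }

    public static void main(String[] args) {
        // with Spring, this choice would come from the application context
        // (e.g. a profile or property) rather than from 'new' here
        RecordLoader loader = new BulkRecordLoader();
        loader.insert(new String[]{"BRCA1", "672"});
    }
}
```

Because each test can then receive its own loader instance, a failing test can no longer leave shared static state behind for the next one.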
And @pieterlukasse: yes, I'd agree there are other issues in `ImportTabDelimData`. I'm trying to get the tests passing again here, and I'm still blocked on the mRna and RPPA data loading.
Here's where `ImportTabDelimData` is still an issue, and I think the issues with bulkloading were masking these by throwing DB exceptions before we got to them. These look like true loading logic issues to me.
Both of these happen with bulkloading off, so it is possible it's an ordering issue, with genes being added later and referential integrity not being right at a first stage. This is the misery of MySQL.
First, note that `parseRPPAGenes` doesn't handle `|` in a gene entry, and it looks from this test data that it should.
```
java.lang.RuntimeException: Aborted: Error found for row starting with ARAF|A-Raf_pS299: Cannot add or update a child row: a foreign key constraint fails (`cgds_test`.`genetic_alteration`, CONSTRAINT `genetic_alteration_ibfk_1` FOREIGN KEY (`ENTREZ_GENE_ID`) REFERENCES `gene` (`ENTREZ_GENE_ID`))
    at org.mskcc.cbio.portal.scripts.ImportTabDelimData.storeGeneticAlterations(ImportTabDelimData.java:462)
    at org.mskcc.cbio.portal.scripts.ImportTabDelimData.parseLine(ImportTabDelimData.java:388)
    at org.mskcc.cbio.portal.scripts.ImportTabDelimData.importData(ImportTabDelimData.java:197)
    at org.mskcc.cbio.portal.scripts.TestImportTabDelimData.importRppaData(TestImportTabDelimData.java:419)
    at org.mskcc.cbio.portal.scripts.TestImportTabDelimData.testImportRppaDataBulkloadOff(TestImportTabDelimData.java:369)
```
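In TCGA-style RPPA files the composite element ref puts the gene symbol(s) before the `|` and the antibody name after it, as in the failing row `ARAF|A-Raf_pS299`. A minimal split along those lines might look like this; the class and method names here are invented, not the real `parseRPPAGenes`:

```java
// Sketch: pull the gene symbol(s) out of an RPPA composite element ref such
// as "ARAF|A-Raf_pS299" (symbols before the '|', antibody name after it).
public class RppaSplitSketch {

    public static String[] parseGenes(String compositeElementRef) {
        int bar = compositeElementRef.indexOf('|');
        String genePart = (bar >= 0)
                ? compositeElementRef.substring(0, bar)
                : compositeElementRef;
        // antibodies targeting several genes list them space-separated,
        // e.g. "AKT1 AKT2 AKT3|Akt_pS473"
        return genePart.split(" ");
    }

    public static void main(String[] args) {
        System.out.println(String.join(",", parseGenes("ARAF|A-Raf_pS299")));
        System.out.println(String.join(",", parseGenes("AKT1 AKT2 AKT3|Akt_pS473")));
    }
}
```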
Second:
```
java.lang.RuntimeException: Aborted: Error found for row starting with NA: Cannot add or update a child row: a foreign key constraint fails (`cgds_test`.`genetic_alteration`, CONSTRAINT `genetic_alteration_ibfk_1` FOREIGN KEY (`ENTREZ_GENE_ID`) REFERENCES `gene` (`ENTREZ_GENE_ID`))
    at org.mskcc.cbio.portal.scripts.ImportTabDelimData.storeGeneticAlterations(ImportTabDelimData.java:462)
    at org.mskcc.cbio.portal.scripts.ImportTabDelimData.parseLine(ImportTabDelimData.java:388)
    at org.mskcc.cbio.portal.scripts.ImportTabDelimData.importData(ImportTabDelimData.java:197)
    at org.mskcc.cbio.portal.scripts.TestImportTabDelimData.importmRnaData2(TestImportTabDelimData.java:344)
    at org.mskcc.cbio.portal.scripts.TestImportTabDelimData.testImportmRnaData2BulkloadOff(TestImportTabDelimData.java:303)
```
OK, debugging. The `importmRnaData2` tests are failing because `NA` is logging `Gene NA will be interpreted as 'Not Available' in this case. Record will be skipped for this gene.; 1x`. The test expression data uses `NA` everywhere, and I don't understand this stuff too well, but it's on both the symbol and the gene id, and I think the symbol one is triggering the skipping of all the NAs. If this is about testing the aliasing, `NA` is not a good test.
@morungos: the tests on `NA` are on purpose. Since `NA` is a valid alias for a gene, we want to make sure row 9 of the test file gets imported correctly. Other scenarios in this test file were added to check whether the loader skips duplicate records correctly (although these assertions are not automated in the unit test yet).
Closing this ticket as the issue for which it is logged is now solved.
"NA" is handled specially by ImportTabDelimData
, which is marked as a workaround because of bug in firehose. See https://github.com/cBioPortal/cbioportal/issues/839#issuecomment-203523078.
Testing an alias is right, but all I was saying is that "NA" isn't the best test for a gene alias, because the Java code handles this string specially. This was the rationale for my comment. I might have been wrong, but it was based on code inspection.
@morungos: thanks for being diligent in this. Here are some more details: the comment you saw only applies to RPPA data, and the logic is implemented only for this type of profile, in `ImportTabDelimData.parseRPPAGenes`. Personally, I would advocate for having a separate parser for RPPA data since the format is different. But this is actually (as I understand it) something that has recently been reverted (see #730). For all other data, the principle that `NA` is simply a valid symbol should apply.
When running the importer, I get the following messages. Notice the last line:

```
Error: only 648723 of the 651121 records were inserted.
```
This is resulting from the check in the `MySQLbulkLoader.loadDataFromTempFileIntoDBMS` function. The question is: how can we check what the problem is? Why are some rows not added? Has this been troubleshooted before?
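From the thread, the check in question compares the number of records the bulk load reports inserted against the number of lines written to the temp file. A sketch of that kind of sanity check (hypothetical names, not the real method):

```java
// Sketch: compare the row count reported by the bulk load against the
// number of records we wrote to the temp file, and abort on a mismatch.
public class RecordCountCheckSketch {

    public static void checkInsertCount(long expected, long inserted) {
        if (inserted != expected) {
            throw new IllegalStateException("DB Error: only " + inserted
                    + " of the " + expected + " records were inserted.");
        }
    }

    public static void main(String[] args) {
        checkInsertCount(7, 7); // matches, no error
        try {
            checkInsertCount(651121, 648723);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Given the earlier findings in this thread, one plausible cause of such a mismatch is rows with duplicate unique-key values (e.g. two hugo symbols sharing one Entrez gene ID) being silently dropped during the bulk load; MySQL's `LOAD DATA ... IGNORE` skips duplicate-key rows rather than failing. Whether that is the cause here would need to be confirmed against the actual temp file.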