cBioPortal / cbioportal

cBioPortal for Cancer Genomics
https://cbioportal.org
GNU Affero General Public License v3.0

Staging files for TCGA according to new validation standards #839

Closed pieterlukasse closed 6 years ago

pieterlukasse commented 8 years ago

The expected staging file format has changed a bit in #798, so the new export of TCGA staging files needs to comply with it.

Logging this issue to keep track of the last issues to resolve.

pieterlukasse commented 8 years ago

@n1zea144 : these are the remaining issues in the last copy of http://cbio.mskcc.org/~grossb/brca-tcga.zip:

pieterlukasse commented 8 years ago

@aderidder : can you help review this? @ecerami : this is where you can follow the progress of the staging files effort :arrow_up:

n1zea144 commented 8 years ago

I've addressed the issues I can in the immediate future.

@jjgao has code in genomic-overview.js to map chr 23->X, 24->Y
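For reference, the 23->X, 24->Y mapping can be sketched in a few lines of Python. This is only an illustration of the idea, not the code in genomic-overview.js; the function name and the handling of a `chr` prefix are assumptions.

```python
# Hypothetical helper: rewrite the numeric "chromosomes" 23 and 24 as X and Y.
NUMERIC_TO_SEX_CHROMOSOME = {"23": "X", "24": "Y"}


def normalize_chromosome(value):
    """Return a chromosome label with 23/24 rewritten as X/Y."""
    label = str(value).strip()
    # Assumption: an optional "chr" prefix may be present and should be dropped.
    if label.lower().startswith("chr"):
        label = label[3:]
    # Labels other than 23/24 are passed through, upper-cased for consistency.
    return NUMERIC_TO_SEX_CHROMOSOME.get(label, label.upper())
```

Whether the mapping should live in the validator, the loader, or only in the front end is a separate decision.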

I'm not sure how to answer many/most of the issues related to gene symbol/id mapping. One issue is a difference in gene info between the tcga converter and the info stored in the target cbioportal database. Another issue is hugo symbols coming out of the broad firehose that map to multiple entrez ids or to entries in NCBI file that have no associated gene symbols. For now the only thing may be to flag them?
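Flagging could look roughly like the sketch below, which groups Entrez ids per Hugo symbol and reports the symbols that map to more than one id. The function name and the input shape (pairs of symbol and id) are assumptions made for illustration, and the gene names in the usage note are made up.

```python
from collections import defaultdict


def flag_ambiguous_symbols(gene_rows):
    """Return {hugo_symbol: [entrez ids]} for symbols with more than one id.

    gene_rows is assumed to be an iterable of (hugo_symbol, entrez_id) pairs;
    entrez_id may be None when no id is known for that row.
    """
    ids_by_symbol = defaultdict(set)
    for symbol, entrez_id in gene_rows:
        if entrez_id is not None:
            ids_by_symbol[symbol].add(entrez_id)
    # Only symbols that map to two or more distinct ids are ambiguous.
    return {s: sorted(ids) for s, ids in ids_by_symbol.items() if len(ids) > 1}
```

For example, `flag_ambiguous_symbols([("GENE_A", 1), ("GENE_A", 2), ("GENE_B", 3)])` reports only `GENE_A`.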

I've uploaded a new brca-tcga.zip file to the web server for more testing.

aderidder commented 8 years ago

@n1zea144 thanks for the new version with the fixes!

Here's my feedback. For this test I removed: gistic and mutsig files as they are not validated yet. The study does not contain case lists, nor does the meta_study contain the add_global_case_list flag.

First validation

I fixed these two issues by merging the files and creating a new meta file. I ran into another issue whilst merging: There are several patients which are in the samples file, but not in the patients file (e.g. TCGA-E2-A1IP).
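A quick way to list such patients is a set difference between the two files. The sketch below assumes tab-separated files with a `PATIENT_ID` column and `#`-prefixed metadata lines; both are assumptions, so adjust them to the actual staging files.

```python
import csv
import io


def patients_missing_from_patient_file(samples_text, patients_text,
                                       id_column="PATIENT_ID"):
    """Return sorted patient ids found in the samples file but not in the patients file."""

    def ids(text):
        # Assumption: metadata/comment lines start with '#', and the first
        # remaining line holds the column names.
        rows = (line for line in io.StringIO(text) if not line.startswith("#"))
        return {row[id_column] for row in csv.DictReader(rows, delimiter="\t")}

    return sorted(ids(samples_text) - ids(patients_text))
```

The function takes the file contents as strings, so it can be run on `open(path).read()` for each of the two clinical files.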

Second validation

Green Results

Yellow Results

Red Results

Most of these issues were already mentioned earlier in this issue and Ben already pointed out that they hadn't been solved. Also, some of the issues are on the validator side.

I think the following were not yet mentioned:

pieterlukasse commented 8 years ago

@jjgao @n1zea144 : regarding the gene symbol/id mapping, I would like to propose the following:

Conclusion: not using the hugo symbols should solve most of the problems, and the remaining problems will be merely some warnings about missing entrez (which we can later make more informative when integrating extra data from NCBI).

pieterlukasse commented 8 years ago

I added support for "chromosomes" 23 and 24

aderidder commented 8 years ago

Hi @n1zea144 I'm currently looking into the mutsig files. Am I correctly assuming you're using the tcga MutSig2.0 files? I've been trying to generate my own mutsig files, but genepattern only supports MutSigCV (1.0, 1.1 and 1.2). Downloading MutSig from broad also doesn't provide MutSig2.0; available versions are MutSigCV 1.4 and older. From what I've seen so far, the headers are different. How do you generate your MutSig files?

n1zea144 commented 8 years ago

Hi @aderidder. Yes, MutSig 2.0 is what we use. We grab the following file from firehose:

MutSigNozzleReport2.0.Level_4:.sig_genes.txt;

pieterlukasse commented 8 years ago

hi @n1zea144 : are you also able to run 2.0 on your own data? We couldn't find a way to download/access this tool, only older versions.

n1zea144 commented 8 years ago

Hi Pieter, no. I've only processed the mutsig data coming out of the broad firehose.

aderidder commented 8 years ago

Hi @n1zea144 We've just discussed the clinical data with JJ and Ethan and I'm updating the documentation to fit this new standard. For the staging files could you change the meta files to the new standards? What it boils down to is:

cancer_study_identifier: same value specified in meta_study.txt
genetic_alteration_type: CLINICAL
datatype: PATIENT_ATTRIBUTES or SAMPLE_ATTRIBUTES
data_filename:

Also, could you drop the row in the data files that specifies whether each column contains SAMPLE or PATIENT attributes? Given that we will now have separate patient and sample attributes files, that row is no longer necessary.
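To make the request concrete, here is a rough sketch of both steps: writing a clinical meta file with the four fields above, and dropping the SAMPLE/PATIENT row from an old-style data file. The function names are hypothetical, and the way the attribute-type row is recognised (every cell is PATIENT or SAMPLE, optionally after a leading `#`) is an assumption about the old format, not a description of the actual converter.

```python
ALLOWED_DATATYPES = ("PATIENT_ATTRIBUTES", "SAMPLE_ATTRIBUTES")


def build_clinical_meta(study_id, datatype, data_filename):
    """Return the text of a clinical meta file in the new format."""
    if datatype not in ALLOWED_DATATYPES:
        raise ValueError("datatype must be one of %s" % (ALLOWED_DATATYPES,))
    fields = [
        ("cancer_study_identifier", study_id),  # same value as in meta_study.txt
        ("genetic_alteration_type", "CLINICAL"),
        ("datatype", datatype),
        ("data_filename", data_filename),
    ]
    return "".join("%s: %s\n" % (key, value) for key, value in fields)


def drop_attribute_type_row(lines):
    """Return the lines without the first row whose cells are all PATIENT or SAMPLE."""
    kept = []
    dropped = False
    for line in lines:
        cells = line.rstrip("\n").lstrip("#").split("\t")
        is_type_row = all(c.strip() in ("PATIENT", "SAMPLE") for c in cells)
        if is_type_row and not dropped:
            dropped = True  # skip only the first matching row
            continue
        kept.append(line)
    return kept
```

A one-off transformation script for old-format studies could be built from the second function, which would keep the validator limited to the new format.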

Thanks!

morungos commented 8 years ago

Is this change (aka #798) going to be a breaking one on old data files? We have an import pipeline that does our packaging, and I need to know when I'll have to adapt.

pieterlukasse commented 8 years ago

hi @morungos, yes, our changes will require some adjustments in your pipeline. I will be summarizing the changes so that you will know what needs to be done. Our goal is to make a PR to rc in a matter of weeks now.

n1zea144 commented 8 years ago

Yes, I'll make these changes asap. I think I mentioned that ImportClinicalData has not been updated to accommodate the missing datatype row. In fact, it will have to be updated to support both file formats, as we have a large amount of curated data using the old format.

aderidder commented 8 years ago

hi @n1zea144 are you suggesting that both the old and the new variants of clinical data should still be supported by both the validator and the loader? Yesterday we decided in a meeting with @jjgao and @ecerami to try to switch to the new format and if necessary to provide a transformation script to convert the data from the old to the new format. Would that be a workable solution?

n1zea144 commented 8 years ago

Not publicly and not the validator, but internal (MSK) and in the ImportClinicalData, probably for a little while.

aderidder commented 8 years ago

@n1zea144 and @zheins thanks for the new version!

Here's my feedback.

New Issues

Yellow Results

Red Results

Previously Existing Issues

The study does not contain case lists, nor does the meta_study contain the add_global_case_list flag.

Errors:

zheins commented 8 years ago

Hi @pieterlukasse, @aderidder, I have made some changes and will be uploading the new study for you shortly. The issues with the meta data files should be resolved, and case lists are now generated.

The values containing [Not Applicable], [Not Available], etc. are handled in the java layer import code - these values come directly from firehose, they aren't generated by the conversion code.

The samples being defined twice in the samples clinical file is a result of us not using the vial number from the TCGA barcode. Since we truncate everything after the sample level, duplicates can occur. @n1zea144, @jjgao, do we want to do anything about this?

The seg data issue is a problem with the firehose data. There's nothing we can really do besides filter it out or allow it to drop. @jjgao, is it OK for a segment to have start position = end position? It just means the event affects only that one spot.

Missing values for HGVSp_Short seem to occur on alternative sequence genes or specific ORFs. As this value comes from vcf2maf annotation, there is no quick fix. Swissprot is derived from the tcga data. For these, is the desire to enter NA for blanks instead of leaving blanks?

pieterlukasse commented 8 years ago

@zheins thanks, sounds good. I'm interested in hearing more about a possible solution for the repeated samples and the seg data problem.

Regarding HGVSp_Short, maybe these need to be filtered out? Some mutations are already filtered out.

I found a new problem in the clinical data. In patient clinical data there are a few columns that are of type NUMBER but actually contain other characters. E.g. like this column:

IHC Score
IHC Score
NUMBER
1
IHC_SCORE
[Not Available]
[Not Available]
[Not Available]
2+
[Not Available]

"2+" is not a number, so maybe this column should be STRING? Or the value should be just "2"?

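A check along these lines would catch such values in a NUMBER column. It is only a sketch: the set of placeholders treated as "no value" is an assumption (the bracketed ones come straight from firehose, as noted earlier in this thread), and the function name is made up.

```python
# Assumption: these strings mean "no value" and are allowed in NUMBER columns.
NULL_PLACEHOLDERS = {"[Not Available]", "[Not Applicable]", "NA", ""}


def non_numeric_values(values):
    """Return the values of a NUMBER column that cannot be parsed as a number."""
    bad = []
    for value in values:
        if value.strip() in NULL_PLACEHOLDERS:
            continue
        try:
            float(value)
        except ValueError:
            bad.append(value)
    return bad
```

On the IHC_SCORE excerpt above this would report only `2+`, which leaves the choice between retyping the column as STRING or cleaning the value.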
pieterlukasse commented 8 years ago

@n1zea144 @zheins : just tested the new BRCA file. Here is the summary of what we need to fix to be able to remove the last errors:

jjgao commented 8 years ago

@pieterlukasse @aderidder:

pieterlukasse commented 8 years ago

Hi @jjgao : you can use the test data shared by @zheins : I will send you the link

jjgao commented 8 years ago

@pieterlukasse:

The duplicated samples issue is tricky. We definitely don't want to put the vial number into the sample id, but a sample with different vial numbers may have different values of certain attribute, e.g. time of collection. We almost need a vial concept in our data model. But let's not make it too complicated for now. Maybe we could just pick the first vial for now?
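Picking the first vial could be sketched as below. It assumes the fourth dash-separated field of the barcode holds a two-digit sample code followed by the vial letter (e.g. `01A`), and it interprets "first" as the alphabetically lowest vial; both the interpretation and the function names are assumptions.

```python
def sample_id_and_vial(barcode):
    """Split a barcode into (sample-level id, vial letter)."""
    parts = barcode.split("-")
    sample_field = parts[3]  # e.g. "01A": sample code "01", vial "A"
    sample_id = "-".join(parts[:3] + [sample_field[:2]])
    return sample_id, sample_field[2:]


def pick_first_vial(rows):
    """Keep one attribute record per sample id, taken from the lowest vial.

    rows is assumed to be an iterable of (barcode, attributes) pairs.
    """
    chosen = {}
    for barcode, attributes in rows:
        sample_id, vial = sample_id_and_vial(barcode)
        if sample_id not in chosen or vial < chosen[sample_id][0]:
            chosen[sample_id] = (vial, attributes)
    return {sample_id: attrs for sample_id, (_, attrs) in chosen.items()}
```

The records from other vials are silently dropped here; a real implementation would probably want to log a warning for each of them.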

If the single position segs only happen once or very few times, I would not bother to deal with them.
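If it turns out to be useful to at least count them, something like this sketch would separate single-position segments from the rest so they can be reported. The column names `loc.start` and `loc.end` are assumed here; pass other keys if the seg file uses different headers.

```python
def split_single_position_segments(segments, start_key="loc.start", end_key="loc.end"):
    """Return (kept, flagged), where flagged holds segments with start == end.

    segments is assumed to be an iterable of dicts, one per row of the seg file.
    """
    kept, flagged = [], []
    for segment in segments:
        if int(segment[start_key]) == int(segment[end_key]):
            flagged.append(segment)
        else:
            kept.append(segment)
    return kept, flagged
```

The caller can then decide whether `flagged` only produces a warning or is also left out of the import.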

pieterlukasse commented 8 years ago

@jjgao : OK, we will implement your suggestions and just add a warning in both the validator and loader. Maybe for the duplicated samples it would be good to log a separate issue so that it gets solved at some point?

pieterlukasse commented 8 years ago

@jjgao : one more small issue: in the RPPA data we have a few lines that start with NA|... . The loader currently interprets these as the gene alias NA (aka XK, Entrez 7504), but we suspect that in this case it actually means Not Applicable or Not Available. If you check the RPPA data in the latest BRCA download shared today, you will see what I mean.

jjgao commented 8 years ago

@pieterlukasse: you are right. It was a firehose error. Their gene list is not complete and therefore could not map some of the genes. Could you give a warning in this case?
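The warning could be driven by a check like the one below, which lists the row identifiers whose gene part is exactly `NA`. It assumes the first column has the form `GENE|...`, as in the lines Pieter describes; the function name and the sample identifiers in the usage note are made up.

```python
def rppa_rows_with_unmapped_gene(element_refs):
    """Return the identifiers whose gene part (before the first '|') is 'NA'."""
    return [ref for ref in element_refs if ref.split("|", 1)[0].strip() == "NA"]
```

For example, `rppa_rows_with_unmapped_gene(["NA|ANTIBODY_1", "GENE_A|ANTIBODY_2"])` returns `["NA|ANTIBODY_1"]`. Note that this cannot tell a genuine use of the alias NA from a missing mapping, so it is only suitable for a warning, not an error.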

pieterlukasse commented 8 years ago

@jjgao sure, we will do that

pieterlukasse commented 8 years ago

@jjgao : have you logged this bug at firehose, or do you want us to do it?

jjgao commented 8 years ago

@pieterlukasse: I haven't. Please feel free to do that.