Staging files for TCGA according to new validation standards

pieterlukasse commented 8 years ago

The expected staging file format has changed a bit in #798, so the new export of TCGA staging files needs to comply with it.

Logging this issue to keep track of the last issues to resolve.

pieterlukasse commented 8 years ago

@n1zea144 : these are the remaining issues in the last copy of http://cbio.mskcc.org/~grossb/brca-tcga.zip:

[x] the PATIENT_ID is now missing in clinical samples file.
[x] for brca_tcga_data_cna_hg19.seg I get: Unknown chromosome "23" : shall we map '23' to X and '24' to Y? I couldn't find any mapping in ImportCopyNumberSegmentData.java
[x] I found a new issue regarding RPPA data. According to #730 one should also provide the Z-SCORES file, next to the LOG2-VALUE file. Could you add this to the next test release of the test study you will be generating?
[ ] in multiple files there seem to be a few wrong entrez ids remaining. E.g. in data_expression_median.txt: line 10136: Gene symbol does not match given Entrez id; found in file: '(MIR16,-284)'. According to the NCBI file, the correct entrez id for MIR16 is 51573, where the symbol is GDE1 and 363E6.2|MIR16 are the synonyms
[ ] ERROR: data_methylation_hm450.txt (also similar problem in data_linear_CNA.txt): lines [6450, 7537, 12322, (2 more)]: Entrez gene id not known to the cBioPortal instance.; found in file: ['105376839', '105378503', '654780', '(2 more)']. I checked these three examples and indeed none of them has a valid Hugo Symbol (column 10 of the NCBI file), and therefore are not present in the DB. How to deal with these issues? Maybe #805 needs to be adjusted to use the alias or symbol column (column 2) from the NCBI file until this entrez ID gets an official symbol?
[ ] A few data lines refer to (what is now) a gene alias instead of a gene symbol: ERROR: data_mutations_extended.txt: lines [51, 2053]: Gene alias (B3GNT1) maps to multiple Entrez ids (10678/11041), please specify which one you mean
[ ] Some gene symbols could not be found, also not in the NCBI file (e.g. I checked RP11-114H24.4): ERROR: data_mutations_extended.txt: lines [180, 443, 509, (470 more)]: Gene symbol not known to the cBioPortal instance.; found in file: ['RP11-114H24.4', 'RP11-330M2.4', 'AC090825.1', '(182 more)']
[x] the column Entrez_Gene_Id has '0' values in mutations file. Our validator expects empty cell in this case, otherwise it will check if the given Hugo symbol matches the Entrez id '0' , which is not the case.
[ ] there is also a remaining issue in CNA data that is apparently a result of a bug, see #844
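
For the Entrez_Gene_Id '0' item in the list above, a minimal sketch of the clean-up on the export side could look like the following. The function name and the example rows are illustrative assumptions, not part of the TCGA converter.

```python
def blank_zero_entrez_ids(rows, column="Entrez_Gene_Id"):
    """Return copies of the rows with a '0' Entrez id replaced by an empty cell."""
    cleaned = []
    for row in rows:
        fixed = dict(row)  # copy so the caller's rows are not modified
        if fixed.get(column, "").strip() == "0":
            fixed[column] = ""
        cleaned.append(fixed)
    return cleaned


# Illustrative rows only; the real input would come from the mutations file.
rows = [
    {"Hugo_Symbol": "RP11-114H24.4", "Entrez_Gene_Id": "0"},
    {"Hugo_Symbol": "GDE1", "Entrez_Gene_Id": "51573"},
]
cleaned = blank_zero_entrez_ids(rows)
```

With an empty cell the validator has no Entrez id to compare the symbol against, which is the behaviour described above.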

pieterlukasse commented 8 years ago

@aderidder : can you help review this? @ecerami : this is where you can follow the progress of the staging files effort :arrow_up:

n1zea144 commented 8 years ago

I've addressed the issues I can in the immediate future.

@jjgao has code in genomic-overview.js to map chr 23->X, 24->Y

I'm not sure how to answer many/most of the issues related to gene symbol/id mapping. One issue is a difference in gene info between the tcga converter and the info stored in the target cbioportal database. Another issue is hugo symbols coming out of the broad firehose that map to multiple entrez ids or to entries in the NCBI file that have no associated gene symbols. For now, the only thing we can do may be to flag them?

I've uploaded a new brca-tcga.zip file to the web server for more testing.

aderidder commented 8 years ago

@n1zea144 thanks for the new version with the fixes!

Here's my feedback. For this test I removed: gistic and mutsig files as they are not validated yet. The study does not contain case lists, nor does the meta_study contain the add_global_case_list flag.

First validation

The validator fails immediately due to the fact that clinical data is in two files, which is not yet supported by the validator. Already described in #808
The meta clinical files do not match the required format. The meta file e.g. contains "show_profile_in_analysis_tab", which is not allowed.

I fixed these two issues by merging the files and creating a new meta file. I ran into another issue whilst merging: There are several patients which are in the samples file, but not in the patients file (e.g. TCGA-E2-A1IP).

Second validation

Green Results

General
data_rppa.txt
data_rppa_Zscores.txt
data_methylation_hm27.txt
data_RNA_Seq_v2_expression_median.txt
data_RNA_Seq_v2_expression_median_Zscores.txt

Yellow Results

data_bcr_clinical_data_merged.txt: due to me having deleted all the clinical attributes and the validator warning me that many new ones will be created.

Red Results

brca_tcga_data_cna_hg19.seg
- Start position is not lower than end position (e.g. 101050704/101050704)
- Unknown chromosome (23)
data_CNA.txt
- Entrez gene id not known to the cBioPortal instance (e.g. 654780, -1035)
- Warning: Gene symbol and Entrez identifier do not match, the symbol will be ignored (TBC1D7,107080638)
data_linear_CNA.txt
- Entrez gene id not known to the cBioPortal instance (e.g. 654780, -1035)
- Warning: Gene symbol and Entrez identifier do not match, the symbol will be ignored (TBC1D7,107080638)
data_expression_median.txt
- Entrez gene id not known to the cBioPortal instance (e.g. 654780, -284)
- Warning: Gene symbol and Entrez identifier do not match, the symbol will be ignored (TBC1D7,107080638)
data_methylation_hm450.txt
- Entrez gene id not known to the cBioPortal instance (e.g. 105376839, 105378503)
- Warning: Gene symbol and Entrez identifier do not match, the symbol will be ignored (TBC1D7,107080638)
data_mutations_extended.txt
- Value in column 'SWISSPROT' is invalid (empty)
- Already described in #806; the validator should only give a warning
- Value in column 'HGVSp_Short' is invalid (empty)
- Already described in #806; the validator should only give a warning, except when Variant_Classification in ["Splice_Site", ...].
- Value in column 'Protein_position' is invalid (empty)
- Already described in #806; the validator should not give a warning here
- Gene symbol not known to the cBioPortal instance (e.g. RP11-114H24.4, RP11-330M2.4)
- Entrez gene id not known to the cBioPortal instance (e.g. 100124696, 654780)

Most of these issues were already mentioned earlier in this issue and Ben already pointed out that they hadn't been solved. Also, some of the issues are on the validator side.

I think the following were not yet mentioned:

[x] Add add_global_case_list flag to the meta_study
[x] Update the clinical meta file(s), see: https://github.com/thehyve/cbioportal/wiki/File-Formats
[x] Shouldn't the patient file contain all the patient identifiers? If so, add the missing patient identifiers to the patient file.
[ ] Seg file, start position is not lower than end position (e.g. 101050704/101050704)

pieterlukasse commented 8 years ago

@jjgao @n1zea144 : regarding the gene symbol/id mapping, I would like to propose the following:

use the Entrez_Id column as much as possible and leave the Hugo Symbol column out (or empty/NA when the format requires the column to be there, like MAF). Entrez_Id should be unambiguous and will only give problems when entrez_ids are revoked by NCBI (and I'm hoping this can later be solved by using info from ftp://ftp.ncbi.nih.gov/gene/DATA/gene_history.gz). For now these rows can be flagged with a warning and skipped during import.
regarding "hugo symbols coming out of the broad firehose that map to multiple entrez ids", this should not happen and was in fact fixed in #805. What does happen is that a gene alias can map to multiple entrez ids. This ambiguous situation is prevented when using Entrez ids as proposed above. In all other cases, if entrez Id is not used and a gene alias is still used, it is currently flagged with a warning by the validator and should not be a blocking issue (although this specific row should not be imported).
for all the cases where both the hugo symbol and entrez id are provided, we currently check for an existing link between both values, either as an official symbol or as an alias (see also https://github.com/cBioPortal/cbioportal/issues/823#issuecomment-181416643), and give an error if there is a mismatch. We could change this into a warning, but I'm not sure if this is a wise thing to do. Please let me know your comments.

Conclusion: not using the hugo symbols should solve most of the problems, and the remaining problems will be merely some warnings about missing entrez (which we can later make more informative when integrating extra data from NCBI).
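
The proposal above can be sketched roughly as follows. This is only an illustration under assumptions: the function name and the lookup tables are hypothetical, the Entrez id wins whenever it is given (the symbol is not cross-checked), and a unique alias is resolved with a warning while an ambiguous one causes the row to be skipped.

```python
def resolve_gene(entrez_id, symbol, known_ids, symbol_to_id, alias_to_ids):
    """Return (entrez_id_or_None, warning_or_None) for one data row."""
    if entrez_id is not None:
        # Prefer the Entrez id; the symbol is ignored in this sketch.
        if entrez_id in known_ids:
            return entrez_id, None
        return None, f"Entrez gene id {entrez_id} not known; row skipped"
    if not symbol:
        return None, "No gene identifier given; row skipped"
    if symbol in symbol_to_id:
        return symbol_to_id[symbol], None
    candidates = sorted(alias_to_ids.get(symbol, ()))
    if len(candidates) == 1:
        return candidates[0], f"Gene alias {symbol} used instead of official symbol"
    if len(candidates) > 1:
        ids = "/".join(str(c) for c in candidates)
        return None, f"Gene alias {symbol} maps to multiple Entrez ids ({ids}); row skipped"
    return None, f"Gene symbol {symbol} not known; row skipped"


# Tiny lookup tables built from the examples mentioned in this thread.
known_ids = {51573, 10678, 11041}
symbol_to_id = {"GDE1": 51573}
alias_to_ids = {"MIR16": {51573}, "B3GNT1": {10678, 11041}}

by_id = resolve_gene(51573, "GDE1", known_ids, symbol_to_id, alias_to_ids)
unknown = resolve_gene(654780, None, known_ids, symbol_to_id, alias_to_ids)
ambiguous = resolve_gene(None, "B3GNT1", known_ids, symbol_to_id, alias_to_ids)
alias = resolve_gene(None, "MIR16", known_ids, symbol_to_id, alias_to_ids)
```

Every non-resolvable case returns a warning rather than raising, which matches the idea that these rows should be flagged and skipped instead of blocking the import.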

pieterlukasse commented 8 years ago

I added support for "chromosomes" 23 and 24
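
For reference, the mapping itself is small. The sketch below is not the validator code; the names are made up for illustration and it assumes human chromosomes 1 to 22 plus X and Y are the only valid values.

```python
# Numeric aliases seen in the seg file, mapped to the usual names.
CHROMOSOME_ALIASES = {"23": "X", "24": "Y"}
VALID_CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y"}


def normalize_chromosome(value):
    """Map '23'/'24' to 'X'/'Y' and reject anything that is not a known chromosome."""
    chrom = value.strip().upper()
    chrom = CHROMOSOME_ALIASES.get(chrom, chrom)
    if chrom not in VALID_CHROMOSOMES:
        raise ValueError(f'Unknown chromosome "{value}"')
    return chrom
```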

aderidder commented 8 years ago

Hi @n1zea144 I'm currently looking into the mutsig files. Am I correctly assuming you're using the tcga MutSig2.0 files? I've been trying to generate my own mutsig files, but genepattern only supports MutSigCV (1.0, 1.1 and 1.2). Downloading MutSig from broad also doesn't provide MutSig2.0; available versions are MutSigCV 1.4 and older. From what I've seen so far, the headers are different. How do you generate your MutSig files?

n1zea144 commented 8 years ago

Hi @aderidder. Yes, MutSig 2.0 is what we use. We grab the following file from firehose:

MutSigNozzleReport2.0.Level_4:.sig_genes.txt;

pieterlukasse commented 8 years ago

hi @n1zea144 : are you also able to run 2.0 on your own data? We couldn't find a way to download/access this tool, only older versions.

n1zea144 commented 8 years ago

Hi Pieter, no. I've only processed the mutsig data coming out of the broad firehose.

aderidder commented 8 years ago

Hi @n1zea144 We've just discussed the clinical data with JJ and Ethan and I'm updating the documentation to fit this new standard. For the staging files could you change the meta files to the new standards? What it boils down to is:

cancer_study_identifier: same value specified in meta_study.txt
genetic_alteration_type: CLINICAL
datatype: PATIENT_ATTRIBUTES or SAMPLE_ATTRIBUTES
data_filename:
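
As a rough sketch of what generating such a meta file could look like: the helper name is hypothetical, the study identifier `brca_tcga` is an assumption based on the file names in this thread, and the data file name is whatever the export actually writes.

```python
ALLOWED_DATATYPES = ("PATIENT_ATTRIBUTES", "SAMPLE_ATTRIBUTES")


def clinical_meta(study_id, datatype, data_filename):
    """Build the text of a clinical meta file with only the four required fields."""
    if datatype not in ALLOWED_DATATYPES:
        raise ValueError(f"datatype must be one of {ALLOWED_DATATYPES}")
    fields = [
        ("cancer_study_identifier", study_id),  # same value as in meta_study.txt
        ("genetic_alteration_type", "CLINICAL"),
        ("datatype", datatype),
        ("data_filename", data_filename),
    ]
    return "".join(f"{key}: {value}\n" for key, value in fields)


meta_text = clinical_meta(
    "brca_tcga",  # assumed study identifier
    "PATIENT_ATTRIBUTES",
    "data_bcr_clinical_data_patient.txt",
)
```

Fields such as show_profile_in_analysis_tab, profile_name and profile_description are deliberately left out.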

Also, from the data files, could you drop the row which specifies whether the column contains SAMPLE or PATIENT attributes? Given that we will now have separate patient and sample attributes files, that row is no longer necessary.

Thanks!

morungos commented 8 years ago

Is this change (aka #798) going to be a breaking one on old data files? We have an import pipeline that does our packaging, and I need to know when I'll have to adapt.

pieterlukasse commented 8 years ago

hi @morungos, yes, our changes will require some adjustments in your pipeline. I will be summarizing the changes so that you will know what needs to be done. Our goal is to make a PR to rc in a matter of weeks now.

n1zea144 commented 8 years ago

Yes, I'll make these changes asap. I think I mentioned that ImportClinicalData has not been updated to accommodate the missing datatype row. In fact it will have to be updated to support both file formats, as we have a large amount of curated data using the old format.

aderidder commented 8 years ago

hi @n1zea144 are you suggesting that both the old and the new variants of clinical data should still be supported by both the validator and the loader? Yesterday we decided in a meeting with @jjgao and @ecerami to try to switch to the new format and if necessary to provide a transformation script to convert the data from the old to the new format. Would that be a workable solution?

n1zea144 commented 8 years ago

Not publicly and not the validator, but internal (MSK) and in the ImportClinicalData, probably for a little while.

aderidder commented 8 years ago

@n1zea144 and @zheins thanks for the new version!

Here's my feedback.

New Issues

Yellow Results

meta_bcr_clinical_patient.txt
- the meta file has more fields than necessary: show_profile_in_analysis_tab, profile_name, profile_description should be removed
meta_bcr_clinical_sample.txt
- the meta file has more fields than necessary: show_profile_in_analysis_tab, profile_name, profile_description should be removed
meta_mutsig.txt
- the meta file has more fields than necessary: show_profile_in_analysis_tab, profile_name, profile_description should be removed

Red Results

meta_gistic_genes_amp
- Missing field 'reference_genome_id' in meta file
meta_gistic_genes_del
- Missing field 'reference_genome_id' in meta file
data_bcr_clinical_data_patient.txt
- Lots of columns have invalid number values: [Not Applicable], [Not Available], etc.
- columns: 15, 35, 46, (14 more)
data_bcr_clinical_data_sample.txt
- Lots of columns have invalid number values: [Not Applicable], [Not Available], etc.
- columns: 8, 10, 18, (2 more)
- Sample defined twice in clinical file TCGA-A7-A13G-01, TCGA-B6-A1KC-01, TCGA-A7-A26E-01, (14 more)

Previously Existing Issues

The study does not contain case lists, nor does the meta_study contain the add_global_case_list flag.

Errors:

brca_tcga_data_cna_hg19.seg
- Start position is not lower than end position (e.g. 101050704/101050704)
data_CNA.txt
- Entrez gene id not known to the cBioPortal instance (e.g. 654780, -1035)
- Warning: Gene symbol and Entrez identifier do not match, the symbol will be ignored (TBC1D7,107080638)
data_linear_CNA.txt
- Entrez gene id not known to the cBioPortal instance (e.g. 654780, -1035)
- Warning: Gene symbol and Entrez identifier do not match, the symbol will be ignored (TBC1D7,107080638)
data_expression_median.txt
- Entrez gene id not known to the cBioPortal instance (e.g. 654780, -284)
- Warning: Gene symbol and Entrez identifier do not match, the symbol will be ignored (TBC1D7,107080638)
data_methylation_hm450.txt
- Entrez gene id not known to the cBioPortal instance (e.g. 105376839, 105378503)
- Warning: Gene symbol and Entrez identifier do not match, the symbol will be ignored (TBC1D7,107080638)
data_mutations_extended.txt
- Errors:
  - Gene symbol not known to the cBioPortal instance (e.g. AC009365.3, RP11-407N17.3, AL035406.1)
  - Entrez gene id not known to the cBioPortal instance (e.g. 100124696, 654780)
- Warnings:
  - Value in column 'SWISSPROT' is invalid (empty)
  - Value in column 'HGVSp_Short' is invalid (empty)

zheins commented 8 years ago

Hi @pieterlukasse, @aderidder, I have made some changes and will be uploading the new study for you shortly. The issues with the meta data files should be resolved, and case lists are now generated.

The values containing [Not Applicable], [Not Available], etc. are handled in the java layer import code - these values come directly from firehose, they aren't generated by the conversion code.

The samples being defined twice in the samples clinical file is a result of us not using the vial number from the TCGA barcode. Since we truncate everything after the sample level, duplicates can occur. @n1zea144, @jjgao, do we want to do anything about this?

The seg data issue is a problem with the firehose data - there's nothing we can really do besides filter it out or allow it to drop. @jjgao is having start position = end position an ok thing to happen? It just means the event affects only that one spot.
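
To get a feel for how often this happens, a quick count along these lines may help. It is only a sketch: segments are passed in as (chromosome, start, end) tuples, the function name is made up, and apart from the 101050704 position quoted earlier in this thread the example values are invented.

```python
def classify_segments(segments):
    """Split segments into single-position (start == end) and inverted (start > end)."""
    single_position = []
    inverted = []
    for chrom, start, end in segments:
        if start == end:
            single_position.append((chrom, start, end))
        elif start > end:
            inverted.append((chrom, start, end))
    return single_position, inverted


# Illustrative input; only the 101050704 position is taken from the report above.
segments = [
    ("1", 3218610, 95674710),
    ("9", 101050704, 101050704),
    ("23", 500, 100),
]
single_position, inverted = classify_segments(segments)
```

Single-position segments could then be reported as a warning, while inverted ones still look like real errors.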

Missing values for HGVSp_Short seem to occur on alternative sequence genes or specific ORFs. As this value comes from vcf2maf annotation, there is no quick fix. Swissprot is derived from the tcga data. For these, is the desire to enter NA for blanks instead of leaving blanks?

pieterlukasse commented 8 years ago

@zheins thanks, sounds good. I'm interested in hearing more about a possible solution for the repeated samples and the seg data problem.

Regarding HGVSp_Short, maybe these need to be filtered out? Some mutations are already filtered out.

I found a new problem in the clinical data. In patient clinical data there are a few columns that are of type NUMBER but actually contain other characters. E.g. like this column:

IHC Score
IHC Score
NUMBER
1
IHC_SCORE
[Not Available]
[Not Available]
[Not Available]
2+
[Not Available]

"2+" is not a number, so maybe this column should be STRING? Or the value should be just "2"?

pieterlukasse commented 8 years ago

@n1zea144 @zheins : just tested the new BRCA file. Here is the summary of what we need to fix to be able to remove the last errors:

data clinical patient: fix the columns where values such as "2+" are found (see my last comment above :arrow_up: )
data clinical sample: solve issue that causes samples to be defined twice in file (as you commented before, we need input from @jjgao on this topic)
data expression median: apparently there are some columns with the literal value 'null'. Seems like a bug.
seg hg19 data: here we also need some input from @jjgao on how to solve the issue where start position = end position
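
For the NUMBER column items above, a check along these lines would surface values such as "2+" (and a literal 'null') while skipping the bracketed firehose placeholders. This is a sketch, not the validator's logic: the function name is hypothetical, and treating bracketed values like [Not Available] as missing is an assumption.

```python
import re

# Firehose placeholders such as [Not Available] or [Not Applicable].
PLACEHOLDER = re.compile(r"^\[[^\]]+\]$")


def non_numeric_values(values):
    """Return the values in a NUMBER column that are neither numeric nor placeholders."""
    offending = []
    for raw in values:
        value = raw.strip()
        if value == "" or PLACEHOLDER.match(value):
            continue  # treated as missing in this sketch
        try:
            float(value)
        except ValueError:
            offending.append(value)
    return offending


ihc_score = ["[Not Available]", "[Not Available]", "[Not Available]", "2+", "[Not Available]"]
bad_values = non_numeric_values(ihc_score)
```

Whether such a column should then become STRING or have its values cleaned is still the open question raised above.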

jjgao commented 8 years ago

@pieterlukasse @aderidder:

Could you give me an example of the duplicated samples in the clinical data?
About the seg data, how frequently do single-position segs happen?

pieterlukasse commented 8 years ago

Hi @jjgao : you can use the test data shared by @zheins : I will send you the link

jjgao commented 8 years ago

@pieterlukasse:

The duplicated samples issue is tricky. We definitely don't want to put the vial number into the sample id, but a sample with different vial numbers may have different values for certain attributes, e.g. time of collection. We almost need a vial concept in our data model. But let's not make it too complicated for now. Maybe we could just pick the first vial for now?
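
A minimal sketch of the "pick the first vial" idea, under assumptions: "first" is taken to mean the first row encountered in file order, the function names are hypothetical, and the vial letters and attribute values in the example are invented (only the sample id TCGA-A7-A13G-01 comes from this thread).

```python
def sample_id(barcode):
    """Truncate a TCGA barcode to sample level, dropping the vial letter."""
    parts = barcode.split("-")
    if len(parts) < 4:
        raise ValueError(f"Barcode has no sample field: {barcode}")
    sample_type = parts[3][:2]  # e.g. '01A' -> '01'
    return "-".join(parts[:3] + [sample_type])


def keep_first_vial(rows):
    """Keep the first row per sample id; return kept rows and the dropped barcodes."""
    kept = {}
    dropped = []
    for barcode, attributes in rows:
        sid = sample_id(barcode)
        if sid in kept:
            dropped.append(barcode)  # could be reported as a warning
        else:
            kept[sid] = attributes
    return kept, dropped


rows = [
    ("TCGA-A7-A13G-01A", {"EXAMPLE_ATTRIBUTE": "a"}),
    ("TCGA-A7-A13G-01B", {"EXAMPLE_ATTRIBUTE": "b"}),
]
kept, dropped = keep_first_vial(rows)
```

Returning the dropped barcodes makes it easy to log which vials were ignored.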

If the single position segs only happen once or very few times, I would not bother to deal with them.

pieterlukasse commented 8 years ago

@jjgao : OK, we will implement your suggestions and just add a warning in both the validator and loader. Maybe for the duplicated samples it would be good to log a separate issue so that it gets solved at some point?

pieterlukasse commented 8 years ago

@jjgao : one more small issue: in RPPA data we have a few lines that start with NA|... . The loader currently interprets these as being the gene alias NA (aka XK, entrez 7504), but we suspect that in this case it actually means Not Applicable or Not Available. If you check the rppa data in the last BRCA download shared today you will see what I mean.

jjgao commented 8 years ago

@pieterlukasse: you are right. It was a firehose error. Their gene list is not complete and therefore could not map some of the genes. Could you give a warning in this case?
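
Such a warning could be sketched as below. The assumptions are that the RPPA identifier has the form `<gene symbols>|<antibody>` with gene symbols separated by whitespace, and that a gene part of NA should never be read as the XK alias; the function name and the antibody label are made up.

```python
def parse_rppa_id(composite_id):
    """Split an RPPA identifier into genes and antibody, warning on an 'NA' gene part."""
    gene_part, _, antibody = composite_id.partition("|")
    genes = gene_part.split()
    warnings = []
    if "NA" in genes:
        warnings.append(
            f"'{composite_id}': gene 'NA' is treated as not available, not as a gene alias"
        )
        genes = [gene for gene in genes if gene != "NA"]
    return genes, antibody, warnings


# 'example_antibody' is a placeholder label, not a value from the data.
genes, antibody, warnings = parse_rppa_id("NA|example_antibody")
```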

pieterlukasse commented 8 years ago

@jjgao sure, we will do that

pieterlukasse commented 8 years ago

@jjgao : have you logged this bug at firehose, or do you want us to do it?

jjgao commented 8 years ago

@pieterlukasse: I haven't. Please feel free to do that.

Staging files for TCGA according to new validation standards #839