Filtering samples is (potentially) too strict

gwaybio commented 6 years ago

Data is currently processed in https://github.com/cognoma/cancer-data/blob/master/2.TCGA-process.ipynb and the final matrices used in downstream analyses include samples that have mutation, expression, and clinical measurements and were not filtered for other reasons.

@kurtwheeler pointed out in cognoma/core-service#99 a potential issue that the current implementation is not finding samples it should. @dhimmel discovered that this was not an issue (at least not primarily an issue) of the backend, but of the data itself.

I outlined current problems with the data in https://github.com/cognoma/core-service/issues/99#issuecomment-380876551 but we can continue this discussion here.

dhimmel commented 6 years ago

So the issue is that:

tumors were filtered because they didn't have observed mutations

My thought now is that we remove tumors without any "red" mutations. Relaxing this guideline a bit would be an easy win and could recover many samples

I think it's a mistake to remove samples because they have no mutations, as long as we know those samples are cancers (and not normal tissues... which I think we do). On the other hand, it was presumably me who implemented this filter. So why would I have done something like that (which seems now to be throwing away good data)?

Samples without any mutations can never be positives in the Cognoma ML framework... but they are important negatives. The fact that you can get the cancer without a mutation is of course something that we should model and not ignore.

cgreene commented 6 years ago

What is a "red" mutation?

dhimmel commented 6 years ago

What is a "red" mutation?

A mutation Xena considers to be severe. https://github.com/cognoma/cancer-data/issues/2#issuecomment-233088248

From http://xena.ucsc.edu/how-we-characterize-mutations/

Red (=1) --> indicates that a non-silent somatic mutation (nonsense, missense, frame-shift indels, splice site mutations, stop codon readthroughs, change of start codon, inframe indels) was identified in the protein coding region of a gene, or any mutation identified in a non-coding gene

cgreene commented 6 years ago

Oh - that is extremely conservative. Point mutations don't make it in (basically all the activating Ras mutations are point mutations). Does cognoma actually work for Ras? We should at least include Red and Blue.

gwaybio commented 6 years ago

We do include both Red and Blue mutations - my mistake

cgreene commented 6 years ago

Are we absolutely sure of that? I would find it quite implausible that there are no more than 14 HGSCs with at least one missense mutation. 95% of them are TP53 mutated, right?

dhimmel commented 6 years ago

We do include both Red and Blue mutations - my mistake

The source code says:

https://github.com/cognoma/cancer-data/blob/383668e12a80ccbcc75a4930023aed16afbd208b/scripts/2.TCGA-process.py#L245-L261

cgreene commented 6 years ago

I think it's likely that someone's going to have to walk through this to verify that the mutation data is being used as intended. Just looking at TP53 alone, there look like there should be more than that: http://www.cbioportal.org/index.do?cancer_study_id=ov_tcga_pub&Z_SCORE_THRESHOLD=2.0&RPPA_SCORE_THRESHOLD=2.0&data_priority=0&case_set_id=ov_tcga_pub_cna_seq&gene_list=TP53&geneset_list=+&tab_index=tab_visualize&Action=Submit&genetic_profile_ids_PROFILE_MUTATION_EXTENDED=ov_tcga_pub_mutations&genetic_profile_ids_PROFILE_COPY_NUMBER_ALTERATION=ov_tcga_pub_gistic

dhimmel commented 6 years ago

Take a look at the source for constructing the mutation matrix:

https://github.com/cognoma/cancer-data/blob/383668e12a80ccbcc75a4930023aed16afbd208b/scripts/2.TCGA-process.py#L288-L295

So the reason we exclude samples with no mutations is that, unless a sample has at least one mutation, we don't actually know whether mutation calling was performed on it. mc3.v0.2.8.PUBLIC.xena.tsv.gz only contains mutations and does not record sequenced samples that have zero mutations.
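To illustrate the problem, here is a minimal sketch on toy data. The column names, sample IDs, and the `sequenced` list are assumptions for illustration, not the MC3 schema. A matrix pivoted from a mutation-only table cannot contain a zero-mutation sample unless a separate list of sequenced samples is available:

```python
import pandas as pd

# Hypothetical long-format mutation calls: one row per observed mutation.
calls = pd.DataFrame({
    "sample": ["S1", "S1", "S2"],
    "gene": ["TP53", "KRAS", "TP53"],
})

# Pivoting only produces rows for samples that appear in the calls.
matrix = pd.crosstab(calls["sample"], calls["gene"]).clip(upper=1)

# Hypothetical list of every sample that was actually sequenced.
sequenced = ["S1", "S2", "S3"]

# With such a list, zero-mutation samples come back as all-zero rows.
full_matrix = matrix.reindex(sequenced, fill_value=0)
print(full_matrix)
```

Without something like `sequenced`, S3 is indistinguishable from a sample that was never sequenced.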

@gwaygenomics do you know a workaround?

dhimmel commented 6 years ago

According to 2.TCGA-process.ipynb the mutations that are excluded (exclusion by omission of inclusion) are the following types:

{"3'Flank", "3'UTR", "5'Flank", "5'UTR", 'Intron', 'Silent', 'large deletion'}
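A rough sketch of how this kind of effect filter behaves, on toy data. The column names and the effect labels for the retained rows are illustrative assumptions; only the excluded set is taken from the notebook output above:

```python
import pandas as pd

# Effect types reported above as excluded.
EXCLUDED_EFFECTS = {"3'Flank", "3'UTR", "5'Flank", "5'UTR",
                    "Intron", "Silent", "large deletion"}

# Hypothetical mutation calls, one row per mutation.
calls = pd.DataFrame({
    "sample": ["S1", "S1", "S2", "S3"],
    "gene": ["TP53", "KRAS", "TP53", "BRCA1"],
    "effect": ["Missense_Mutation", "Silent", "Nonsense_Mutation", "Intron"],
})

# Keep only mutations whose effect is not in the excluded set.
kept = calls[~calls["effect"].isin(EXCLUDED_EFFECTS)]

# Samples whose every mutation was excluded vanish from the result.
dropped_samples = set(calls["sample"]) - set(kept["sample"])
print(kept)
print(dropped_samples)
```

Here S3 drops out entirely because its only mutation is intronic, which is the mechanism by which effect filtering can also remove samples.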

Will chat with @gwaygenomics about the cBioPortal discrepancy.

gwaybio commented 6 years ago

just chatted with @dhimmel

Just looking at TP53 alone, there look like there should be more than that:

We agree - definitely hinting at something being up. We also noticed the addition of a precompiled binary matrix file. This appears to be a new addition to Xena. Need to explore further, but this may save us from needing to process the data ourselves

gwaybio commented 6 years ago

Ok - in this binary matrix from Xena, they do rescue many OV samples.

import pandas as pd

xena_binary = pd.read_table('mc3.v0.2.8.PUBLIC.nonsilentGene.xena.gz', sep='\t', index_col=0)

# This clinical matrix as processed in https://github.com/cognoma/core-service/issues/99#issuecomment-380876551
ov_samples = clinmat_df.query("acronym == 'OV'").index.tolist()
ov_xena_df = xena_binary.loc[:, ov_samples].dropna(axis='columns')
ov_xena_df.shape

(40543, 62)

And, as a test, the TP53 counts look on target:

ov_xena_df.loc['TP53', :].value_counts()

1    54
0     8
Name: TP53, dtype: int64

I presume that this will rescue many other samples from other cancer-types as well.

cgreene commented 6 years ago

Still only 62 samples that make it through? That still seems incredibly low. This means - if I understand correctly - that what we are saying is that there are hundreds of ovarian cancers with no mutations in the blue and red category. Am I understanding this correctly?

cgreene commented 6 years ago

Oh - wait - as I'm thinking about it - are these the ovarian cancer samples that were subject to whole genome amplification and thus where we think the calls may be problematic? I think there was a paper on this. Are the dropouts for other cancers as bad?

@gwaygenomics : does this match the dataset used in the TP53 classifier paper?

gwaybio commented 6 years ago

are these the ovarian cancer samples that were subject to whole genome amplification and thus where we think the calls may be problematic?

Yeah, I think this is part of the reason why they're filtered (quite stringently) here.

Are the dropouts for other cancers as bad?

I will have to check exact numbers when I'm back at my desk, but I do think it impacted other cancer-types. Although I think OV will end up being the most drastic.

@gwaygenomics : does this match the dataset used in the TP53 classifier paper?

We dropped OV from training because of the TP53 status imbalance, but we were still able to make predictions on the full gene expression dataset. See Figure S6 of that paper. Our predictions align with the cBioPortal link posted previously in this thread!

gwaybio commented 6 years ago

After thinking for a bit, I think it may be best for cognoma to use the binary matrix compiled by Xena and get the intersection of datasets (as we had been doing previously). This is simpler, reduces processing requirements, and contains high-confidence calls. We will also need to emphasize where the data is coming from and how it's processed on the cognoma homepage, and also return downloading scripts when the classifier is emailed back to the user.

The alternative would be to include less confident calls as mutation events, which, if I am remembering correctly, we did in Figure S6. This is a legitimate option since it retains more samples, and there is some (although less confident) evidence the mutations are real in the sample. As @dhimmel pointed out, it would be better to throw these samples out (taking the intersection of datasets) than to assume they have zero mutations.

dhimmel commented 6 years ago

Just looking at TP53 alone, there look like there should be more than that

For ovarian cancer and TP53, there are 11 positives and 3 negatives in the aligned dataset (gene expression and mutation data). However, in the complete data, there are 54 positives and 8 negatives. So I think the issue here is that many ovarian cancer samples are missing from the expression dataset.

Note that we make complete data available, but it doesn't help with cognoma classifiers.
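For readers unfamiliar with the aligned vs. complete distinction, a minimal sketch on toy data (sample IDs and values are made up): the aligned dataset keeps only samples present in both the mutation matrix and the expression matrix, so mutation samples without expression data drop out.

```python
import pandas as pd

# Hypothetical sample IDs that have expression measurements.
expression_samples = pd.Index(["S1", "S2", "S4"])

# Hypothetical complete mutation matrix (samples x genes).
mutation = pd.DataFrame(
    {"TP53": [1, 1, 0, 1]},
    index=["S1", "S2", "S3", "S5"],
)

# Aligned data: intersection of mutation and expression samples.
shared = mutation.index.intersection(expression_samples)
aligned = mutation.loc[shared]

complete_counts = mutation["TP53"].value_counts()
aligned_counts = aligned["TP53"].value_counts()
print(complete_counts)
print(aligned_counts)
```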

I think it may be best for cognoma to use the binary matrix compiled by xena and get the intersection of datasets

One issue IIRC with the binary matrix is that it requires us to map symbols to Entrez gene IDs without chromosome information, which reduces our ability to map.
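A sketch of why missing chromosome information can hurt the mapping. The gene symbols, chromosomes, and IDs below are entirely hypothetical, and this is not the mapping code used in cancer-data:

```python
import pandas as pd

# Hypothetical lookup table in which one symbol is ambiguous.
gene_map = pd.DataFrame({
    "symbol": ["GENE_A", "GENE_B", "GENE_B"],
    "chromosome": ["1", "2", "X"],
    "entrez_gene_id": [101, 202, 303],
})

# With symbols alone, only symbols with exactly one ID map safely.
ids_per_symbol = gene_map.groupby("symbol")["entrez_gene_id"].nunique()
unambiguous = ids_per_symbol[ids_per_symbol == 1].index
symbol_only = gene_map[gene_map["symbol"].isin(unambiguous)]

# With chromosome as a second key, the ambiguous symbol resolves.
query = pd.DataFrame({"symbol": ["GENE_B"], "chromosome": ["X"]})
resolved = query.merge(gene_map, on=["symbol", "chromosome"], how="left")
print(symbol_only)
print(resolved)
```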

dhimmel commented 6 years ago

I think it's a mistake to remove samples because they have no mutations

The Xena matrix mc3.v0.2.8.PUBLIC.nonsilentGene.xena.gz contains 9,104 samples with mutation calls. A quick summary of the number of mutations per sample is below:

>>> import pandas
>>> url = 'https://pancanatlas.xenahubs.net/download/mc3.v0.2.8.PUBLIC.nonsilentGene.xena.gz'
>>> df = pandas.read_table(url, index_col='sample')
>>> df.sum(axis='rows').describe()
count    9104.000000
mean      171.219025
std       519.085213
min         0.000000
25%        26.000000
50%        55.000000
75%       126.000000
max      8354.000000

Hence, some samples have zero mutations in this dataset.

According to 2.TCGA-process.ipynb, we identify 9104 samples in mc3.v0.2.8.PUBLIC.xena.tsv.gz. I'm preparing a pull request to keep samples with no blue or red mutations in our mutation dataset. This will increase the complete mutation matrix to 9104 samples from 9093 previously. Thus the impact is small, but worth fixing I think.
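As a small sketch of how the zero-mutation samples could be listed in a gene-by-sample binary matrix of this shape (toy values, not the Xena file):

```python
import pandas as pd

# Hypothetical binary matrix: rows are genes, columns are samples.
df = pd.DataFrame(
    {"S1": [1, 0], "S2": [0, 0], "S3": [1, 1]},
    index=["TP53", "KRAS"],
)

# Summing over rows gives the number of mutated genes per sample.
per_sample = df.sum(axis="index")

# Samples present in the matrix but with no mutated gene.
zero_mutation_samples = per_sample[per_sample == 0].index.tolist()
print(zero_mutation_samples)
```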

gwaybio commented 6 years ago

For ovarian cancer and TP53, there are 11 positives and 3 negatives in the aligned dataset (gene expression and mutation data). However, in the complete data, there are 54 positives and 8 negatives. So I think the issue here is that many ovarian cancer samples are missing from the expression dataset.

Is this based on the previous expression dataset processing? If I remember correctly, we were removing samples with NA values and many ovarian samples had some.

dhimmel commented 6 years ago

Is this based on the previous expression dataset processing?

No, on the current. See the latest diseases.tsv: n_samples = 14 and n_mutation_samples = 62 for ovarian serous cystadenocarcinoma.

gwaybio commented 6 years ago

Alright, so after thinking some more about this (and based on input from @cgreene ) we should decide to process mutation data for cognoma based on what we think is the right answer, plus what is maintainable. Our options, as far as I see them (from least to most conservative) are:

1. All public (non-germline) MC3 mutation calls without filtration. The data is posted in the GDC.
2. All public (non-germline) MC3 mutation calls with Xena "red" and "blue" filtration. When creating the binary matrix, a sample x gene pair is considered mutated if there is any evidence of a "deleterious mutation".
   - This permits some estimation of OV and LAML mutation calls. Without this step, nearly all OV and all LAML samples are removed.
3. All public (non-germline) MC3 mutation calls with the Xena-applied "PASS" filter. This data is posted as a binary mutation matrix in Xena.
4. All public (non-germline) MC3 mutation calls with the "PASS" filter and the "red" and "blue" filter. This is how cognoma is currently filtering.
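The four filtering levels listed above can be sketched as boolean masks over a long-format call table. This is a toy illustration: the `filter` and `nonsilent` columns, the non-PASS flag value, and the sample IDs are assumptions, not the actual MC3 or Xena schema.

```python
import pandas as pd

# Hypothetical calls: one TP53 mutation per sample.
calls = pd.DataFrame({
    "sample": ["S1", "S2", "S3", "S4"],
    "gene": ["TP53", "TP53", "TP53", "TP53"],
    "filter": ["PASS", "wga", "PASS", "wga"],   # assumed quality flag
    "nonsilent": [True, True, False, False],    # stands in for "red"/"blue"
})

passed = calls["filter"] == "PASS"
nonsilent = calls["nonsilent"]

# One subset per filtering level described in the list.
options = {
    "unfiltered": calls,
    "red_blue": calls[nonsilent],
    "pass": calls[passed],
    "pass_and_red_blue": calls[passed & nonsilent],
}

# Number of samples with at least one retained call under each option.
n_samples = {name: subset["sample"].nunique()
             for name, subset in options.items()}
print(n_samples)
```

Each stricter option retains fewer samples with a positive call, which is the trade-off between sample retention and call confidence discussed in this thread.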

After chatting with @dhimmel - all are valid options (will depend on project hypotheses and input genes to be classified) but many will require additional maintenance overhead. We are not tied to Xena data, but removing this dependency will require substantial additional processing.

Since it is certainly valid to retain our current mechanism of creating a high confidence true positive binary matrix, and it requires the least amount of maintenance, I think we agreed to keep cognoma data this way for now. @dhimmel, is this an accurate description?

dhimmel commented 6 years ago

Thanks @gwaygenomics for breaking down the four filtering options. I think it's helpful to understand what processing steps have gone into our mutation matrix. While different use cases will prefer different levels of processing, I think our current implementation of high-confidence calls (pass filter) with probable effects (red or blue) is safe and versatile. By versatile, I mean suited to many downstream applications including Cognoma classifiers. By safe, I mean likely to avoid certain false conclusions, like associations with low-quality calls or silent mutations.

These decisions can always be revisited, but without clear evidence and demand from a downstream analysis to do so, I think our time is better spent elsewhere. Thus, I'll close this issue, and we can discuss the potential inclusion of metastases in https://github.com/cognoma/cancer-data/issues/46.

cognoma / cancer-data

Filtering samples is (potentially) too strict #43