Closed dhimmel closed 6 years ago
Small code change as seen in diff for scripts/2.TCGA-process.py
enables retaining samples that:
This increases the complete mutation matrix to 9104 samples from 9093 and the aligned mutation matrix to 8397 from 8388. So we're gaining 9 samples for cognoma.
You can see the specific samples added in the diff for data/samples.tsv. None are ovarian cancer.
In https://github.com/cognoma/cancer-data/pull/44/commits/9f9f675a328543f7cbfe2ea699fe23b979455b0e I added a diseases.tsv file under data with summary information for each cancer. It's useful for tracking sample numbers by cancer type for the various datasets.
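A per-cancer summary of this kind could be tallied roughly as follows. This is only a sketch, not the code from the commit: the function name `count_samples_by_disease`, the toy sample IDs, and the dataset names are made up for illustration, and the real diseases.tsv may have different columns.

```python
import pandas as pd


def count_samples_by_disease(sample_acronyms, datasets):
    """Count samples per cancer acronym for each dataset.

    sample_acronyms: Series indexed by sample ID, with the acronym as value.
    datasets: dict mapping a dataset name to an iterable of sample IDs.
    """
    counts = {}
    for name, ids in datasets.items():
        # Keep only the samples for which an acronym is known
        present = sample_acronyms.index.intersection(pd.Index(list(ids)))
        counts[name] = sample_acronyms.loc[present].value_counts()
    # Cancers that are absent from a dataset get a count of 0
    return pd.DataFrame(counts).fillna(0).astype(int).sort_index()


# Hypothetical toy data
toy_acronyms = pd.Series(
    ["BRCA", "BRCA", "SKCM"],
    index=["sample-1", "sample-2", "sample-3"],
)
toy_datasets = {
    "mutation": ["sample-1", "sample-3"],
    "expression": ["sample-1", "sample-2"],
}
disease_counts = count_samples_by_disease(toy_acronyms, toy_datasets)
print(disease_counts)
```

Writing the result with `disease_counts.to_csv(path, sep="\t")` would give a TSV with one row per cancer type.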
Are there any samples in either the mutation or gene expression data that are not in the clinical data?
I used the following code:
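# Samples with both mutation and expression data but no clinical record.
# Index.intersection is used because `&` on pandas Index objects is no longer a set operation in recent pandas.
samples_missing_clinical = sorted(
    gene_mutation_mat_df.index.intersection(expr_df.index).difference(clinmat_df.index)
)
for sample_id in samples_missing_clinical:
    print(sample_id)
len(samples_missing_clinical)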
Turns out there are 389 missing samples:
We probably don't want to filter these samples either (we can infer their acronyms).
Hmm, this seems like an upstream issue, and I'd prefer an upstream fix rather than hacking it ourselves. I propose we merge this PR and deal with this issue subsequently.
Ah, this is interesting, and now that I think about it, totally expected.
If I am remembering correctly (haven't confirmed) the clinical data stores mostly 01 sample-types. 01 refers to Primary Solid Tumor (dictionary here). So, all of the 06 tumors (Metastatic) will be dropped! Even though the clinical data should match for the patient id instead of the sample id.
I am not sure where the upstream fix for this should live. Perhaps we should investigate sample-specific vs. patient-specific clinical data, and merge the mutation/gene expression calls on patient ID after the first merge on sample_id, while retaining only patient-specific identifiers (age, acronym, etc.) for these samples.
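A minimal sketch of the patient-level fallback described above, assuming 15-character TCGA sample IDs in which the first 12 characters identify the patient and characters 14-15 hold the sample-type code. The function name `attach_patient_clinical`, the column names, and the toy barcodes are hypothetical and are not taken from the notebook.

```python
import pandas as pd


def attach_patient_clinical(sample_ids, clinical_df, patient_columns):
    """Attach patient-level clinical fields to samples via the patient ID.

    clinical_df is assumed to have a `sample_id` column plus the
    patient-level columns listed in `patient_columns`.
    """
    samples = pd.DataFrame({"sample_id": list(sample_ids)})
    samples["patient_id"] = samples["sample_id"].str[:12]
    samples["sample_type"] = samples["sample_id"].str[13:15]
    # Reduce the clinical table to one row per patient
    patient_df = (
        clinical_df.assign(patient_id=clinical_df["sample_id"].str[:12])
        [["patient_id"] + list(patient_columns)]
        .drop_duplicates("patient_id")
    )
    # Left merge keeps every sample, even when no clinical row exists
    return samples.merge(patient_df, on="patient_id", how="left")


# Made-up barcodes: clinical rows exist only for the 01 (primary) samples
toy_clinical = pd.DataFrame({
    "sample_id": ["TCGA-AA-0001-01", "TCGA-BB-0002-01"],
    "acronym": ["SKCM", "BRCA"],
    "age_diagnosed": [61, 54],
})
toy_samples = ["TCGA-AA-0001-06", "TCGA-BB-0002-01", "TCGA-CC-0003-06"]
merged = attach_patient_clinical(toy_samples, toy_clinical, ["acronym", "age_diagnosed"])
print(merged)
```

In this toy example the metastatic sample TCGA-AA-0001-06 picks up its patient's acronym and age even though only the 01 sample has a clinical row, while TCGA-CC-0003-06 stays in the table with missing clinical values.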
I think it's simpler to not include metastatic tumors, as there are not that many and they may break the independence-between-observations assumption of many classifiers (not sure if that really matters).
I think it's simpler to not include metastatic tumors as there are not that many
Nearly all of the melanoma tumors (SKCM) are metastatic - these will be dropped if we go this route.
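One way to check how a sample-type filter would affect each cancer type is to tabulate sample-type codes per acronym. The sketch below assumes the same 15-character sample ID layout (sample-type code in characters 14-15); the function name `sample_type_table` and the barcodes are made up and do not reflect real counts.

```python
import pandas as pd


def sample_type_table(sample_acronyms):
    """Cross-tabulate cancer acronym against sample-type code.

    sample_acronyms: Series indexed by sample ID, with the acronym as value.
    """
    sample_types = sample_acronyms.index.str[13:15].to_numpy()
    return pd.crosstab(
        sample_acronyms.to_numpy(),
        sample_types,
        rownames=["acronym"],
        colnames=["sample_type"],
    )


# Hypothetical toy data, not real TCGA counts
toy_acronyms = pd.Series(
    ["SKCM", "SKCM", "SKCM", "BRCA"],
    index=[
        "TCGA-AA-0001-06",
        "TCGA-AA-0002-06",
        "TCGA-AA-0003-01",
        "TCGA-BB-0004-01",
    ],
)
type_table = sample_type_table(toy_acronyms)
print(type_table)
```

A row dominated by the 06 column would flag a cancer type that loses most of its samples if metastatic tumors are excluded.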
Turns out there are 389 missing samples:
FYI, I think this is not true and instead results from us filtering by sample types earlier in the notebook:
Hence I gave my comment above a :-1:.
Refs https://github.com/cognoma/cancer-data/issues/43#issuecomment-380897122
Note this may not retain all samples without mutations but with sequencing. However, it does retain all samples that we're aware of via Xena that have been sequenced.