cognoma / cancer-data

TCGA data acquisition and processing for Project Cognoma

Retain zero-mutation samples #44

Closed dhimmel closed 6 years ago

dhimmel commented 6 years ago

Refs https://github.com/cognoma/cancer-data/issues/43#issuecomment-380897122

Note that this may not retain every sample that was sequenced but has no mutations. However, it does retain all samples that we're aware of via Xena that have been sequenced.

dhimmel commented 6 years ago

A small code change, shown in the diff for scripts/2.TCGA-process.py, enables retaining samples that:

  1. have mutation calls (potentially silent mutations)
  2. but do not have red or blue mutations.

This increases the complete mutation matrix from 9093 to 9104 samples (+11) and the aligned mutation matrix from 8388 to 8397 (+9). So we're gaining 9 samples for Cognoma.
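A minimal sketch of the idea, not the actual change in scripts/2.TCGA-process.py: build the sample-by-gene matrix from the non-silent calls, then reindex it to every sample that has any mutation call, so that samples with only silent mutations become all-zero rows instead of disappearing. The column names (`sample_id`, `gene`, `effect`) and the toy data are assumptions for illustration.

```python
import pandas as pd

# Toy mutation calls; column names and values are hypothetical.
calls_df = pd.DataFrame({
    "sample_id": ["S1", "S1", "S2", "S3"],
    "gene": ["TP53", "KRAS", "TP53", "BRCA1"],
    "effect": ["missense", "silent", "nonsense", "silent"],
})

# Every sample with at least one call of any kind counts as sequenced.
sequenced_samples = sorted(calls_df["sample_id"].unique())

# Keep only non-silent calls when building the binary sample-by-gene matrix.
nonsilent_df = calls_df[calls_df["effect"] != "silent"]
mutation_mat_df = pd.crosstab(
    nonsilent_df["sample_id"], nonsilent_df["gene"]
).clip(upper=1)

# Reindex so samples with only silent calls are retained as all-zero rows.
mutation_mat_df = mutation_mat_df.reindex(sequenced_samples, fill_value=0)
```

Here `S3` has only a silent call, so it would be dropped by the crosstab alone; the reindex step is what keeps it.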

You can see specific samples added in the diff for data/samples.tsv. None are ovarian cancer.

dhimmel commented 6 years ago

In https://github.com/cognoma/cancer-data/pull/44/commits/9f9f675a328543f7cbfe2ea699fe23b979455b0e I added a diseases.tsv file under data with summary information for each cancer. It's useful for tracking sample numbers by cancer type for the various datasets.
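As an illustration of what such a summary could contain (the real columns of diseases.tsv may differ), per-cancer sample counts can be tallied with a groupby. The `samples_df` frame and all of its columns are hypothetical.

```python
import pandas as pd

# Hypothetical sample table: one row per sample, with its cancer acronym
# and 0/1 flags for which datasets include it.
samples_df = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3", "S4"],
    "acronym": ["BRCA", "BRCA", "SKCM", "SKCM"],
    "has_mutation": [1, 1, 1, 0],
    "has_expression": [1, 0, 1, 1],
})

# Count samples per cancer type, overall and per dataset.
diseases_df = samples_df.groupby("acronym").agg(
    n_samples=("sample_id", "size"),
    n_mutation=("has_mutation", "sum"),
    n_expression=("has_expression", "sum"),
).reset_index()
```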

dhimmel commented 6 years ago

> are there any samples in either mutation or gene expression data that are not in the clinical data?

I used the following code:

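```python
# Samples in both the mutation and expression matrices but not the clinical matrix.
shared_index = gene_mutation_mat_df.index.intersection(expr_df.index)
samples_missing_clinical = sorted(shared_index.difference(clinmat_df.index))
for sample_id in samples_missing_clinical:
    print(sample_id)
len(samples_missing_clinical)
```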

Turns out there are 389 missing samples:

mutation or expression samples missing clinical data

```
TCGA-28-2510-01 TCGA-2G-AAKO-01 TCGA-2G-AALF-01 TCGA-2G-AALG-01 TCGA-2G-AALN-01 TCGA-2G-AALO-01 TCGA-2G-AALQ-01 TCGA-2G-AALR-01 TCGA-2G-AALS-01 TCGA-2G-AALT-01 TCGA-2G-AALW-01 TCGA-2G-AALX-01 TCGA-2G-AALY-01 TCGA-2G-AALZ-01 TCGA-2G-AAM2-01 TCGA-2G-AAM3-01 TCGA-2G-AAM4-01 TCGA-3N-A9WB-06 TCGA-3N-A9WC-06 TCGA-3N-A9WD-06 TCGA-5M-AAT5-01 TCGA-5M-AATA-01 TCGA-AR-A0U1-01 TCGA-BF-AAP0-06 TCGA-BH-A0HN-01 TCGA-C4-A0EZ-01 TCGA-C4-A0F1-01 TCGA-C4-A0F7-01 TCGA-D3-A1Q1-06 TCGA-D3-A1Q3-06 TCGA-D3-A1Q4-06 TCGA-D3-A1Q5-06 TCGA-D3-A1Q6-06 TCGA-D3-A1Q7-06 TCGA-D3-A1Q8-06 TCGA-D3-A1Q9-06 TCGA-D3-A1QA-06 TCGA-D3-A1QB-06 TCGA-D3-A2J6-06 TCGA-D3-A2J7-06 TCGA-D3-A2J8-06 TCGA-D3-A2J9-06 TCGA-D3-A2JA-06 TCGA-D3-A2JB-06 TCGA-D3-A2JC-06 TCGA-D3-A2JD-06 TCGA-D3-A2JE-06 TCGA-D3-A2JF-06 TCGA-D3-A2JG-06 TCGA-D3-A2JH-06 TCGA-D3-A2JK-06 TCGA-D3-A2JL-06 TCGA-D3-A2JN-06 TCGA-D3-A2JO-06 TCGA-D3-A2JP-06 TCGA-D3-A3BZ-06 TCGA-D3-A3C1-06 TCGA-D3-A3C3-06 TCGA-D3-A3C6-06 TCGA-D3-A3C7-06 TCGA-D3-A3C8-06 TCGA-D3-A3CB-06 TCGA-D3-A3CC-06 TCGA-D3-A3CE-06 TCGA-D3-A3CF-06 TCGA-D3-A3ML-06 TCGA-D3-A3MO-06 TCGA-D3-A3MR-06 TCGA-D3-A3MU-06 TCGA-D3-A3MV-06 TCGA-D3-A51E-06 TCGA-D3-A51F-06 TCGA-D3-A51G-06 TCGA-D3-A51H-06 TCGA-D3-A51J-06 TCGA-D3-A51K-06 TCGA-D3-A51N-06 TCGA-D3-A51R-06 TCGA-D3-A51T-06 TCGA-D3-A5GL-06 TCGA-D3-A5GN-06 TCGA-D3-A5GO-06 TCGA-D3-A5GR-06 TCGA-D3-A5GS-06 TCGA-D3-A5GU-06 TCGA-D3-A8GB-06 TCGA-D3-A8GC-06 TCGA-D3-A8GD-06 TCGA-D3-A8GE-06 TCGA-D3-A8GI-06 TCGA-D3-A8GJ-06 TCGA-D3-A8GK-06 TCGA-D3-A8GL-06 TCGA-D3-A8GM-06 TCGA-D3-A8GN-06 TCGA-D3-A8GO-06 TCGA-D3-A8GP-06 TCGA-D3-A8GQ-06 TCGA-D3-A8GR-06 TCGA-D3-A8GS-06 TCGA-D3-A8GV-06 TCGA-D9-A148-06 TCGA-D9-A149-06 TCGA-D9-A1JW-06 TCGA-D9-A1JX-06 TCGA-D9-A1X3-06 TCGA-D9-A3Z1-06 TCGA-D9-A3Z3-06 TCGA-D9-A4Z6-06 TCGA-D9-A6E9-06 TCGA-D9-A6EA-06 TCGA-D9-A6EC-06 TCGA-D9-A6EG-06 TCGA-DA-A1HV-06 TCGA-DA-A1HW-06 TCGA-DA-A1HY-06 TCGA-DA-A1I0-06 TCGA-DA-A1I1-06 TCGA-DA-A1I2-06 TCGA-DA-A1I4-06 TCGA-DA-A1I5-06
TCGA-DA-A1I7-06 TCGA-DA-A1I8-06 TCGA-DA-A1IA-06 TCGA-DA-A1IB-06 TCGA-DA-A1IC-06 TCGA-DA-A3F3-06 TCGA-DA-A3F5-06 TCGA-DA-A3F8-06 TCGA-DA-A95V-06 TCGA-DA-A95W-06 TCGA-DA-A95X-06 TCGA-DA-A95Y-06 TCGA-DA-A95Z-06 TCGA-DD-A116-01 TCGA-EB-A44Q-06 TCGA-EB-A44R-06 TCGA-EB-A5KH-06 TCGA-EB-A5SG-06 TCGA-EB-A5SH-06 TCGA-EB-A5UL-06 TCGA-EB-A5UN-06 TCGA-EB-A6L9-06 TCGA-EE-A17X-06 TCGA-EE-A17Y-06 TCGA-EE-A17Z-06 TCGA-EE-A180-06 TCGA-EE-A181-06 TCGA-EE-A182-06 TCGA-EE-A183-06 TCGA-EE-A184-06 TCGA-EE-A185-06 TCGA-EE-A20B-06 TCGA-EE-A20C-06 TCGA-EE-A20F-06 TCGA-EE-A20H-06 TCGA-EE-A20I-06 TCGA-EE-A29A-06 TCGA-EE-A29B-06 TCGA-EE-A29C-06 TCGA-EE-A29D-06 TCGA-EE-A29E-06 TCGA-EE-A29G-06 TCGA-EE-A29H-06 TCGA-EE-A29L-06 TCGA-EE-A29M-06 TCGA-EE-A29N-06 TCGA-EE-A29P-06 TCGA-EE-A29Q-06 TCGA-EE-A29R-06 TCGA-EE-A29S-06 TCGA-EE-A29T-06 TCGA-EE-A29V-06 TCGA-EE-A29W-06 TCGA-EE-A29X-06 TCGA-EE-A2A0-06 TCGA-EE-A2A1-06 TCGA-EE-A2A2-06 TCGA-EE-A2A5-06 TCGA-EE-A2A6-06 TCGA-EE-A2GB-06 TCGA-EE-A2GC-06 TCGA-EE-A2GD-06 TCGA-EE-A2GE-06 TCGA-EE-A2GH-06 TCGA-EE-A2GI-06 TCGA-EE-A2GJ-06 TCGA-EE-A2GK-06 TCGA-EE-A2GL-06 TCGA-EE-A2GM-06 TCGA-EE-A2GN-06 TCGA-EE-A2GO-06 TCGA-EE-A2GP-06 TCGA-EE-A2GR-06 TCGA-EE-A2GS-06 TCGA-EE-A2GT-06 TCGA-EE-A2GU-06 TCGA-EE-A2M5-06 TCGA-EE-A2M6-06 TCGA-EE-A2M7-06 TCGA-EE-A2M8-06 TCGA-EE-A2MC-06 TCGA-EE-A2MD-06 TCGA-EE-A2ME-06 TCGA-EE-A2MF-06 TCGA-EE-A2MG-06 TCGA-EE-A2MH-06 TCGA-EE-A2MI-06 TCGA-EE-A2MJ-06 TCGA-EE-A2MK-06 TCGA-EE-A2ML-06 TCGA-EE-A2MM-06 TCGA-EE-A2MN-06 TCGA-EE-A2MP-06 TCGA-EE-A2MQ-06 TCGA-EE-A2MR-06 TCGA-EE-A2MS-06 TCGA-EE-A2MT-06 TCGA-EE-A2MU-06 TCGA-EE-A3AA-06 TCGA-EE-A3AB-06 TCGA-EE-A3AC-06 TCGA-EE-A3AD-06 TCGA-EE-A3AE-06 TCGA-EE-A3AF-06 TCGA-EE-A3AG-06 TCGA-EE-A3AH-06 TCGA-EE-A3J3-06 TCGA-EE-A3J4-06 TCGA-EE-A3J5-06 TCGA-EE-A3J7-06 TCGA-EE-A3J8-06 TCGA-EE-A3JA-06 TCGA-EE-A3JB-06 TCGA-EE-A3JD-06 TCGA-EE-A3JE-06 TCGA-EE-A3JH-06 TCGA-EE-A3JI-06 TCGA-ER-A193-06 TCGA-ER-A195-06 TCGA-ER-A197-06 TCGA-ER-A198-06 TCGA-ER-A199-06 TCGA-ER-A19A-06 TCGA-ER-A19B-06 TCGA-ER-A19C-06
TCGA-ER-A19D-06 TCGA-ER-A19E-06 TCGA-ER-A19F-06 TCGA-ER-A19G-06 TCGA-ER-A19H-06 TCGA-ER-A19J-06 TCGA-ER-A19L-06 TCGA-ER-A19M-06 TCGA-ER-A19N-06 TCGA-ER-A19O-06 TCGA-ER-A19P-06 TCGA-ER-A19Q-06 TCGA-ER-A19S-06 TCGA-ER-A19W-06 TCGA-ER-A1A1-06 TCGA-ER-A2NC-06 TCGA-ER-A2ND-06 TCGA-ER-A2NE-06 TCGA-ER-A2NG-06 TCGA-ER-A2NH-06 TCGA-ER-A3ES-06 TCGA-ER-A3ET-06 TCGA-ER-A3EV-06 TCGA-ER-A3PL-06 TCGA-ER-A42K-06 TCGA-ER-A42L-06 TCGA-F5-6810-01 TCGA-FR-A3YN-06 TCGA-FR-A3YO-06 TCGA-FR-A44A-06 TCGA-FR-A69P-06 TCGA-FR-A729-06 TCGA-FR-A7U8-06 TCGA-FR-A7U9-06 TCGA-FR-A7UA-06 TCGA-FR-A8YC-06 TCGA-FR-A8YD-06 TCGA-FR-A8YE-06 TCGA-FS-A1YW-06 TCGA-FS-A1YX-06 TCGA-FS-A1YY-06 TCGA-FS-A1Z0-06 TCGA-FS-A1Z3-06 TCGA-FS-A1Z4-06 TCGA-FS-A1Z7-06 TCGA-FS-A1ZA-06 TCGA-FS-A1ZB-06 TCGA-FS-A1ZC-06 TCGA-FS-A1ZD-06 TCGA-FS-A1ZE-06 TCGA-FS-A1ZF-06 TCGA-FS-A1ZG-06 TCGA-FS-A1ZH-06 TCGA-FS-A1ZJ-06 TCGA-FS-A1ZK-06 TCGA-FS-A1ZM-06 TCGA-FS-A1ZP-06 TCGA-FS-A1ZQ-06 TCGA-FS-A1ZR-06 TCGA-FS-A1ZS-06 TCGA-FS-A1ZT-06 TCGA-FS-A1ZU-06 TCGA-FS-A1ZW-06 TCGA-FS-A1ZY-06 TCGA-FS-A1ZZ-06 TCGA-FS-A4F0-06 TCGA-FS-A4F4-06 TCGA-FS-A4F5-06 TCGA-FS-A4F8-06 TCGA-FS-A4F9-06 TCGA-FS-A4FB-06 TCGA-FS-A4FC-06 TCGA-FS-A4FD-06 TCGA-FW-A3I3-06 TCGA-FW-A3R5-06 TCGA-FW-A3TU-06 TCGA-FW-A3TV-06 TCGA-FW-A5DY-06 TCGA-GF-A3OT-06 TCGA-GF-A4EO-06 TCGA-GF-A6C8-06 TCGA-GF-A6C9-06 TCGA-GN-A262-06 TCGA-GN-A264-06 TCGA-GN-A265-06 TCGA-GN-A266-06 TCGA-GN-A267-06 TCGA-GN-A268-06 TCGA-GN-A26A-06 TCGA-GN-A26D-06 TCGA-GN-A4U3-06 TCGA-GN-A4U4-06 TCGA-GN-A4U7-06 TCGA-GN-A4U8-06 TCGA-GN-A4U9-06 TCGA-GN-A8LK-06 TCGA-GN-A8LL-06 TCGA-GN-A9SD-06 TCGA-HR-A2OG-06 TCGA-HR-A2OH-06 TCGA-LH-A9QB-06 TCGA-OD-A75X-06 TCGA-QB-A6FS-06 TCGA-QB-AA9O-06 TCGA-R8-A6YH-01 TCGA-RP-A690-06 TCGA-RP-A693-06 TCGA-RP-A694-06 TCGA-RP-A695-06 TCGA-RP-A6K9-06 TCGA-W3-A824-06 TCGA-W3-A825-06 TCGA-W3-A828-06 TCGA-W3-AA1O-06 TCGA-W3-AA1Q-06 TCGA-W3-AA1R-06 TCGA-W3-AA1V-06 TCGA-W3-AA1W-06 TCGA-W3-AA21-06 TCGA-WE-A8JZ-06 TCGA-WE-A8K1-06 TCGA-WE-A8K5-06 TCGA-WE-A8K6-06 TCGA-WE-A8ZM-06 TCGA-WE-A8ZN-06
TCGA-WE-A8ZO-06 TCGA-WE-A8ZQ-06 TCGA-WE-A8ZR-06 TCGA-WE-A8ZT-06 TCGA-WE-A8ZX-06 TCGA-WE-A8ZY-06 TCGA-WE-AA9Y-06 TCGA-WE-AAA0-06 TCGA-WE-AAA3-06 TCGA-WE-AAA4-06 TCGA-YD-A89C-06 TCGA-YD-A9TA-06 TCGA-YD-A9TB-06 TCGA-YG-AA3O-06 TCGA-YG-AA3P-06 TCGA-Z2-A8RT-06 TCGA-Z2-AA3S-06 TCGA-Z2-AA3V-06
```

> probably don't want to filter these samples either (we can infer its acronym.)

Hmm this seems like an upstream issue and I'd prefer an upstream fix rather than hacking it ourselves. I propose we merge this PR and deal with this issue subsequently.

gwaybio commented 6 years ago

Ah, this is interesting, and now that I think about it, totally expected.

If I am remembering correctly (haven't confirmed), the clinical data stores mostly 01 sample types. 01 refers to Primary Solid Tumor (dictionary here). So all of the 06 tumors (Metastatic) will be dropped, even though the clinical data should match on the patient ID rather than the sample ID.
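To check this quickly, the sample-type code can be read from the last field of the sample ID. This is only a sketch: the three IDs are copied from the list earlier in the thread, and the mapping covers just the two codes mentioned here.

```python
import pandas as pd

# A few sample IDs copied from the list earlier in this thread.
sample_ids = pd.Series(
    ["TCGA-28-2510-01", "TCGA-D3-A1Q1-06", "TCGA-EE-A17X-06"]
)

# The fourth hyphen-separated field holds the two-digit sample-type code.
sample_type_codes = sample_ids.str.split("-").str[3]

# Only the two codes discussed in this thread are mapped.
code_to_name = {"01": "Primary Solid Tumor", "06": "Metastatic"}
sample_types = sample_type_codes.map(code_to_name)

# Tally how many samples fall into each sample type.
type_counts = sample_types.value_counts()
```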

I am not sure where the upstream fix for this should live. Perhaps we should investigate sample-specific vs. patient-specific clinical data. After the first merge on sample_id, we could merge the mutation/gene expression calls on patient ID, retaining only patient-specific fields (age, acronym, etc.) for these samples.
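A rough sketch of the patient-level fallback, assuming the patient ID is the first 12 characters of the sample ID. The IDs, the `patient_df` table, and its columns are made up for illustration and are not taken from the real clinical matrix.

```python
import pandas as pd

# Hypothetical patient-level clinical table keyed by patient ID.
patient_df = pd.DataFrame({
    "patient_id": ["TCGA-XX-AAAA", "TCGA-YY-BBBB"],
    "acronym": ["SKCM", "BRCA"],
    "age": [61, 55],
})

# Hypothetical samples, including metastatic (06) ones.
samples_df = pd.DataFrame({
    "sample_id": ["TCGA-XX-AAAA-01", "TCGA-XX-AAAA-06", "TCGA-YY-BBBB-06"],
})

# Derive the patient ID by dropping the trailing sample-type field.
samples_df["patient_id"] = samples_df["sample_id"].str.slice(0, 12)

# A left merge keeps every sample and attaches patient-level fields only.
merged_df = samples_df.merge(patient_df, on="patient_id", how="left")
```

With this approach, a metastatic sample inherits its patient's age and acronym even when no row exists for its own sample ID.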

dhimmel commented 6 years ago

I think it's simpler to not include metastatic tumors as there are not that many, and they may break the independence-between-observations assumption of many classifiers (not sure if that really matters).

gwaybio commented 6 years ago

> I think it's simpler to not include metastatic tumors as there are not that many

Nearly all of the melanoma tumors (SKCM) are metastatic - these will be dropped if we go this route.

dhimmel commented 6 years ago

> Turns out there are 389 missing samples:

FYI, I think this is not true and instead results from us filtering by sample types earlier in the notebook:

https://github.com/cognoma/cancer-data/blob/93e4c53dc3d58df4cf52d1a40179d62ccbc0b985/scripts/2.TCGA-process.py#L165-L171

Hence I gave my comment above a :-1:.