cognoma / cancer-data

TCGA data acquisition and processing for Project Cognoma

Extract sample info from PANCAN_clinicalMatrix #20

Closed dhimmel closed 8 years ago

dhimmel commented 8 years ago

Keeps only samples with type equal to "Primary Tumor". This filters out multiple samples from the same patient, which could cause problems for machine learning because the observations would be dependent (discussed in #10). This filter reduced the number of samples with expression and mutation data from 7,705 to 7,306.

Closes #10: all variables that could help with sample selection or covariates, that are in PANCAN_clinicalMatrix, are extracted to data/samples.tsv.

Relies on documentation of PANCAN_clinicalMatrix variables provided by the Xena Browser team in #14.

Closes #17: only sample_ids with expression, mutation, and clinical data are output to data/.
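
The alignment described for #17 amounts to a set intersection across the three data types. A minimal sketch with placeholder IDs (the variable names and IDs are hypothetical and not taken from this repository's code):

```python
# Hypothetical sample IDs per data type; real IDs would come from the
# expression, mutation, and clinical tables.
expression_ids = {"S1", "S2", "S3", "S4"}
mutation_ids = {"S2", "S3", "S5"}
clinical_ids = {"S1", "S2", "S3", "S6"}

# Keep only samples present in all three data types.
shared_ids = sorted(expression_ids & mutation_ids & clinical_ids)
print(shared_ids)  # ['S2', 'S3']
```

Each output table would then be restricted to `shared_ids` before being written out.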

dhimmel commented 8 years ago

Here's what the head of data/samples.tsv looks like:

| sample_id | patient_id | sample_type | disease | organ_of_origin | gender | age_diagnosed | dead | days_survived | recurred | days_recurrence_free |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TCGA-02-0047-01 | TCGA-02-0047 | Primary Tumor | glioblastoma multiforme | Brain | Male | 78 | 1 | 448 | | |
| TCGA-02-0055-01 | TCGA-02-0055 | Primary Tumor | glioblastoma multiforme | Brain | Female | 62 | 1 | 76 | | |
| TCGA-02-2483-01 | TCGA-02-2483 | Primary Tumor | glioblastoma multiforme | Brain | Male | 43 | 0 | 466 | | |
| TCGA-02-2485-01 | TCGA-02-2485 | Primary Tumor | glioblastoma multiforme | Brain | Male | 53 | 0 | 470 | | |
| TCGA-02-2486-01 | TCGA-02-2486 | Primary Tumor | glioblastoma multiforme | Brain | Male | 64 | 0 | 493 | | |
| TCGA-04-1348-01 | TCGA-04-1348 | Primary Tumor | ovarian serous cystadenocarcinoma | Ovary | Female | 44 | 1 | 1483 | | |
| TCGA-04-1357-01 | TCGA-04-1357 | Primary Tumor | ovarian serous cystadenocarcinoma | Ovary | Female | 52 | | | | |
| TCGA-04-1362-01 | TCGA-04-1362 | Primary Tumor | ovarian serous cystadenocarcinoma | Ovary | Female | 59 | 1 | 1348 | | |
| TCGA-05-4244-01 | TCGA-05-4244 | Primary Tumor | lung adenocarcinoma | Lung | Male | 70 | | | | |
| TCGA-05-4249-01 | TCGA-05-4249 | Primary Tumor | lung adenocarcinoma | Lung | Male | 67 | 0 | 1523 | 0 | 1523 |
| TCGA-05-4250-01 | TCGA-05-4250 | Primary Tumor | lung adenocarcinoma | Lung | Female | 79 | 1 | 121 | | |
| TCGA-05-4382-01 | TCGA-05-4382 | Primary Tumor | lung adenocarcinoma | Lung | Male | 68 | 0 | 607 | 1 | 334 |

gwaybio commented 8 years ago

Unfortunately, restricting the data to Primary Tumor removes "Acute Myeloid Leukemia" from the dataset. AML is classified as: Primary Blood Derived Cancer - Peripheral Blood.

Besides this, the PR looks good to me.

As a general comment, while I think it is definitely good for the ML group to have a single dataset that everyone is working on, restricting it like this may not be the optimal solution. Eventually the data will need to be more fluid and subset on the fly depending on different rules which we will need to define later. (e.g. Unsupervised feature construction should not remove gene expression samples that don't have mutation status)

dhimmel commented 8 years ago

@gwaygenomics, amazing catch. Here's the unfiltered breakdown of disease versus sample_type (created using pandas.crosstab):

| disease | Additional - New Primary | Additional Metastatic | Metastatic | Primary Blood Derived Cancer - Peripheral Blood | Primary Tumor | Recurrent Tumor | Solid Tissue Normal |
| --- | --- | --- | --- | --- | --- | --- | --- |
| acute myeloid leukemia | 0 | 0 | 0 | 200 | 0 | 0 | 0 |
| adrenocortical cancer | 0 | 0 | 0 | 0 | 92 | 0 | 0 |
| bladder urothelial carcinoma | 0 | 0 | 1 | 0 | 412 | 0 | 23 |
| brain lower grade glioma | 0 | 0 | 0 | 0 | 516 | 14 | 0 |
| breast invasive carcinoma | 0 | 0 | 7 | 0 | 1101 | 0 | 133 |
| cervical & endocervical cancer | 0 | 0 | 2 | 0 | 308 | 0 | 3 |
| cholangiocarcinoma | 0 | 0 | 0 | 0 | 36 | 0 | 9 |
| colon adenocarcinoma | 0 | 0 | 1 | 0 | 461 | 1 | 85 |
| diffuse large B-cell lymphoma | 0 | 0 | 0 | 0 | 48 | 0 | 0 |
| esophageal carcinoma | 0 | 0 | 1 | 0 | 185 | 0 | 18 |
| glioblastoma multiforme | 0 | 0 | 0 | 0 | 594 | 13 | 6 |
| head & neck squamous cell carcinoma | 0 | 0 | 2 | 0 | 528 | 0 | 74 |
| kidney chromophobe | 0 | 0 | 0 | 0 | 66 | 0 | 25 |
| kidney clear cell carcinoma | 1 | 0 | 0 | 0 | 536 | 0 | 407 |
| kidney papillary cell carcinoma | 1 | 0 | 0 | 0 | 291 | 0 | 60 |
| liver hepatocellular carcinoma | 0 | 0 | 0 | 0 | 377 | 2 | 59 |
| lung adenocarcinoma | 0 | 0 | 0 | 0 | 520 | 2 | 120 |
| lung squamous cell carcinoma | 0 | 0 | 0 | 0 | 506 | 0 | 120 |
| mesothelioma | 0 | 0 | 0 | 0 | 87 | 0 | 0 |
| ovarian serous cystadenocarcinoma | 0 | 0 | 0 | 0 | 592 | 19 | 12 |
| pancreatic adenocarcinoma | 0 | 0 | 1 | 0 | 185 | 0 | 10 |
| pheochromocytoma & paraganglioma | 3 | 0 | 2 | 0 | 179 | 0 | 3 |
| prostate adenocarcinoma | 0 | 0 | 1 | 0 | 498 | 0 | 67 |
| rectum adenocarcinoma | 0 | 0 | 0 | 0 | 167 | 1 | 16 |
| sarcoma | 0 | 0 | 1 | 0 | 261 | 3 | 6 |
| skin cutaneous melanoma | 0 | 1 | 369 | 0 | 104 | 0 | 2 |
| stomach adenocarcinoma | 0 | 0 | 0 | 0 | 478 | 0 | 103 |
| testicular germ cell tumor | 6 | 0 | 0 | 0 | 150 | 0 | 0 |
| thymoma | 0 | 0 | 0 | 0 | 124 | 0 | 2 |
| thyroid carcinoma | 0 | 0 | 8 | 0 | 507 | 0 | 65 |
| uterine carcinosarcoma | 0 | 0 | 0 | 0 | 57 | 0 | 0 |
| uterine corpus endometrioid carcinoma | 0 | 0 | 0 | 0 | 547 | 1 | 47 |
| uveal melanoma | 0 | 0 | 0 | 0 | 80 | 0 | 0 |

@gwaygenomics any insight on what this Additional - New Primary type refers to?
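
For reference, a breakdown like the one in this comment can be produced with `pandas.crosstab`. This is a minimal sketch on toy rows; the DataFrame name and its contents are made up for illustration, and only the `disease` and `sample_type` column names come from data/samples.tsv:

```python
import pandas as pd

# Toy rows for illustration; real values would come from PANCAN_clinicalMatrix.
clinical_df = pd.DataFrame({
    "disease": [
        "acute myeloid leukemia",
        "acute myeloid leukemia",
        "thymoma",
        "thymoma",
        "thymoma",
    ],
    "sample_type": [
        "Primary Blood Derived Cancer - Peripheral Blood",
        "Primary Blood Derived Cancer - Peripheral Blood",
        "Primary Tumor",
        "Primary Tumor",
        "Solid Tissue Normal",
    ],
})

# Rows are diseases, columns are sample types, cells are sample counts.
counts = pd.crosstab(clinical_df["disease"], clinical_df["sample_type"])

# Diseases with zero "Primary Tumor" samples would disappear entirely
# under a filter that keeps only "Primary Tumor".
lost = counts.index[counts["Primary Tumor"] == 0].tolist()
print(lost)  # ['acute myeloid leukemia']
```

Checking for all-zero rows in the retained columns is a quick way to spot a disease that a sample_type filter would silently drop.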

dhimmel commented 8 years ago

Including Additional - New Primary causes multiple samples per patient, so scratch it.

dhimmel commented 8 years ago

In fb1ae155240c4383b227b86a1e20b33ce46e9fb1, samples whose type was "Primary Blood Derived Cancer - Peripheral Blood" were retained. This increased the number of samples from 10593 to 10793 while maintaining patient uniqueness.
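
A rough sketch of that kind of filter plus the uniqueness check, assuming a table with `patient_id` and `sample_type` columns. The IDs and the `keep_types` name are hypothetical, and this is not the code from the commit:

```python
import pandas as pd

# Sample types to retain (the two types discussed in this thread).
keep_types = {
    "Primary Tumor",
    "Primary Blood Derived Cancer - Peripheral Blood",
}

# Toy table; the IDs are invented for illustration.
sample_df = pd.DataFrame({
    "sample_id": ["A-01", "A-11", "B-03", "C-01", "C-02"],
    "patient_id": ["A", "A", "B", "C", "C"],
    "sample_type": [
        "Primary Tumor",
        "Solid Tissue Normal",
        "Primary Blood Derived Cancer - Peripheral Blood",
        "Primary Tumor",
        "Recurrent Tumor",
    ],
})

kept_df = sample_df[sample_df["sample_type"].isin(keep_types)]

# Fail loudly if any patient contributes more than one retained sample.
assert kept_df["patient_id"].is_unique
```

Adding a type such as "Additional - New Primary" to `keep_types` would trip the assertion whenever a patient also has a primary tumor sample, which matches the reason given above for excluding it.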

However, it seems that the 200 added "acute myeloid leukemia" samples were missing mutation or expression data, as no new samples were added to the datasets. @gwaygenomics does this sound right?

dhimmel commented 8 years ago

> As a general comment, while I think it is definitely good for the ML group to have a single dataset that everyone is working on, restricting it like this may not be the optimal solution. Eventually the data will need to be more fluid and subset on the fly depending on different rules which we will need to define later. (e.g. Unsupervised feature construction should not remove gene expression samples that don't have mutation status)

Yeah, saving unsubsetted datasets is always a good idea, but I've been holding out to avoid confusion and since we don't have a good LFS solution currently. I therefore advocate for adding these files on an as-needed basis.

In the meantime, here's the expression deficit we're dealing with. There are 9,283 unique patients in the expression dataset. Our filtering whittles down the number of unique patients/samples to 7,306. So for some applications the 2k additional samples may matter (especially if you desire non-carcinogenic expression signatures).

Still outstanding is how to filter the full dataset to one sample per patient (or whether this is even necessary) for unsupervised feature construction.

gwaybio commented 8 years ago

> However, it seems that 200 added "acute myeloid leukemia" samples were missing mutation or expression data, as no new samples were added to the datasets. @gwaygenomics does this sound right?

I have directly worked with this AML data before - not sure why it would be missing here. I think it may have to do with Xena version inconsistencies.

> There are 9,283 unique patients in the expression dataset.

For now...I haven't been advocating much for dataset refinement because I know it will be updated once the main TCGA papers are published and protected data are unembargoed. There's definitely lots of value in what we have been doing, but the scripts will need to be rerun once new (frozen) data are released.

Let's get this merged so people can begin playing with the cleaner clinical data matrix!

dhimmel commented 8 years ago

I uploaded the datasets from this release to figshare: https://doi.org/10.6084/m9.figshare.3487685.v4

gwaybio commented 8 years ago

It looks like the filename is samples.tsv - should it be samples.tsv.bz2?

dhimmel commented 8 years ago

@gwaygenomics, samples.tsv is not bzip2 compressed.

dhimmel commented 8 years ago

Regarding the lack of samples with "acute myeloid leukemia" in the aligned dataset: of the 200 "acute myeloid leukemia" samples, 173 had expression data but 0 had mutation data. According to @gwaygenomics:

> I have directly worked with this AML data before - not sure why it would be missing here. I think it may have to do with Xena version inconsistencies.

Therefore this issue sounds like #16, where a past PANCAN_mutation version contained mutations for additional samples. CCing @jingchunzhu in case the lack of mutation data for AML samples is an upstream bug.
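
A sketch of how per-disease coverage counts like these could be tallied, assuming sets of sample IDs for the expression and mutation datasets. All names, IDs, and counts below are invented for illustration:

```python
import pandas as pd

# Toy clinical table and ID sets; none of these values are real.
sample_df = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3", "S4"],
    "disease": [
        "acute myeloid leukemia",
        "acute myeloid leukemia",
        "thymoma",
        "thymoma",
    ],
})
expression_ids = {"S1", "S3", "S4"}
mutation_ids = {"S3", "S4"}

# Flag which samples appear in each dataset.
coverage = sample_df.assign(
    has_expression=sample_df["sample_id"].isin(expression_ids),
    has_mutation=sample_df["sample_id"].isin(mutation_ids),
)

# Summing booleans counts the samples with data, per disease.
summary = coverage.groupby("disease")[["has_expression", "has_mutation"]].sum()
print(summary)
```

A disease whose `has_mutation` count is 0 cannot contribute any samples to a dataset that requires both data types.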

jingchunzhu commented 8 years ago

I assume you are using the mutationVector format data, not the gene-level non-silent mutation data?

AML data has only been called on hg18, not on hg19 genomes. The single mutationVector dataset needs to be coherent in terms of the genomic positions, all on hg19. Typically it is also not a good idea to map SNVs between genomes. It is best to re-align reads and re-call mutations on a new genome.

dhimmel commented 8 years ago

@jingchunzhu, we're using the PANCAN_mutation dataset, which is variant level rather than gene level. The online documentation states:

> TCGA pan-cancer somatic mutation data compiled from all TCGA cohorts where hg19 mutation calls are available.

So it makes sense that AML samples are omitted because they haven't been called on hg19.