Extract sample info from PANCAN_clinicalMatrix

dhimmel commented 8 years ago

Keeps only samples with type equal to "Primary Tumor". This filters out multiple samples from the same patient, which could cause an issue for machine learning due to dependent observations (discussed in #10). This filter reduced the number of samples with expression and mutation data from 7,705 to 7,306.
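For readers following along, here is a rough sketch of this kind of filter. It is not the code from this PR: the toy rows and sample IDs are made up, and the column names are assumed to match those in data/samples.tsv.

```python
import pandas as pd

# Toy stand-in for the clinical matrix (hypothetical IDs, assumed column names).
clinical_df = pd.DataFrame({
    "sample_id": ["P1-01", "P1-02", "P2-01", "P2-11"],
    "patient_id": ["P1", "P1", "P2", "P2"],
    "sample_type": [
        "Primary Tumor",
        "Recurrent Tumor",
        "Primary Tumor",
        "Solid Tissue Normal",
    ],
})

# Before filtering, each patient contributes more than one sample.
n_dup_before = int(clinical_df["patient_id"].duplicated().sum())

# Keep only primary tumors, then re-check for repeated patients.
primary_df = clinical_df[clinical_df["sample_type"] == "Primary Tumor"]
n_dup_after = int(primary_df["patient_id"].duplicated().sum())
```

In this toy data the filter happens to leave one row per patient. On real data the duplicate count after filtering is worth checking rather than assuming.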

Closes #10: all variables in PANCAN_clinicalMatrix that could help with sample selection or serve as covariates are extracted to data/samples.tsv.

Relies on documentation of PANCAN_clinicalMatrix variables provided by the Xena Browser team in #14.

Closes #17: only sample_ids with expression, mutation, and clinical data are output to data/.
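The alignment step amounts to a three-way intersection of sample IDs. A minimal sketch, using hypothetical IDs and plain Python sets rather than the actual dataframes in this repository:

```python
# Hypothetical sample IDs present in each dataset.
expression_ids = {"S1", "S2", "S3"}
mutation_ids = {"S2", "S3", "S4"}
clinical_ids = {"S1", "S2", "S3", "S5"}

# Only samples present in all three datasets would be written to data/.
shared_ids = sorted(expression_ids & mutation_ids & clinical_ids)

# Samples dropped from the expression dataset by the alignment.
dropped_from_expression = sorted(expression_ids - set(shared_ids))
```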

dhimmel commented 8 years ago

Here's what the head of data/samples.tsv looks like:

sample_id	patient_id	sample_type	disease	organ_of_origin	gender	age_diagnosed	dead	days_survived	recurred	days_recurrence_free
TCGA-02-0047-01	TCGA-02-0047	Primary Tumor	glioblastoma multiforme	Brain	Male	78	1	448
TCGA-02-0055-01	TCGA-02-0055	Primary Tumor	glioblastoma multiforme	Brain	Female	62	1	76
TCGA-02-2483-01	TCGA-02-2483	Primary Tumor	glioblastoma multiforme	Brain	Male	43	0	466
TCGA-02-2485-01	TCGA-02-2485	Primary Tumor	glioblastoma multiforme	Brain	Male	53	0	470
TCGA-02-2486-01	TCGA-02-2486	Primary Tumor	glioblastoma multiforme	Brain	Male	64	0	493
TCGA-04-1348-01	TCGA-04-1348	Primary Tumor	ovarian serous cystadenocarcinoma	Ovary	Female	44	1	1483
TCGA-04-1357-01	TCGA-04-1357	Primary Tumor	ovarian serous cystadenocarcinoma	Ovary	Female	52
TCGA-04-1362-01	TCGA-04-1362	Primary Tumor	ovarian serous cystadenocarcinoma	Ovary	Female	59	1	1348
TCGA-05-4244-01	TCGA-05-4244	Primary Tumor	lung adenocarcinoma	Lung	Male	70
TCGA-05-4249-01	TCGA-05-4249	Primary Tumor	lung adenocarcinoma	Lung	Male	67	0	1523	0	1523
TCGA-05-4250-01	TCGA-05-4250	Primary Tumor	lung adenocarcinoma	Lung	Female	79	1	121
TCGA-05-4382-01	TCGA-05-4382	Primary Tumor	lung adenocarcinoma	Lung	Male	68	0	607	1	334

gwaybio commented 8 years ago

Unfortunately, restricting the data to Primary Tumor removes "Acute Myeloid Leukemia" from the dataset. AML is classified as: Primary Blood Derived Cancer - Peripheral Blood.

Besides this, the PR looks good to me.

As a general comment, while I think it is definitely good for the ML group to have a single dataset that everyone is working on, restricting it like this may not be the optimal solution. Eventually the data will need to be more fluid and subset on the fly depending on different rules which we will need to define later. (e.g. Unsupervised feature construction should not remove gene expression samples that don't have mutation status)

dhimmel commented 8 years ago

@gwaygenomics, amazing catch. Here's the unfiltered breakdown of disease versus sample_type (created using pandas.crosstab):

disease	Additional - New Primary	Additional Metastatic	Metastatic	Primary Blood Derived Cancer - Peripheral Blood	Primary Tumor	Recurrent Tumor	Solid Tissue Normal
acute myeloid leukemia	0	0	0	200	0	0	0
adrenocortical cancer	0	0	0	0	92	0	0
bladder urothelial carcinoma	0	0	1	0	412	0	23
brain lower grade glioma	0	0	0	0	516	14	0
breast invasive carcinoma	0	0	7	0	1101	0	133
cervical & endocervical cancer	0	0	2	0	308	0	3
cholangiocarcinoma	0	0	0	0	36	0	9
colon adenocarcinoma	0	0	1	0	461	1	85
diffuse large B-cell lymphoma	0	0	0	0	48	0	0
esophageal carcinoma	0	0	1	0	185	0	18
glioblastoma multiforme	0	0	0	0	594	13	6
head & neck squamous cell carcinoma	0	0	2	0	528	0	74
kidney chromophobe	0	0	0	0	66	0	25
kidney clear cell carcinoma	1	0	0	0	536	0	407
kidney papillary cell carcinoma	1	0	0	0	291	0	60
liver hepatocellular carcinoma	0	0	0	0	377	2	59
lung adenocarcinoma	0	0	0	0	520	2	120
lung squamous cell carcinoma	0	0	0	0	506	0	120
mesothelioma	0	0	0	0	87	0	0
ovarian serous cystadenocarcinoma	0	0	0	0	592	19	12
pancreatic adenocarcinoma	0	0	1	0	185	0	10
pheochromocytoma & paraganglioma	3	0	2	0	179	0	3
prostate adenocarcinoma	0	0	1	0	498	0	67
rectum adenocarcinoma	0	0	0	0	167	1	16
sarcoma	0	0	1	0	261	3	6
skin cutaneous melanoma	0	1	369	0	104	0	2
stomach adenocarcinoma	0	0	0	0	478	0	103
testicular germ cell tumor	6	0	0	0	150	0	0
thymoma	0	0	0	0	124	0	2
thyroid carcinoma	0	0	8	0	507	0	65
uterine carcinosarcoma	0	0	0	0	57	0	0
uterine corpus endometrioid carcinoma	0	0	0	0	547	1	47
uveal melanoma	0	0	0	0	80	0	0

@gwaygenomics any insight on what this Additional - New Primary type refers to?
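For anyone wanting to reproduce a breakdown like the one above, a minimal pandas.crosstab sketch follows. The four rows are toy data, and the column names are assumed to match data/samples.tsv.

```python
import pandas as pd

# Toy rows standing in for the unfiltered clinical matrix.
samples = pd.DataFrame({
    "disease": [
        "acute myeloid leukemia",
        "skin cutaneous melanoma",
        "skin cutaneous melanoma",
        "thymoma",
    ],
    "sample_type": [
        "Primary Blood Derived Cancer - Peripheral Blood",
        "Metastatic",
        "Primary Tumor",
        "Primary Tumor",
    ],
})

# Rows are diseases, columns are sample types, cells are sample counts.
counts = pd.crosstab(samples["disease"], samples["sample_type"])
```

Diseases with a zero in the "Primary Tumor" column are the ones a Primary Tumor only filter would drop entirely.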

dhimmel commented 8 years ago

Including Additional - New Primary causes multiple samples per patient, so scratch it.

dhimmel commented 8 years ago

In fb1ae155240c4383b227b86a1e20b33ce46e9fb1, samples whose type was "Primary Blood Derived Cancer - Peripheral Blood" were retained. This increased the number of samples from 10,593 to 10,793 while maintaining patient uniqueness.

However, it seems that the 200 added "acute myeloid leukemia" samples were missing mutation or expression data, as no new samples were added to the datasets. @gwaygenomics does this sound right?

dhimmel commented 8 years ago

As a general comment, while I think it is definitely good for the ML group to have a single dataset that everyone is working on, restricting it like this may not be the optimal solution. Eventually the data will need to be more fluid and subset on the fly depending on different rules which we will need to define later. (e.g. Unsupervised feature construction should not remove gene expression samples that don't have mutation status)

Yeah, saving unsubsetted datasets is always a good idea, but I've been holding off, both to avoid confusion and because we don't currently have a good LFS solution. I therefore advocate for adding these files on an as-needed basis.

In the meantime, here's the expression deficit we're dealing with. There are 9,283 unique patients in the expression dataset. Our filtering whittles down the number of unique patients/samples to 7,306. So for some applications the 2k additional samples may matter (especially if you desire non-carcinogenic expression signatures).

Still outstanding is how to filter the full dataset to one sample per patient (or whether this is even necessary) for unsupervised feature construction.

gwaybio commented 8 years ago

However, it seems that 200 added "acute myeloid leukemia" samples were missing mutation or expression data, as no new samples were added to the datasets. @gwaygenomics does this sound right?

I have directly worked with this AML data before - not sure why it would be missing here. I think it may have to do with Xena version inconsistencies.

There are 9,283 unique patients in the expression dataset.

For now...I haven't been advocating much for dataset refinement because I know it will be updated once the main TCGA papers are published and protected data are unembargoed. There's definitely lots of value in what we have been doing, but the scripts will need to be rerun once new (frozen) data are released.

Let's get this merged so people can begin playing with the cleaner clinical data matrix!

dhimmel commented 8 years ago

I uploaded the datasets from this release to figshare: https://doi.org/10.6084/m9.figshare.3487685.v4

gwaybio commented 8 years ago

It looks like the filename is samples.tsv - should it be samples.tsv.bz2?

dhimmel commented 8 years ago

@gwaygenomics, samples.tsv is not bzip2 compressed.

dhimmel commented 8 years ago

Regarding the lack of samples with "acute myeloid leukemia" in the aligned dataset: of the 200 "acute myeloid leukemia" samples, 173 had expression data and 0 had mutation data. According to @gwaygenomics:

I have directly worked with this AML data before - not sure why it would be missing here. I think it may have to do with Xena version inconsistencies.

Therefore this issue sounds like #16, where a past PANCAN_mutation version contained mutations for additional samples. CCing @jingchunzhu in case the lack of mutation data for AML samples is an upstream bug.
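A coverage check like the one above can be sketched as follows. The sample IDs are hypothetical, and the sets of IDs with expression or mutation data are assumed to have been collected beforehand from the respective datasets.

```python
import pandas as pd

# Toy clinical rows (hypothetical IDs, assumed column names).
samples = pd.DataFrame({
    "sample_id": ["A1", "A2", "A3", "B1"],
    "disease": [
        "acute myeloid leukemia",
        "acute myeloid leukemia",
        "acute myeloid leukemia",
        "thymoma",
    ],
})

# Hypothetical sets of sample IDs found in each dataset.
expression_ids = {"A1", "A2", "B1"}
mutation_ids = {"B1"}

# Restrict to one disease and flag which samples appear in each dataset.
aml = samples[samples["disease"] == "acute myeloid leukemia"]
coverage = pd.DataFrame({
    "has_expression": aml["sample_id"].isin(expression_ids),
    "has_mutation": aml["sample_id"].isin(mutation_ids),
})

# Summing booleans gives the number of covered samples per dataset.
summary = coverage.sum()
```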

jingchunzhu commented 8 years ago

I assume you are using the mutationVector format data, not the gene-level non-silent mutation data?

AML data has only been called on hg18, not on hg19 genomes. The single mutationVector dataset needs to be coherent in terms of the genomic positions, all on hg19. Typically it is also not a good idea to map SNVs between genomes. It is best to re-align reads and re-call mutations on a new genome.

dhimmel commented 8 years ago

@jingchunzhu, we're using the PANCAN_mutation dataset, which is variant level rather than gene level. The online documentation states:

TCGA pan-cancer somatic mutation data compiled from all TCGA cohorts where hg19 mutation calls are available.

So it makes sense that AML samples are omitted because they haven't been called on hg19.

cognoma / cancer-data

Extract sample info from PANCAN_clinicalMatrix #20