Closed dhimmel closed 8 years ago
Here's what the head of data/samples.tsv looks like:
sample_id | patient_id | sample_type | disease | organ_of_origin | gender | age_diagnosed | dead | days_survived | recurred | days_recurrence_free |
---|---|---|---|---|---|---|---|---|---|---|
TCGA-02-0047-01 | TCGA-02-0047 | Primary Tumor | glioblastoma multiforme | Brain | Male | 78 | 1 | 448 | | |
TCGA-02-0055-01 | TCGA-02-0055 | Primary Tumor | glioblastoma multiforme | Brain | Female | 62 | 1 | 76 | | |
TCGA-02-2483-01 | TCGA-02-2483 | Primary Tumor | glioblastoma multiforme | Brain | Male | 43 | 0 | 466 | | |
TCGA-02-2485-01 | TCGA-02-2485 | Primary Tumor | glioblastoma multiforme | Brain | Male | 53 | 0 | 470 | | |
TCGA-02-2486-01 | TCGA-02-2486 | Primary Tumor | glioblastoma multiforme | Brain | Male | 64 | 0 | 493 | | |
TCGA-04-1348-01 | TCGA-04-1348 | Primary Tumor | ovarian serous cystadenocarcinoma | Ovary | Female | 44 | 1 | 1483 | | |
TCGA-04-1357-01 | TCGA-04-1357 | Primary Tumor | ovarian serous cystadenocarcinoma | Ovary | Female | 52 | | | | |
TCGA-04-1362-01 | TCGA-04-1362 | Primary Tumor | ovarian serous cystadenocarcinoma | Ovary | Female | 59 | 1 | 1348 | | |
TCGA-05-4244-01 | TCGA-05-4244 | Primary Tumor | lung adenocarcinoma | Lung | Male | 70 | | | | |
TCGA-05-4249-01 | TCGA-05-4249 | Primary Tumor | lung adenocarcinoma | Lung | Male | 67 | 0 | 1523 | 0 | 1523 |
TCGA-05-4250-01 | TCGA-05-4250 | Primary Tumor | lung adenocarcinoma | Lung | Female | 79 | 1 | 121 | | |
TCGA-05-4382-01 | TCGA-05-4382 | Primary Tumor | lung adenocarcinoma | Lung | Male | 68 | 0 | 607 | 1 | 334 |
Unfortunately, restricting the data to "Primary Tumor" removes "Acute Myeloid Leukemia" from the dataset. AML is classified as "Primary Blood Derived Cancer - Peripheral Blood".
Besides this, the PR looks good to me.
As a general comment, while I think it is definitely good for the ML group to have a single dataset that everyone is working on, restricting it like this may not be the optimal solution. Eventually the data will need to be more fluid and subset on the fly depending on different rules which we will need to define later. (e.g. Unsupervised feature construction should not remove gene expression samples that don't have mutation status)
@gwaygenomics, amazing catch. Here's the unfiltered breakdown of disease versus sample_type (created using pandas.crosstab):
disease | Additional - New Primary | Additional Metastatic | Metastatic | Primary Blood Derived Cancer - Peripheral Blood | Primary Tumor | Recurrent Tumor | Solid Tissue Normal |
---|---|---|---|---|---|---|---|
acute myeloid leukemia | 0 | 0 | 0 | 200 | 0 | 0 | 0 |
adrenocortical cancer | 0 | 0 | 0 | 0 | 92 | 0 | 0 |
bladder urothelial carcinoma | 0 | 0 | 1 | 0 | 412 | 0 | 23 |
brain lower grade glioma | 0 | 0 | 0 | 0 | 516 | 14 | 0 |
breast invasive carcinoma | 0 | 0 | 7 | 0 | 1101 | 0 | 133 |
cervical & endocervical cancer | 0 | 0 | 2 | 0 | 308 | 0 | 3 |
cholangiocarcinoma | 0 | 0 | 0 | 0 | 36 | 0 | 9 |
colon adenocarcinoma | 0 | 0 | 1 | 0 | 461 | 1 | 85 |
diffuse large B-cell lymphoma | 0 | 0 | 0 | 0 | 48 | 0 | 0 |
esophageal carcinoma | 0 | 0 | 1 | 0 | 185 | 0 | 18 |
glioblastoma multiforme | 0 | 0 | 0 | 0 | 594 | 13 | 6 |
head & neck squamous cell carcinoma | 0 | 0 | 2 | 0 | 528 | 0 | 74 |
kidney chromophobe | 0 | 0 | 0 | 0 | 66 | 0 | 25 |
kidney clear cell carcinoma | 1 | 0 | 0 | 0 | 536 | 0 | 407 |
kidney papillary cell carcinoma | 1 | 0 | 0 | 0 | 291 | 0 | 60 |
liver hepatocellular carcinoma | 0 | 0 | 0 | 0 | 377 | 2 | 59 |
lung adenocarcinoma | 0 | 0 | 0 | 0 | 520 | 2 | 120 |
lung squamous cell carcinoma | 0 | 0 | 0 | 0 | 506 | 0 | 120 |
mesothelioma | 0 | 0 | 0 | 0 | 87 | 0 | 0 |
ovarian serous cystadenocarcinoma | 0 | 0 | 0 | 0 | 592 | 19 | 12 |
pancreatic adenocarcinoma | 0 | 0 | 1 | 0 | 185 | 0 | 10 |
pheochromocytoma & paraganglioma | 3 | 0 | 2 | 0 | 179 | 0 | 3 |
prostate adenocarcinoma | 0 | 0 | 1 | 0 | 498 | 0 | 67 |
rectum adenocarcinoma | 0 | 0 | 0 | 0 | 167 | 1 | 16 |
sarcoma | 0 | 0 | 1 | 0 | 261 | 3 | 6 |
skin cutaneous melanoma | 0 | 1 | 369 | 0 | 104 | 0 | 2 |
stomach adenocarcinoma | 0 | 0 | 0 | 0 | 478 | 0 | 103 |
testicular germ cell tumor | 6 | 0 | 0 | 0 | 150 | 0 | 0 |
thymoma | 0 | 0 | 0 | 0 | 124 | 0 | 2 |
thyroid carcinoma | 0 | 0 | 8 | 0 | 507 | 0 | 65 |
uterine carcinosarcoma | 0 | 0 | 0 | 0 | 57 | 0 | 0 |
uterine corpus endometrioid carcinoma | 0 | 0 | 0 | 0 | 547 | 1 | 47 |
uveal melanoma | 0 | 0 | 0 | 0 | 80 | 0 | 0 |
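For readers who want to reproduce this kind of breakdown, here is a minimal sketch of the `pandas.crosstab` call. The toy `samples` frame and its sample IDs are made up for illustration; only the column names `disease` and `sample_type` come from the table above, and the real script may differ.

```python
import pandas as pd

# Hypothetical stand-in for the unfiltered clinical table (one row per sample).
samples = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3", "S4", "S5"],
    "disease": [
        "acute myeloid leukemia",
        "lung adenocarcinoma",
        "lung adenocarcinoma",
        "skin cutaneous melanoma",
        "skin cutaneous melanoma",
    ],
    "sample_type": [
        "Primary Blood Derived Cancer - Peripheral Blood",
        "Primary Tumor",
        "Solid Tissue Normal",
        "Metastatic",
        "Primary Tumor",
    ],
})

# Rows are diseases, columns are sample types, cells are sample counts.
counts = pd.crosstab(samples["disease"], samples["sample_type"])
print(counts)
```

Disease and sample-type combinations that never occur show up as 0, which is what makes the all-zero "Primary Tumor" cell for acute myeloid leukemia easy to spot.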
@gwaygenomics, any insight on what this "Additional - New Primary" type refers to?
Including "Additional - New Primary" causes multiple samples per patient, so scratch it.
In fb1ae155240c4383b227b86a1e20b33ce46e9fb1, samples whose type was "Primary Blood Derived Cancer - Peripheral Blood" were retained. This increased the number of samples from 10593 to 10793 while maintaining patient uniqueness.
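The filter described here can be sketched roughly as follows. This is an illustration under assumptions, not the code from the commit: the function name `filter_samples`, the constant `KEEP_TYPES`, and the toy IDs are hypothetical, while the column names and sample-type strings are taken from the tables above.

```python
import pandas as pd

# Sample types to retain (strings as they appear in the sample_type column).
KEEP_TYPES = [
    "Primary Tumor",
    "Primary Blood Derived Cancer - Peripheral Blood",
]

def filter_samples(samples: pd.DataFrame) -> pd.DataFrame:
    """Keep selected sample types and verify one sample per patient."""
    kept = samples[samples["sample_type"].isin(KEEP_TYPES)]
    if not kept["patient_id"].is_unique:
        raise ValueError("filtering left multiple samples for some patient")
    return kept

# Hypothetical toy data: P1 and P3 each have a second, non-retained sample.
samples = pd.DataFrame({
    "sample_id": ["P1-01", "P1-11", "P2-03", "P3-01", "P3-02"],
    "patient_id": ["P1", "P1", "P2", "P3", "P3"],
    "sample_type": [
        "Primary Tumor",
        "Solid Tissue Normal",
        "Primary Blood Derived Cancer - Peripheral Blood",
        "Primary Tumor",
        "Recurrent Tumor",
    ],
})

filtered = filter_samples(samples)
print(filtered)
```

Raising on duplicate patients, rather than silently dropping rows, makes a sample type such as "Additional - New Primary" fail loudly if it is ever added to the retained list.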
However, it seems that 200 added "acute myeloid leukemia" samples were missing mutation or expression data, as no new samples were added to the datasets. @gwaygenomics does this sound right?
> As a general comment, while I think it is definitely good for the ML group to have a single dataset that everyone is working on, restricting it like this may not be the optimal solution. Eventually the data will need to be more fluid and subset on the fly depending on different rules which we will need to define later. (e.g. Unsupervised feature construction should not remove gene expression samples that don't have mutation status)
Yeah, saving unsubsetted datasets is always a good idea, but I've been holding out to avoid confusion and since we don't have a good LFS solution currently. I therefore advocate for adding these files on an as-needed basis.
In the meantime, here's the expression deficit we're dealing with. There are 9,283 unique patients in the expression dataset. Our filtering whittles down the number of unique patients/samples to 7,306. So for some applications the 2k additional samples may matter (especially if you desire non-carcinogenic expression signatures).
Still outstanding is how to filter the full dataset to one sample per patient (or whether this is even necessary) for unsupervised feature construction.
> However, it seems that 200 added "acute myeloid leukemia" samples were missing mutation or expression data, as no new samples were added to the datasets. @gwaygenomics does this sound right?
I have directly worked with this AML data before - not sure why it would be missing here. I think it may have to do with Xena version inconsistencies.
> There are 9,283 unique patients in the expression dataset.
For now... I haven't been advocating much for dataset refinement because I know it will be updated once the main TCGA papers are published and protected data are unembargoed. There's definitely lots of value in what we have been doing, but the scripts will need to be rerun once new (frozen) data are released.
Let's get this merged so people can begin playing with the cleaner clinical data matrix!
I uploaded the datasets from this release to figshare: https://doi.org/10.6084/m9.figshare.3487685.v4
It looks like the filename is samples.tsv - should it be samples.tsv.bz2?
@gwaygenomics, samples.tsv is not bzip2 compressed.
Regarding the lack of samples with "acute myeloid leukemia" in the aligned dataset: of the 200 "acute myeloid leukemia" samples, 173 had expression data and 0 had mutation data. According to @gwaygenomics:
> I have directly worked with this AML data before - not sure why it would be missing here. I think it may have to do with Xena version inconsistencies.
Therefore this issue sounds like #16, where a past PANCAN_mutation version contained mutations for additional samples. CCing @jingchunzhu in case the lack of mutation data for AML samples is an upstream bug.
I assume you are using the mutationVector format data, not the gene-level non-silent mutation data?
AML data has only been called on hg18, not on hg19 genomes. The single mutationVector dataset needs to be coherent in terms of the genomic positions, all on hg19. Typically it is also not a good idea to map SNVs between genomes. It is best to re-align reads and re-call mutations on a new genome.
@jingchunzhu, we're using the PANCAN_mutation dataset, which is variant level rather than gene level. The online documentation states:
> TCGA pan-cancer somatic mutation data compiled from all TCGA cohorts where hg19 mutation calls are available.
So it makes sense that AML samples are omitted because they haven't been called on hg19.
Keeps only samples with type equal to "Primary Tumor". This filters out multiple samples from the same patient, which could cause an issue for machine learning due to dependent observations (discussed in #10). This filter reduced the number of samples with expression and mutation data from 7,705 to 7,306.
Closes #10: all variables in PANCAN_clinicalMatrix that could help with sample selection or serve as covariates are extracted to data/samples.tsv. Relies on documentation of PANCAN_clinicalMatrix variables provided by the Xena Browser team in #14.
Closes #17: only sample_ids with expression, mutation, and clinical data are output to data/.
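One way to restrict the outputs to shared sample_ids is to intersect the indexes of the three tables. The sketch below assumes each table is a pandas DataFrame indexed by sample_id; the function name `align_on_samples` and the toy frames are hypothetical and not taken from the PR.

```python
import pandas as pd

def align_on_samples(clinical, expression, mutation):
    """Subset three sample-indexed frames to their shared sample_ids."""
    shared = (
        clinical.index
        .intersection(expression.index)
        .intersection(mutation.index)
        .sort_values()
    )
    return clinical.loc[shared], expression.loc[shared], mutation.loc[shared]

# Hypothetical toy frames, each indexed by sample_id.
clinical = pd.DataFrame({"disease": ["x", "y", "z", "w"]}, index=["A", "B", "C", "D"])
expression = pd.DataFrame({"GENE1": [0.1, 0.2, 0.3, 0.4]}, index=["B", "C", "D", "E"])
mutation = pd.DataFrame({"GENE1": [0, 1, 0]}, index=["C", "D", "F"])

clin, expr, mut = align_on_samples(clinical, expression, mutation)
print(list(clin.index))
```

Under this scheme a sample that lacks any one data type is dropped from all outputs, which is consistent with the AML samples disappearing from the aligned dataset once they had no mutation calls.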