Open kurtwheeler opened 6 years ago
The following returns True
. In the latest GitHub data, samples.tsv
and mutation-matrix.tsv.bz2
are sample-aligned. This code requires a recent version of pandas that can read bz2 compressed files from URL.
import pandas
url = 'https://github.com/cognoma/cancer-data/raw/383668e12a80ccbcc75a4930023aed16afbd208b/data/samples.tsv'
sample_df = pandas.read_table(url)
url = 'https://github.com/cognoma/cancer-data/raw/383668e12a80ccbcc75a4930023aed16afbd208b/data/mutation-matrix.tsv.bz2'
mutation_df = pandas.read_table(url)
all(mutation_df.sample_id == sample_df.sample_id)
Let me check the figshare: https://doi.org/10.6084/m9.figshare.3487685.v7.
Checksums I compute locally for cognoma/cancer-data@383668e12a80ccbcc75a4930023aed16afbd208b:
$ md5sum data/*.tsv.bz2 data/samples.tsv
32d561ba90aa6efd2cf3fc29125118fc data/expression-matrix.tsv.bz2
acd20fbc57ef4f61a515a89c09fb86a3 data/mutation-matrix.tsv.bz2
ef0b6dc40658f66934937e29c5749cca data/samples.tsv
According to the figshare API, these checksums match the v7 figshare deposits. See https://api.figshare.com/v2/articles/3487685/versions/7
So I think the sample_id
s in the raw data are aligned.
No matches from the following:
grep TCGA-04-1348-01 data/samples.tsv
grep TCGA-04-1348-01 data/mutation-matrix.tsv.bz2
So I guess the question is why is core-service requesting an ID that doesn't exist? Perhaps TCGA-04-1348-01
used to exist but no longer exists in v7?
hmm, this is interesting. I am glad that core-services caught this, even if it wasn't supposed to. I did some digging into why.
It looks like tumors with 04
acronyms (e.g. TCGA-04-1348-01) are high grade serous ovarian (HGSC) cancer tumors. HGSC has been of primary focus for me in past analyses.
path = os.path.join('download', 'Survival_SupplementalTable_S1_20171025_xena_sp.tsv.gz')
# Mapping to rename and filter columns
renamer = collections.OrderedDict([
('sample', 'sample_id'),
('_PATIENT', 'patient_id'),
('cancer type abbreviation', 'acronym'),
('__placeholder__', 'disease'),
('age_at_initial_pathologic_diagnosis', 'age_diagnosed'),
('gender', 'gender'),
('race', 'race'),
('ajcc_pathologic_tumor_stage', 'ajcc_stage'),
('clinical_stage', 'clinical_stage'),
('histological_type', 'histological_type'),
('histological_grade', 'histological_grade'),
('initial_pathologic_dx_year', 'initial_pathologic_dx_year'),
('menopause_status', 'menopause_status'),
('birth_days_to', 'birth_days_to'),
('vital_status', 'vital_status'),
('tumor_status', 'tumor_status'),
('last_contact_days_to', 'last_contact_days_to'),
('death_days_to', 'death_days_to'),
('cause_of_death', 'cause_of_death'),
('new_tumor_event_type', 'new_tumor_event_type'),
('new_tumor_event_site', 'new_tumor_event_site'),
('new_tumor_event_site_other', 'new_tumor_event_site_other'),
('new_tumor_event_dx_days_to', 'days_recurrence_free'),
('treatment_outcome_first_course', 'treatment_outcome_first_course'),
('margin_status', 'margin_status'),
('residual_tumor', 'residual_tumor'),
('_EVENT', 'event_status'),
('_TIME_TO_EVENT', 'event_days'),
('OS', 'dead'),
('OS.time', 'days_survived'),
('DSS', 'disease_specific_survival_status'),
('DSS.time', 'disease_specific_survival_days'),
('DFI', 'disease_free_interval_status'),
('DFI.time', 'disease_free_interval_days'),
('PFI', 'progression_free_interval_status'),
('PFI.time', 'progression_free_interval_days'),
('Redaction', 'Redaction')
])
clinmat_df = (
pandas.read_table(path)
.rename(columns=renamer)
.merge(disease_df, how='left')
[list(renamer.values())]
.set_index('sample_id', drop=False)
.sort_values('sample_id')
)
# Fix capitalization of gender and race
clinmat_df.gender = clinmat_df.gender.str.title()
clinmat_df.race = clinmat_df.race.str.title()
# Extract sample-type with the code dictionary
clinmat_df = clinmat_df.assign(sample_type = clinmat_df.sample_id.str[-2:])
clinmat_df.sample_type = clinmat_df.sample_type.replace(sampletype_codes_dict)
# Remove "Redacted" tumors
# These patients either withdrew consent or had genome data mismatch errors
clinmat_df = clinmat_df.query('Redaction != "Redacted"').drop("Redaction", axis=1)
# Keep only these sample types
# filters duplicate samples per patient
sample_types = {
'Primary Solid Tumor',
'Primary Blood Derived Cancer - Peripheral Blood',
}
clinmat_df.query("sample_type in @sample_types", inplace=True)
Results in:
# Get acronym value counts
clinmat_df.acronym.value_counts()
BRCA 1092
GBM 588
UCEC 547
OV 537
KIRC 536
HNSC 528
LUAD 519
LGG 515
THCA 507
LUSC 504
PRAD 498
COAD 457
STAD 443
BLCA 409
LIHC 376
CESC 307
KIRP 291
SARC 261
LAML 200
ESCA 185
PAAD 185
PCPG 179
READ 166
TGCT 134
THYM 124
SKCM 108
ACC 92
MESO 87
UVM 80
KICH 66
UCS 57
DLBC 48
CHOL 36
Name: acronym, dtype: int64
sample_ids = sorted(clinmat_df.index & gene_mutation_mat_df.index & expr_df.index)
# Filter expression (x) and mutation (y) matrices for common samples
sample_df = clinmat_df.loc[sample_ids, :]
Results in:
sample_df.acronym.value_counts()
BRCA 787
LGG 510
LUAD 508
HNSC 499
PRAD 492
THCA 489
LUSC 477
UCEC 436
STAD 411
BLCA 403
KIRC 366
LIHC 357
COAD 287
CESC 286
KIRP 280
SARC 234
ESCA 183
PCPG 178
PAAD 168
GBM 149
TGCT 128
THYM 118
SKCM 103
READ 89
MESO 81
UVM 80
ACC 79
KICH 66
UCS 57
DLBC 37
CHOL 36
OV 14
Name: acronym, dtype: int64
OV
(aka HGSC
) goes from 537 to 14!
My thought was that many of these OV tumors were filtered because they didn't have observed mutations. Indeed, here are all of the samples that were filtered from the mutation set.
sample_ids = set(clinmat_df.index) - set(gene_mutation_mat_df.index)
# Filter expression (x) and mutation (y) matrices for common samples
sample_df = clinmat_df.loc[sample_ids, :]
sample_df.acronym.value_counts()
OV 475
BRCA 303
GBM 279
LAML 200
COAD 169
KIRC 168
UCEC 100
READ 77
SARC 25
LUSC 24
HNSC 21
CESC 18
THCA 16
LIHC 14
DLBC 11
KIRP 10
PAAD 10
LUAD 7
MESO 6
TGCT 6
LGG 5
STAD 5
PRAD 5
SKCM 4
THYM 2
BLCA 2
PCPG 1
ESCA 1
Name: acronym, dtype: int64
Besides decimating the ovarian cancer sample set, this step also removes many additional tumors. My thought now is that we remove tumors without any "red" mutations. Relaxing this guideline a bit would be an easy win and could recover many samples, but there are a bit more things to consider here. I intend on following up with this search in cognoma/cancer-data
I was wondering why our HGSC numbers were so low. Can you use the PanCancerAtlas criteria?
Ah well this explains it. I just reran the scripts without changing the locations of the data because I thought it had been updated in place.
@dhimmel can you confirm https://github.com/cognoma/cancer-data/raw/383668e12a80ccbcc75a4930023aed16afbd208b/data/mutation-matrix.tsv.bz2
is the location we should be pulling mutation data? Actually while you're at it can you check the rest of the urls in the acquiredata script?
Please remember that I have no context for why that script is the way it is nor why some of them are coming from raw.githubusercontent.com and some are coming from figshare. Likewise I have no idea what the relation between the figshare URLs you have posted and the raw.githubusercontent.com links you used in your python code above.
I just reran the scripts without changing the locations of the data because I thought it had been updated in place.
I support always using versioning such that the raw data URLs we use never point to changed versions of the data.
why some of them are coming from raw.githubusercontent.com and some are coming from figshare
In the past, we used fighsare to store large files, which did not fit in GitHub. However, we recently updated cancer-data to use Git LFS. As such, we can now download all files from GitHub. There is no need for the figshare. Therefore, I'd suggest switching everything to versioned GitHub URLs. All the files will be in this directory.
So in other words, we should either be using ALL figshare or ALL github URLs at this point. Given that uploading to figshare requires an extra step, we may want to switch to all github URLs, which will be easiest going forward.
You can add a single commit
variable which we can change to control the source data version.
Cool, sounds good to me. I can point to that directory.
One issue with trying to use that directory: there's no diseases.tsv
. Has this file been renamed?
One issue with trying to use that directory
diseases.tsv
lives in the download directory. You can use the same commit hash, just change the path to download/diseases.tsv
.
Is there a reason that's not consistent? I can make this Just Work with that being it's location, but it involves a tad bit extra code to correct for something that seems like it may not have been intentional.
Is there a reason that's not consistent
download is for inputs that we incorporate from external locations. data is for datasets we produce. Honestly, diseases.tsv
fits in either location. Happy to move it if that will be helpful (after https://github.com/cognoma/cancer-data/pull/44 is merged).
Ah okay, I don't care much one way or another, just didn't want to write code to workaround a bug. However while testing my code for the diseases.tsv
I noticed that genes.tsv
is also missing. Has that been renamed to expression-genes.tsv
?
Currently, genes.tsv
is being retrieved from the cognoma/genes repository:
If you continue to retrieve it from cognoma/genes
, you should switch to using the version used by the version of cancer-data you consume. This commit is hardcoded in 0.genes-download.ipynb
.
However, we'll need to look into how the genes table gets used. It may be possible that mutation-genes.tsv
from cancer-data is more appropriate if we only care about genes in the context of mutations. mutation-genes.tsv
contains one row per gene, but only for genes with mutation data.
New commit for cancer-data datasets: https://github.com/cognoma/cancer-data/commit/93e4c53dc3d58df4cf52d1a40179d62ccbc0b985
Note to self, use this for diseases: https://github.com/cognoma/cancer-data/blob/master/mapping/diseases.tsv
While testing https://github.com/cognoma/core-service/pull/98 I ran into the following issue while it was trying to load the Mutations table:
Checking https://raw.githubusercontent.com/cognoma/cancer-data/master/data/samples.tsv for the string
TCGA-04
turned up nothing, so either the samples table is missing data or the mutations table has rows it shouldn't.