cognoma / core-service

Cognoma Core API
Other
9 stars 12 forks source link

Current dataset seems to have an issue #99

Open kurtwheeler opened 6 years ago

kurtwheeler commented 6 years ago

While testing https://github.com/cognoma/core-service/pull/98 I ran into the following issue while it was trying to load the Mutations table:

django.db.utils.IntegrityError: insert or update on table "mutations" violates foreign key constraint "mutations_sample_id_6a37562c_fk_samples_sample_id"
DETAIL:  Key (sample_id)=(TCGA-04-1348-01) is not present in table "samples".

Checking https://raw.githubusercontent.com/cognoma/cancer-data/master/data/samples.tsv for the string TCGA-04 turned up nothing, so either the samples table is missing data or the mutations table has rows it shouldn't.

dhimmel commented 6 years ago

The following returns True. In the latest GitHub data, samples.tsv and mutation-matrix.tsv.bz2 are sample-aligned. This code requires a recent version of pandas that can read bz2 compressed files from URL.

import pandas

url = 'https://github.com/cognoma/cancer-data/raw/383668e12a80ccbcc75a4930023aed16afbd208b/data/samples.tsv'
sample_df = pandas.read_table(url)

url = 'https://github.com/cognoma/cancer-data/raw/383668e12a80ccbcc75a4930023aed16afbd208b/data/mutation-matrix.tsv.bz2'
mutation_df = pandas.read_table(url)

all(mutation_df.sample_id == sample_df.sample_id)

Let me check the figshare: https://doi.org/10.6084/m9.figshare.3487685.v7.

dhimmel commented 6 years ago

Checksums I compute locally for cognoma/cancer-data@383668e12a80ccbcc75a4930023aed16afbd208b:

$ md5sum data/*.tsv.bz2 data/samples.tsv 
32d561ba90aa6efd2cf3fc29125118fc  data/expression-matrix.tsv.bz2
acd20fbc57ef4f61a515a89c09fb86a3  data/mutation-matrix.tsv.bz2
ef0b6dc40658f66934937e29c5749cca  data/samples.tsv

According to the figshare API, these checksums match the v7 figshare deposits. See https://api.figshare.com/v2/articles/3487685/versions/7

So I think the sample_ids in the raw data are aligned.

dhimmel commented 6 years ago

No matches from the following:

grep TCGA-04-1348-01 data/samples.tsv
grep TCGA-04-1348-01 data/mutation-matrix.tsv.bz2

So I guess the question is why is core-service requesting an ID that doesn't exist? Perhaps TCGA-04-1348-01 used to exist but no longer exists in v7?

gwaybio commented 6 years ago

hmm, this is interesting. I am glad that core-services caught this, even if it wasn't supposed to. I did some digging into why.

It looks like tumors with 04 acronyms (e.g. TCGA-04-1348-01) are high grade serous ovarian (HGSC) cancer tumors. HGSC has been of primary focus for me in past analyses.

Clinical Matrix all Primary Samples (that weren't redacted)

path = os.path.join('download', 'Survival_SupplementalTable_S1_20171025_xena_sp.tsv.gz')

# Mapping to rename and filter columns
renamer = collections.OrderedDict([
    ('sample', 'sample_id'),
    ('_PATIENT', 'patient_id'),
    ('cancer type abbreviation', 'acronym'),
    ('__placeholder__', 'disease'),
    ('age_at_initial_pathologic_diagnosis', 'age_diagnosed'),
    ('gender', 'gender'),
    ('race', 'race'),
    ('ajcc_pathologic_tumor_stage', 'ajcc_stage'),
    ('clinical_stage', 'clinical_stage'),
    ('histological_type', 'histological_type'),
    ('histological_grade', 'histological_grade'),
    ('initial_pathologic_dx_year', 'initial_pathologic_dx_year'),
    ('menopause_status', 'menopause_status'),
    ('birth_days_to', 'birth_days_to'),
    ('vital_status', 'vital_status'),
    ('tumor_status', 'tumor_status'),
    ('last_contact_days_to', 'last_contact_days_to'),
    ('death_days_to', 'death_days_to'),
    ('cause_of_death', 'cause_of_death'),
    ('new_tumor_event_type', 'new_tumor_event_type'),
    ('new_tumor_event_site', 'new_tumor_event_site'),
    ('new_tumor_event_site_other', 'new_tumor_event_site_other'),
    ('new_tumor_event_dx_days_to', 'days_recurrence_free'),
    ('treatment_outcome_first_course', 'treatment_outcome_first_course'),
    ('margin_status', 'margin_status'),
    ('residual_tumor', 'residual_tumor'),
    ('_EVENT', 'event_status'),
    ('_TIME_TO_EVENT', 'event_days'),
    ('OS', 'dead'),
    ('OS.time', 'days_survived'),
    ('DSS', 'disease_specific_survival_status'),
    ('DSS.time', 'disease_specific_survival_days'),
    ('DFI', 'disease_free_interval_status'),
    ('DFI.time', 'disease_free_interval_days'),
    ('PFI', 'progression_free_interval_status'),
    ('PFI.time', 'progression_free_interval_days'),
    ('Redaction', 'Redaction')
])

clinmat_df = (
    pandas.read_table(path)
    .rename(columns=renamer)
    .merge(disease_df, how='left')
    [list(renamer.values())]
    .set_index('sample_id', drop=False)
    .sort_values('sample_id')
)

# Fix capitalization of gender and race
clinmat_df.gender = clinmat_df.gender.str.title()
clinmat_df.race = clinmat_df.race.str.title()

# Extract sample-type with the code dictionary
clinmat_df = clinmat_df.assign(sample_type = clinmat_df.sample_id.str[-2:])
clinmat_df.sample_type = clinmat_df.sample_type.replace(sampletype_codes_dict)

# Remove "Redacted" tumors
# These patients either withdrew consent or had genome data mismatch errors
clinmat_df = clinmat_df.query('Redaction != "Redacted"').drop("Redaction", axis=1)

# Keep only these sample types
# filters duplicate samples per patient
sample_types = {
    'Primary Solid Tumor',
    'Primary Blood Derived Cancer - Peripheral Blood',
}
clinmat_df.query("sample_type in @sample_types", inplace=True)

Results in:

# Get acronym value counts
clinmat_df.acronym.value_counts()

BRCA    1092
GBM      588
UCEC     547
OV       537
KIRC     536
HNSC     528
LUAD     519
LGG      515
THCA     507
LUSC     504
PRAD     498
COAD     457
STAD     443
BLCA     409
LIHC     376
CESC     307
KIRP     291
SARC     261
LAML     200
ESCA     185
PAAD     185
PCPG     179
READ     166
TGCT     134
THYM     124
SKCM     108
ACC       92
MESO      87
UVM       80
KICH      66
UCS       57
DLBC      48
CHOL      36
Name: acronym, dtype: int64

Filtered Clinical Matrix (samples with mutation, RNAseq, and Clinical attributes)

sample_ids = sorted(clinmat_df.index & gene_mutation_mat_df.index & expr_df.index)

# Filter expression (x) and mutation (y) matrices for common samples
sample_df = clinmat_df.loc[sample_ids, :]

Results in:

sample_df.acronym.value_counts()

BRCA    787
LGG     510
LUAD    508
HNSC    499
PRAD    492
THCA    489
LUSC    477
UCEC    436
STAD    411
BLCA    403
KIRC    366
LIHC    357
COAD    287
CESC    286
KIRP    280
SARC    234
ESCA    183
PCPG    178
PAAD    168
GBM     149
TGCT    128
THYM    118
SKCM    103
READ     89
MESO     81
UVM      80
ACC      79
KICH     66
UCS      57
DLBC     37
CHOL     36
OV       14
Name: acronym, dtype: int64

OV (aka HGSC) goes from 537 to 14!

Why does OV drop so many samples?

My thought was that many of these OV tumors were filtered because they didn't have observed mutations. Indeed, here are all of the samples that were filtered from the mutation set.

sample_ids = set(clinmat_df.index) - set(gene_mutation_mat_df.index)

# Filter expression (x) and mutation (y) matrices for common samples
sample_df = clinmat_df.loc[sample_ids, :]
sample_df.acronym.value_counts()

OV      475
BRCA    303
GBM     279
LAML    200
COAD    169
KIRC    168
UCEC    100
READ     77
SARC     25
LUSC     24
HNSC     21
CESC     18
THCA     16
LIHC     14
DLBC     11
KIRP     10
PAAD     10
LUAD      7
MESO      6
TGCT      6
LGG       5
STAD      5
PRAD      5
SKCM      4
THYM      2
BLCA      2
PCPG      1
ESCA      1
Name: acronym, dtype: int64

Besides decimating the ovarian cancer sample set, this step also removes many additional tumors. My thought now is that we remove tumors without any "red" mutations. Relaxing this guideline a bit would be an easy win and could recover many samples, but there are a bit more things to consider here. I intend on following up with this search in cognoma/cancer-data

cgreene commented 6 years ago

I was wondering why our HGSC numbers were so low. Can you use the PanCancerAtlas criteria?

kurtwheeler commented 6 years ago

Ah well this explains it. I just reran the scripts without changing the locations of the data because I thought it had been updated in place.

@dhimmel can you confirm https://github.com/cognoma/cancer-data/raw/383668e12a80ccbcc75a4930023aed16afbd208b/data/mutation-matrix.tsv.bz2 is the location we should be pulling mutation data? Actually while you're at it can you check the rest of the urls in the acquiredata script?

Please remember that I have no context for why that script is the way it is nor why some of them are coming from raw.githubusercontent.com and some are coming from figshare. Likewise I have no idea what the relation between the figshare URLs you have posted and the raw.githubusercontent.com links you used in your python code above.

dhimmel commented 6 years ago

I just reran the scripts without changing the locations of the data because I thought it had been updated in place.

I support always using versioning such that the raw data URLs we use never point to changed versions of the data.

why some of them are coming from raw.githubusercontent.com and some are coming from figshare

In the past, we used fighsare to store large files, which did not fit in GitHub. However, we recently updated cancer-data to use Git LFS. As such, we can now download all files from GitHub. There is no need for the figshare. Therefore, I'd suggest switching everything to versioned GitHub URLs. All the files will be in this directory.

So in other words, we should either be using ALL figshare or ALL github URLs at this point. Given that uploading to figshare requires an extra step, we may want to switch to all github URLs, which will be easiest going forward.

You can add a single commit variable which we can change to control the source data version.

kurtwheeler commented 6 years ago

Cool, sounds good to me. I can point to that directory.

kurtwheeler commented 6 years ago

One issue with trying to use that directory: there's no diseases.tsv. Has this file been renamed?

dhimmel commented 6 years ago

One issue with trying to use that directory

diseases.tsv lives in the download directory. You can use the same commit hash, just change the path to download/diseases.tsv.

kurtwheeler commented 6 years ago

Is there a reason that's not consistent? I can make this Just Work with that being it's location, but it involves a tad bit extra code to correct for something that seems like it may not have been intentional.

dhimmel commented 6 years ago

Is there a reason that's not consistent

download is for inputs that we incorporate from external locations. data is for datasets we produce. Honestly, diseases.tsv fits in either location. Happy to move it if that will be helpful (after https://github.com/cognoma/cancer-data/pull/44 is merged).

kurtwheeler commented 6 years ago

Ah okay, I don't care much one way or another, just didn't want to write code to workaround a bug. However while testing my code for the diseases.tsv I noticed that genes.tsv is also missing. Has that been renamed to expression-genes.tsv?

dhimmel commented 6 years ago

Currently, genes.tsv is being retrieved from the cognoma/genes repository:

https://github.com/cognoma/core-service/blob/b9b2e4f37ac5b250d53ba30c08d94bf38d89c2dd/api/management/commands/acquiredata.py#L31-L34

If you continue to retrieve it from cognoma/genes, you should switch to using the version used by the version of cancer-data you consume. This commit is hardcoded in 0.genes-download.ipynb.

However, we'll need to look into how the genes table gets used. It may be possible that mutation-genes.tsv from cancer-data is more appropriate if we only care about genes in the context of mutations. mutation-genes.tsv contains one row per gene, but only for genes with mutation data.

dhimmel commented 6 years ago

New commit for cancer-data datasets: https://github.com/cognoma/cancer-data/commit/93e4c53dc3d58df4cf52d1a40179d62ccbc0b985

kurtwheeler commented 6 years ago

Note to self, use this for diseases: https://github.com/cognoma/cancer-data/blob/master/mapping/diseases.tsv