cognoma / cancer-data

TCGA data acquisition and processing for Project Cognoma

TCGA PanCanAtlas Paper/Data Release #40

Closed gwaybio closed 6 years ago

gwaybio commented 6 years ago

The PanCanAtlas released 27 open access papers and updated data last week.

The UCSC Xena team also added this version to their database! An overview of the updated data is here.

We should update our download and the figshare so that Cognoma runs with the most recent PanCanAtlas version.

gwaybio commented 6 years ago

I am working on adding this now

dhimmel commented 6 years ago

> I am working on adding this now

Excited to review PR.

I wonder how much Entrez Gene has changed and whether we should also update cognoma/genes?

gwaybio commented 6 years ago

> I wonder how much Entrez Gene has changed and whether we should also update cognoma/genes?

It may have changed somewhat. Here are the unmapped genes from mapping/HiSeqV2-genes/map-HiSeqV2-genes.ipynb:

{'100130426',
 '100133144',
 '100134869',
 '10357',
 '10431',
 '136542',
 '155060',
 '26823',
 '280660',
 '317712',
 '340602',
 '388795',
 '390284',
 '391343',
 '391714',
 '404770',
 '441362',
 '442388',
 '553137',
 '57714',
 '645851',
 '652919',
 '653553',
 '728045',
 '728603',
 '728788',
 '729884',
 '8225',
 '90288'}
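A minimal sketch of how a set of unmapped IDs like the one above can be computed, assuming the expression matrix columns are Entrez Gene IDs stored as strings and a reference set of current IDs is available. The variable names and the toy IDs are hypothetical and are not taken from the notebook:

```python
# Hypothetical inputs: gene IDs used as expression matrix columns, and the
# Entrez Gene IDs present in the current gene reference table.
expression_gene_ids = ['10357', '8225', '7157', '1956']
known_entrez_ids = {'7157', '1956', '672'}

# IDs that appear in the expression matrix but not in the reference.
unmapped = set(expression_gene_ids) - known_entrez_ids

# Sort numerically for display, since the IDs are stored as strings.
print(sorted(unmapped, key=int))
```

A large unmapped set after a reference update would suggest that the gene resource should be refreshed as well.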

There also appears to be an issue we might need to reach out to the UCSC Xena team about.

1,751 out of 11,069 samples have NA gene expression values (it affects 4,196 out of 20,531 genes). I am double-checking against the Synapse resource now.
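Counts of this kind can be reproduced with pandas. The sketch below assumes a samples-by-genes matrix; the toy `expr` frame and its values are hypothetical stand-ins for the real download:

```python
import numpy as np
import pandas as pd

# Toy matrix: rows are samples, columns are genes (hypothetical values).
expr = pd.DataFrame(
    {'gene_a': [1.0, np.nan, 3.0, 2.0],
     'gene_b': [0.5, 0.7, 0.1, 0.9],
     'gene_c': [np.nan, np.nan, 4.0, 1.0]},
    index=['s1', 's2', 's3', 's4'],
)

is_na = expr.isna()

# A sample is affected if any of its genes is NA; a gene is affected
# if it is NA in any sample.
n_samples_with_na = int(is_na.any(axis='columns').sum())
n_genes_with_na = int(is_na.any(axis='index').sum())

print(n_samples_with_na, 'of', expr.shape[0], 'samples have NA values')
print(n_genes_with_na, 'of', expr.shape[1], 'genes have NA values')
```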

gwaybio commented 6 years ago

> 1,751 out of 11,069 samples have NA gene expression values (it affects 4,196 out of 20,531 genes).

From the Synapse resource:

> Genes with mostly zero reads or with residual batch effects (approx. 2-3k or 10% of genes) were removed from the adjusted samples and replaced with NAs. No genes were removed from samples with "No Change" status.

I think replacing all NA values with 0 is the way to go. The genes with systematic batch effects (which become columns of all zeros) will be filtered out when we select genes by median absolute deviation (MAD). Genes with mostly zero reads would be close to zero anyway.
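A minimal sketch of this idea, assuming MAD here means the median absolute deviation computed per gene and that the top-ranked genes are kept. The toy data and the `n_top` cutoff are hypothetical:

```python
import numpy as np
import pandas as pd

# Toy samples-by-genes matrix (hypothetical values); gene_b is entirely NA.
expr = pd.DataFrame(
    {'gene_a': [1.0, 5.0, 9.0, 3.0],
     'gene_b': [np.nan, np.nan, np.nan, np.nan],
     'gene_c': [2.0, 2.1, np.nan, 1.9]},
    index=['s1', 's2', 's3', 's4'],
)

# Replace every NA with 0.
filled = expr.fillna(0)

# Median absolute deviation per gene: median of |x - median(x)|.
mad = (filled - filled.median()).abs().median()

# Keep the n_top most variable genes (n_top is an illustrative choice).
n_top = 2
top_genes = mad.sort_values(ascending=False).head(n_top).index
filtered = filled[top_genes]

print(mad)
print(list(filtered.columns))
```

A column that is all zeros after filling has a MAD of exactly 0, so it ranks last and drops out whenever `n_top` is smaller than the number of genes with nonzero MAD.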

dhimmel commented 6 years ago

> Genes with mostly zero reads or with residual batch effects (approx. 2-3k or 10% of genes) were removed from the adjusted samples and replaced with NAs. No genes were removed from samples with "No Change" status.

Interesting. So will an entire column be zero (all values for that gene), or only the values for some samples? We should remove any genes where all values are missing, or even where a high percentage is missing. It seems that a value missing due to "residual batch effects" is not really a zero, although "mostly zero reads" is reasonably treated as zero.
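One way to express that rule is to compute the fraction of missing values per gene and drop genes above a cutoff. This is only a sketch: the toy data and the `max_missing` threshold are hypothetical, and no particular cutoff was agreed on in this thread:

```python
import numpy as np
import pandas as pd

# Toy samples-by-genes matrix (hypothetical values).
expr = pd.DataFrame(
    {'gene_a': [1.0, 2.0, 3.0, 4.0],
     'gene_b': [np.nan, np.nan, np.nan, np.nan],
     'gene_c': [np.nan, np.nan, np.nan, 1.0],
     'gene_d': [np.nan, 2.0, 3.0, 4.0]},
    index=['s1', 's2', 's3', 's4'],
)

# Fraction of samples in which each gene is missing.
missing_fraction = expr.isna().mean()

# Keep genes whose missing fraction does not exceed the cutoff.
max_missing = 0.5  # illustrative threshold, not a project setting
kept = expr.loc[:, missing_fraction <= max_missing]

print(missing_fraction)
print(list(kept.columns))
```

Setting `max_missing = 0` keeps only genes with complete measurements, while a value just below 1 removes only genes that are missing in every sample.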

gwaybio commented 6 years ago

I looked into this a bit more. Here is the distribution of genes with NA values by the number of impacted samples:

[Figure: na_in_geneexpression]

Also, all 29 of the unmapped_symbols are among these 4,196 genes with NA measurements, which leaves 4,167 impacted genes in our final list.

With this distribution, I agree that it is best to throw out all genes without complete measurements. I think this is preferable to throwing out samples.

[Figure: na_in_geneexpression_samples]
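The trade-off between dropping genes and dropping samples can be checked by comparing the shapes that each option leaves behind. This sketch uses a hypothetical toy matrix, so the numbers say nothing about the real distribution:

```python
import numpy as np
import pandas as pd

# Toy samples-by-genes matrix (hypothetical values): one gene is NA in
# most samples, the other genes are complete.
expr = pd.DataFrame(
    {'gene_a': [1.0, 2.0, 3.0, 4.0],
     'gene_b': [np.nan, np.nan, np.nan, 1.0],
     'gene_c': [5.0, 6.0, 7.0, 8.0]},
    index=['s1', 's2', 's3', 's4'],
)

# Option 1: drop every gene (column) that has at least one NA.
complete_genes = expr.dropna(axis='columns', how='any')

# Option 2: drop every sample (row) that has at least one NA.
complete_samples = expr.dropna(axis='index', how='any')

print('drop genes  :', complete_genes.shape)
print('drop samples:', complete_samples.shape)
```

When the NA values are concentrated in a subset of genes but spread over many samples, dropping genes preserves far more of the matrix.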

dhimmel commented 6 years ago

Some notes:

The x-axis of the first plot is "number of missing samples for gene".

So the final dimensions of the expression matrix will be 8,388 samples x 16,261 genes. The number of samples previously was 7,306 (cancers with expression, mutation, and clinical data). So this update is not a huge growth.

> With this distribution, I agree that it is best to throw out all genes without complete measurements. I think this is preferable to throwing out samples.

For Cognoma's use case where expression is used as features and one gene can compensate for the lack of another gene, I agree.
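The sample counts above come from keeping only samples that have expression, mutation, and clinical data. A minimal sketch of that intersection, with hypothetical sample IDs and variable names:

```python
# Hypothetical sample IDs available in each data type.
expression_samples = {'sample_1', 'sample_2', 'sample_3', 'sample_4'}
mutation_samples = {'sample_2', 'sample_3', 'sample_4', 'sample_5'}
clinical_samples = {'sample_1', 'sample_3', 'sample_4', 'sample_6'}

# Keep only the samples that are present in all three data types.
common_samples = expression_samples & mutation_samples & clinical_samples

print(sorted(common_samples))
print(len(common_samples), 'samples in common')
```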

gwaybio commented 6 years ago

The update also refines the mutation calls, so our Y matrices will be more accurate.

dhimmel commented 6 years ago

@gwaygenomics see the version date from the Xena JSON metadata:

https://github.com/cognoma/cancer-data/blob/383668e12a80ccbcc75a4930023aed16afbd208b/download/mc3.v0.2.8.PUBLIC.xena.json#L8

https://github.com/cognoma/cancer-data/blob/383668e12a80ccbcc75a4930023aed16afbd208b/download/EB%2B%2BAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.xena.json#L9

https://github.com/cognoma/cancer-data/blob/383668e12a80ccbcc75a4930023aed16afbd208b/download/Survival_SupplementalTable_S1_20171025_xena_sp.json#L7

Expression and mutation data are from 2016-12-29. Are we sure this is the final release or perhaps Xena still has not incorporated the newest data?
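A sketch of reading that date programmatically, assuming the metadata stores it under a top-level `version` key. The key name and the inline JSON text are assumptions for illustration, standing in for the contents of one of the `*.xena.json` files linked above:

```python
import json

# Hypothetical stand-in for the text of a *.xena.json metadata file.
metadata_text = '{"label": "example dataset", "version": "2016-12-29"}'

metadata = json.loads(metadata_text)

# .get returns None instead of raising if the key is absent.
version = metadata.get('version')
print('version date:', version)
```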

gwaybio commented 6 years ago

> Are we sure this is the final release or perhaps Xena still has not incorporated the newest data?

These are the final release files. It seems that the Xena team processed them on December 29, 2016. The PanCanAtlas RNAseq data was frozen on March 21, 2016. I am not sure when the MC3 mutation data was frozen (it was after March 21, 2016), but version 0.2.8 was also used in the publication.

kulshrestha97 commented 5 years ago

Hi all, I am working on the project "Multi-category Tumor Classification using Deep Learning based on Gene Expression Data". For that, I downloaded the following file: EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv. However, the dataset contains no tumor class labels. I am new to the genomics field, so please help me by providing a link to, or a method for downloading, the appropriate dataset.

It's urgent so any help will be appreciated.

dhimmel commented 5 years ago

@kulshrestha97, please only comment on a GitHub issue if your comment is related to previous discussion. For new questions such as yours, you should create new issues. However, cognoma/cancer-data is not an appropriate repository for your question, because we do not create EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv. Please consider requesting support directly from Xena.