Closed by gwaybio 6 years ago
I am working on adding this now
Excited to review PR.
I wonder how much Entrez Gene has changed and whether we should also update cognoma/genes?
It may have updated some - here are the unmapped genes from mapping/HiSeqV2-genes/map-HiSeqV2-genes.ipynb
{'100130426',
'100133144',
'100134869',
'10357',
'10431',
'136542',
'155060',
'26823',
'280660',
'317712',
'340602',
'388795',
'390284',
'391343',
'391714',
'404770',
'441362',
'442388',
'553137',
'57714',
'645851',
'652919',
'653553',
'728045',
'728603',
'728788',
'729884',
'8225',
'90288'}
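For readers following along, a set like the one above can be produced with a plain set difference. This is only a minimal sketch: the names `expression_ids`, `known_entrez_ids`, and `find_unmapped` are hypothetical and are not taken from map-HiSeqV2-genes.ipynb, and the IDs in `known_entrez_ids` are placeholders.

```python
# Hypothetical inputs: gene IDs found in the expression matrix, and the
# Entrez Gene IDs present in the reference gene table (placeholder values).
expression_ids = {"10357", "8225", "7157", "672"}
known_entrez_ids = {"7157", "672", "1956"}


def find_unmapped(expression_ids, known_ids):
    """Return IDs present in the expression data but absent from the gene table."""
    return set(expression_ids) - set(known_ids)


unmapped = find_unmapped(expression_ids, known_entrez_ids)

# Sort numerically for readable output, since the IDs are stored as strings.
print(sorted(unmapped, key=int))
```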
There also appears to be an issue we might need to reach out to the UCSC Xena team about.
1,751 out of 11,069 samples have NA gene expression values (it affects 4,196 out of 20,531 genes). I am double checking with the synapse resource now.
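Counts like these can be checked with pandas. The sketch below uses a toy samples x genes matrix, not the real Xena file, and assumes the matrix has already been loaded with samples as rows and genes as columns; `expr` and the other variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the expression matrix: rows are samples, columns are genes.
expr = pd.DataFrame(
    {
        "gene_a": [1.0, 2.0, 3.0, 4.0],
        "gene_b": [0.5, np.nan, 1.5, np.nan],
        "gene_c": [np.nan, 0.0, 2.0, 1.0],
    },
    index=["s1", "s2", "s3", "s4"],
)

missing = expr.isna()

# Samples (rows) and genes (columns) with at least one NA value.
n_samples_with_na = int(missing.any(axis=1).sum())
n_genes_with_na = int(missing.any(axis=0).sum())

# Number of samples missing for each gene, useful for plotting a distribution.
na_per_gene = missing.sum(axis=0)

print(n_samples_with_na, n_genes_with_na)
print(na_per_gene)
```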
From the synapse resource:
Genes with mostly zero reads or with residual batch effects (approx. 2-3k or 10% of genes) were removed from the adjusted samples and replaced with NAs. No genes were removed from samples with "No Change" status.
I think replacing all NA genes with 0 is the way to go. The genes with systematic batch effects (columns of all zeros) will be filtered using MAD genes. Genes with mostly zero reads would be close to zero anyway.
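A rough sketch of that idea follows. It assumes "MAD genes" means ranking genes by median absolute deviation and keeping the top ones. The toy matrix and the cutoff `n_top` are illustrative, not the project's actual settings.

```python
import numpy as np
import pandas as pd

# Toy samples x genes matrix; gene_b is entirely NA.
expr = pd.DataFrame(
    {
        "gene_a": [1.0, 2.0, 3.0, 4.0],
        "gene_b": [np.nan, np.nan, np.nan, np.nan],
        "gene_c": [2.0, 4.0, 6.0, 8.0],
    },
    index=["s1", "s2", "s3", "s4"],
)

# Replace every NA with 0, so an all-NA gene becomes a column of zeros.
filled = expr.fillna(0)

# Median absolute deviation per gene, computed by hand: the median of the
# absolute deviations from each gene's own median.
mad = (filled - filled.median(axis=0)).abs().median(axis=0)

# Keep the n_top most variable genes; a constant column has MAD 0 and ranks last.
n_top = 2  # illustrative cutoff
top_genes = mad.sort_values(ascending=False).head(n_top).index.tolist()

print(mad)
print(top_genes)
```

A gene that is zero in every sample has a MAD of exactly 0, so it falls out of any top-N selection as long as N is smaller than the number of genes with nonzero MAD.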
Interesting. So will an entire column be zero (all values for that gene), or just some samples? We should remove any genes where all values are missing, or even where a high percentage is missing. It seems that a value missing due to "residual batch effects" is not really a zero, although "mostly zero reads" is reasonably zero.
I looked into this a bit more. Here is the distribution of genes with NA values by the number of impacted samples:
Also, all 29 of the unmapped_symbols are included in these 4,196 genes with NA measurements, leaving the number of impacted genes in our final list at 4,167.
With this distribution, I agree that it is best to throw out all genes without complete measurements. I think this is preferable to throwing out samples.
Some notes:
The x-axis for first plot is "number of missing samples for gene"
So the final dimension of the expression matrix will be 8,388 samples x 16,261 genes. The number of samples previously was 7,306 (cancers with expression, mutation, and clinical data). So this update is not a huge growth.
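Dropping genes without complete measurements, rather than dropping samples, could be sketched in pandas as below. The matrix is a toy example with samples as rows and genes as columns, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy samples x genes matrix with NA values in two of the three genes.
expr = pd.DataFrame(
    {
        "gene_a": [1.0, 2.0, 3.0, 4.0],
        "gene_b": [0.5, np.nan, 1.5, np.nan],
        "gene_c": [np.nan, 0.0, 2.0, 1.0],
    },
    index=["s1", "s2", "s3", "s4"],
)

# Drop every gene (column) that has at least one missing value.
# All samples (rows) are retained.
complete = expr.dropna(axis="columns", how="any")

print(expr.shape, "->", complete.shape)
```

Using `axis="index"` instead would drop samples with any missing gene, which is the alternative the comment above argues against.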
For Cognoma's use case where expression is used as features and one gene can compensate for the lack of another gene, I agree.
The update also provides refinement to mutation calls. So our Y matrices will be more accurate.
@gwaygenomics see the version date from the Xena JSON metadata:
Expression and mutation data are from 2016-12-29. Are we sure this is the final release or perhaps Xena still has not incorporated the newest data?
These are the final release files. It seems that the Xena team processed them on December 29th, 2016. The PanCanAtlas RNAseq data was frozen on 21-March-2016. I am not sure when the MC3 mutation data was frozen (it was after March 21, 2016), but version 0.2.8 was also used in the publication.
Hi all, I am working on a project, "Multi-category Tumor Classification using Deep Learning based on Gene Expression Data", and for it I have downloaded the following file: EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv. However, there are no tumor class labels in the dataset. I am new to the genomics field, so please help me by providing a link or a method to download the appropriate dataset.
It's urgent so any help will be appreciated.
@kulshrestha97, please only comment on a GitHub issue if your comment is related to the previous discussion. For new questions such as yours, you should create a new issue. However, cognoma/cancer-data is not an appropriate repository for your question, because we do not create EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv. Please consider requesting support directly from Xena.
The PanCanAtlas released 27 open access papers and updated data last week.
The UCSC Xena team also added this version to their database! An overview of the updated data is here.
We should update our download and the figshare so that Cognoma runs with the most recent PanCanAtlas version.