Closed komalsrathi closed 9 months ago
For more context, it looks like gene-counts-rsem-expected_count-collapsed.rds
contained TCGA counts in v12 and the TCGA cohort was not pulled out into a separate count file for v13-v14. Unintentionally missed.
I don't know if my local file is messed up but I loaded the v12 file and can't find TCGA counts in the gene-counts-rsem-expected_count-collapsed.rds file:
all_counts_v12 = readRDS("~/Projects/OpenPedCan-analysis/data/v12/gene-counts-rsem-expected_count-collapsed.rds")
> grep("TCGA", colnames(all_counts_v12))
integer(0)
Also, any idea why there are ENSG ids in the rows for v14 TCGA TPM data? This is supposed to be a collapsed file with gene names as rownames, right?
tpm_dat = readRDS("~/Projects/OpenPedCan-analysis/data/v14/tcga-gene-expression-rsem-tpm-collapsed.rds")
grep("^ENSG", rownames(tpm_dat)) %>% length()
[1] 15692
I don't know if my local file is messed up but I loaded the v12 file and can't find TCGA counts in the gene-counts-rsem-expected_count-collapsed.rds file:
all_counts_v12 = readRDS("~/Projects/OpenPedCan-analysis/data/v12/gene-counts-rsem-expected_count-collapsed.rds")
> grep("TCGA", colnames(all_counts_v12))
integer(0)
I am not sure these were ever in the combined counts matrix actually, since we had separate TPM files for them.
Also, any idea why there are ENSG ids in the rows for v14 TCGA TPM data? This is supposed to be a collapsed file with gene names as rownames, right?
tpm_dat = readRDS("~/Projects/OpenPedCan-analysis/data/v14/tcga-gene-expression-rsem-tpm-collapsed.rds")
grep("^ENSG", rownames(tpm_dat)) %>% length()
[1] 15692
This is expected: those former pseudogenes now all have ENSG ids as their official symbols, I suppose until they are renamed.
it looks like gene-counts-rsem-expected_count-collapsed.rds contained TCGA counts in v12
https://github.com/d3b-center/OpenPedCan-analysis/issues/549#issuecomment-1969615838 was in response to your comment above, I thought you meant that the gene-
file contained TCGA counts in v12 release and I was not able to find TCGA samples in that file.
I think my comment was wrong and we never included them in any release.
The last TCGA collapsed counts file, tcga-gene-counts-rsem-expected_count-collapsed.rds, is in the v11/ release (there is no collapsed counts file in v12-v14, and we have been using the v11 release to get TCGA data for tumor boards). But I think there was no liftover done for that release, right? Sample sizes also look different between the v11 and v12 TCGA matrices.
Wondering if there is an s3 location I could get this file from in the meantime?
Below is the comparison of the TPM data between v11 and v12-v14 (as there are no counts files for v12-v14, the counts cannot be compared). As you can see, both the number of samples and the number of genes differ between v11 and v12-v14.
ENSG ids as gene names. Just want to confirm.
# v11
> tpm_v11 = readRDS("~/Projects/OpenPedCan-analysis/data/v11/tcga-gene-expression-rsem-tpm-collapsed.rds")
> dim(tpm_v11)
[1] 59427 11123
> grep("ENSG", rownames(tpm_v11)) %>% length()
[1] 0
# v12
> tpm_v12 = readRDS("~/Projects/OpenPedCan-analysis/data/v12/tcga-gene-expression-rsem-tpm-collapsed.rds")
> dim(tpm_v12)
[1] 54320 10411
> grep("ENSG", rownames(tpm_v12)) %>% length()
[1] 15692
# v13
> tpm_v13 = readRDS("~/Projects/OpenPedCan-analysis/data/v13/tcga-gene-expression-rsem-tpm-collapsed.rds")
> dim(tpm_v13)
[1] 54320 10411
> grep("ENSG", rownames(tpm_v13)) %>% length()
[1] 15692
# v14
> tpm_v14 = readRDS("~/Projects/OpenPedCan-analysis/data/v14/tcga-gene-expression-rsem-tpm-collapsed.rds")
> dim(tpm_v14)
[1] 54320 10411
> grep("ENSG", rownames(tpm_v14)) %>% length()
[1] 15692
Any ideas @migbro @zhangb1 ?
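For anyone repeating the comparison above, a minimal sketch of a helper that reports the same three numbers (genes, samples, ENSG-named rows) for any collapsed matrix. The function name `summarize_matrix` and the toy matrix are assumptions for illustration only; with the real data you would pass in the objects returned by `readRDS()`.

```r
# Sketch only: summarize a collapsed expression matrix.
# `summarize_matrix` is a hypothetical helper, not part of OpenPedCan.
summarize_matrix <- function(mat) {
  c(
    n_genes   = nrow(mat),
    n_samples = ncol(mat),
    # rows whose name is still an Ensembl gene id rather than a symbol
    n_ensg    = sum(grepl("^ENSG", rownames(mat)))
  )
}

# Toy stand-in for a release matrix (made-up gene and sample names).
toy <- matrix(
  0, nrow = 3, ncol = 2,
  dimnames = list(c("GENE_A", "ENSG00000000001", "GENE_B"),
                  c("SAMPLE_1", "SAMPLE_2"))
)
res <- summarize_matrix(toy)
print(res)
```

Applying it to each release in turn, e.g. `sapply(list(v11 = tpm_v11, v12 = tpm_v12), summarize_matrix)`, would put the numbers side by side.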
Update: The TCGA counts file was missed earlier so Bo/Miguel are working on regenerating the liftover + collapsed TCGA counts file.
Regarding the sample size discrepancy, it looks like the TPM task output https://cavatica.sbgenomics.com/u/d3b-bixu-ops/open-target-tcga-rnaseq-counts/tasks/cd8a31d1-71d4-4098-b798-e2e2d01dea5f/ has all the samples from v11, so I am not sure when or where some of the samples were filtered out before the files were uploaded to the s3 bucket for v12-v14.
@komalsrathi is it possible clinical data is only available for the ones in v14 and that's why they were removed?
Yes, after checking v11, it makes sense. Even though those 712 samples were present in the v11 matrices, v11 did not have histology information for them either. Maybe that's why the decision was made to remove those samples from the TPM/counts matrices in later releases.
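A rough sketch of the check described above: find matrix samples with no histology record, then keep only the ones that have one. The sample ids, the `histology_ids` vector, and the toy matrix are all made up; with the real data, the ids would come from the release histologies file (its column names are not shown in this thread, so none are assumed here).

```r
# Toy stand-ins: a 2-gene x 4-sample matrix and the sample ids that
# have clinical/histology records (all names are hypothetical).
tpm <- matrix(
  1:8, nrow = 2,
  dimnames = list(c("GENE_A", "GENE_B"), c("S1", "S2", "S3", "S4"))
)
histology_ids <- c("S1", "S3", "S4")

# Samples present in the matrix but absent from the histology table.
missing_hist <- setdiff(colnames(tpm), histology_ids)

# Keep only samples with histology information; drop = FALSE keeps the
# result a matrix even if a single column remains.
tpm_subset <- tpm[, colnames(tpm) %in% histology_ids, drop = FALSE]

print(missing_hist)
print(dim(tpm_subset))
```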
Now, everything makes sense after subsetting the samples except the number of genes.
The liftover + collapsed + subsetted matrices have the following dimensions:
[1] 59287 10411
The current v14 TPM has the following dimensions:
[1] 54320 10411
I am not sure what filters were applied at the gene-level but maybe it is best to just subset to the genes in the current v14 TPM file. I'll upload to s3 and update once done.
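The gene-level subsetting mentioned above could look roughly like this. It is a sketch on toy matrices, not the code actually used to produce the uploaded file: it assumes both matrices use gene names as rownames and sample ids as colnames, and it restricts the counts to the genes and samples present in the TPM matrix, in the TPM matrix's order.

```r
# Toy stand-ins for the regenerated counts and the current v14 TPM matrix.
counts <- matrix(
  1:6, nrow = 3,
  dimnames = list(c("GENE_A", "GENE_B", "GENE_C"), c("S1", "S2"))
)
tpm <- matrix(
  0, nrow = 2, ncol = 2,
  dimnames = list(c("GENE_C", "GENE_A"), c("S1", "S2"))
)

# Genes shared by both matrices, in TPM row order.
shared_genes <- intersect(rownames(tpm), rownames(counts))

# Genes that the subsetting will drop from the counts matrix.
dropped_genes <- setdiff(rownames(counts), shared_genes)

# Subset and reorder counts so its rows and columns line up with TPM.
counts_subset <- counts[shared_genes, colnames(tpm), drop = FALSE]

print(dim(counts_subset))
print(dropped_genes)
```

Listing `dropped_genes` before writing the file is a cheap way to see which genes the unknown upstream filter removed.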
File uploaded to: s3://d3b-openaccess-us-east-1-prd-pbta/open-targets/v14/tcga-gene-counts-rsem-expected_count-collapsed.rds
md5sum data/v14/tcga-gene-counts-rsem-expected_count-collapsed.rds
2eadb7d9e83533537f600cc9b192204a data/v14/tcga-gene-counts-rsem-expected_count-collapsed.rds
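If you would rather verify the download from within R, here is a minimal sketch using `tools::md5sum()` from base R. The helper name `verify_md5` is an assumption for illustration; the demo runs on a temporary file so it does not need the real data, and for the actual file you would pass your local path and the checksum posted above.

```r
# Sketch only: compare a file's MD5 checksum against an expected value.
# `verify_md5` is a hypothetical helper, not part of OpenPedCan.
verify_md5 <- function(path, expected) {
  observed <- unname(tools::md5sum(path))  # NA if the file cannot be read
  isTRUE(observed == tolower(expected))
}

# Demo on a temporary file containing the three bytes "abc".
tmp <- tempfile()
writeBin(charToRaw("abc"), tmp)
ok <- verify_md5(tmp, "900150983cd24fb0d6963f7d28e17f72")
bad <- verify_md5(tmp, "00000000000000000000000000000000")
print(c(ok = ok, bad = bad))
```

For the uploaded file this would be `verify_md5("data/v14/tcga-gene-counts-rsem-expected_count-collapsed.rds", "2eadb7d9e83533537f600cc9b192204a")`.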
complete with #554
What data file(s) does this issue pertain to?
tcga-gene-counts-rsem-expected_count-collapsed.rds
What release are you using?
v14
Put your question or report your issue here.
Missing TCGA collapsed counts file in v14 release (not sure if this is intentional).