d3b-center / OpenPedCan-analysis

The analysis repository for the Open Pediatric Cancer Project
https://d3b-center.github.io/OpenPedCan-analysis/

Missing TCGA counts from v14 #549

Closed komalsrathi closed 4 months ago

komalsrathi commented 5 months ago

What data file(s) does this issue pertain to?

tcga-gene-counts-rsem-expected_count-collapsed.rds

What release are you using?

v14

Put your question or report your issue here.

Missing TCGA collapsed counts file in v14 release (not sure if this is intentional).

jharenza commented 5 months ago

For more context, it looks like gene-counts-rsem-expected_count-collapsed.rds contained TCGA counts in v12 and the TCGA cohort was not pulled out into a separate count file for v13-v14. Unintentionally missed.

komalsrathi commented 4 months ago

I don't know if my local file is messed up but I loaded the v12 file and can't find TCGA counts in the gene-counts-rsem-expected_count-collapsed.rds file:

> all_counts_v12 = readRDS("~/Projects/OpenPedCan-analysis/data/v12/gene-counts-rsem-expected_count-collapsed.rds")
> grep("TCGA", colnames(all_counts_v12))
integer(0)

komalsrathi commented 4 months ago

Also, any idea why there are ENSG IDs in the rows of the v14 TCGA TPM data? This is supposed to be a collapsed file with gene names as rownames, right?

> library(magrittr)  # provides the %>% pipe
> tpm_dat = readRDS("~/Projects/OpenPedCan-analysis/data/v14/tcga-gene-expression-rsem-tpm-collapsed.rds")
> grep("^ENSG", rownames(tpm_dat)) %>% length()
[1] 15692

jharenza commented 4 months ago

> I don't know if my local file is messed up but I loaded the v12 file and can't find TCGA counts in the gene-counts-rsem-expected_count-collapsed.rds file:
>
> all_counts_v12 = readRDS("~/Projects/OpenPedCan-analysis/data/v12/gene-counts-rsem-expected_count-collapsed.rds")
> grep("TCGA", colnames(all_counts_v12))
> integer(0)

I am not sure these were ever in the combined counts matrix actually, since we had separate TPM files for them.

> Also, any idea why there are ENSG IDs in the rows of the v14 TCGA TPM data? This is supposed to be a collapsed file with gene names as rownames, right?
>
> tpm_dat = readRDS("~/Projects/OpenPedCan-analysis/data/v14/tcga-gene-expression-rsem-tpm-collapsed.rds")
> grep("^ENSG", rownames(tpm_dat)) %>% length()
> [1] 15692

This is expected: those former pseudogenes now all have ENSG IDs as their official symbols, I suppose until they are renamed.

komalsrathi commented 4 months ago

> it looks like gene-counts-rsem-expected_count-collapsed.rds contained TCGA counts in v12

https://github.com/d3b-center/OpenPedCan-analysis/issues/549#issuecomment-1969615838 was in response to your comment above. I thought you meant that the gene- file contained TCGA counts in the v12 release, and I was not able to find TCGA samples in that file.

jharenza commented 4 months ago

> it looks like gene-counts-rsem-expected_count-collapsed.rds contained TCGA counts in v12
>
> #549 (comment) was in response to your comment above. I thought you meant that the gene- file contained TCGA counts in the v12 release, and I was not able to find TCGA samples in that file.

I think my comment was wrong and we never included them in any release.

komalsrathi commented 4 months ago

The last TCGA collapsed counts file tcga-gene-counts-rsem-expected_count-collapsed.rds is in v11/ release (no collapsed file in v12-v14; and we have been using v11 release to get TCGA data for tumor boards). But I think there was no liftover done for that release, right? Sample sizes also look different between v11 and v12 TCGA matrices.

Wondering if there is an s3 location I could get this file from in the meantime?

komalsrathi commented 4 months ago

Below is a comparison of the TPM data between v11 and v12-v14 (there are no counts for v12-v14, so no comparison can be done for counts). As you can see, the numbers of samples and genes differ between v11 and v12-v14.

  1. Were samples removed for some reason?
  2. Is the number of genes different because of lifting over to Gencode v39? That could also explain the introduction of ENSG ids as gene names. Just want to confirm.

> library(magrittr)  # provides the %>% pipe

# v11
> tpm_v11 = readRDS("~/Projects/OpenPedCan-analysis/data/v11/tcga-gene-expression-rsem-tpm-collapsed.rds")
> dim(tpm_v11)
[1] 59427 11123
> grep("ENSG", rownames(tpm_v11)) %>% length()
[1] 0

# v12
> tpm_v12 = readRDS("~/Projects/OpenPedCan-analysis/data/v12/tcga-gene-expression-rsem-tpm-collapsed.rds")
> dim(tpm_v12)
[1] 54320 10411
> grep("ENSG", rownames(tpm_v12)) %>% length()
[1] 15692

# v13
> tpm_v13 = readRDS("~/Projects/OpenPedCan-analysis/data/v13/tcga-gene-expression-rsem-tpm-collapsed.rds")
> dim(tpm_v13)
[1] 54320 10411
> grep("ENSG", rownames(tpm_v13)) %>% length()
[1] 15692

# v14
> tpm_v14 = readRDS("~/Projects/OpenPedCan-analysis/data/v14/tcga-gene-expression-rsem-tpm-collapsed.rds")
> dim(tpm_v14)
[1] 54320 10411
> grep("ENSG", rownames(tpm_v14)) %>% length()
[1] 15692

Any ideas @migbro @zhangb1 ?

komalsrathi commented 4 months ago

Update: The TCGA counts file was missed earlier so Bo/Miguel are working on regenerating the liftover + collapsed TCGA counts file.

Regarding the sample size discrepancy: the TPM task output https://cavatica.sbgenomics.com/u/d3b-bixu-ops/open-target-tcga-rnaseq-counts/tasks/cd8a31d1-71d4-4098-b798-e2e2d01dea5f/ has all the samples from v11, so I am not sure when or where some of the samples were filtered out before the files were uploaded to the s3 bucket for v12-v14.

jharenza commented 4 months ago

@komalsrathi is it possible clinical data is only available for the ones in v14 and that's why they were removed?

komalsrathi commented 4 months ago

Yes, after checking v11, it makes sense. Even though the samples were present in the v11 matrices, v11 didn't have histology information for those 712 samples either. Maybe that's why the decision was made to remove them from the TPM/counts matrices in later releases.
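
A minimal sketch of this kind of check in R, using toy data rather than the real release files. The object names, the `sample_id` column, and the sample IDs are all hypothetical stand-ins; the real histology table may use a different identifier column.

```r
# Toy stand-ins for the real inputs; all names and values here are hypothetical.
# tpm_v11: expression matrix with genes as rows and samples as columns.
tpm_v11 <- matrix(
  runif(8), nrow = 2,
  dimnames = list(c("GENE1", "GENE2"), c("TCGA-A", "TCGA-B", "TCGA-C", "TCGA-D"))
)

# histologies: one row per sample that has clinical/histology annotation.
# The `sample_id` column name is an assumption for illustration.
histologies <- data.frame(
  sample_id = c("TCGA-A", "TCGA-C"),
  cancer_group = c("X", "Y"),
  stringsAsFactors = FALSE
)

# Samples present in the matrix but absent from the histology table.
no_histology <- setdiff(colnames(tpm_v11), histologies$sample_id)

# Keep only the annotated samples, preserving the matrix column order.
keep <- colnames(tpm_v11) %in% histologies$sample_id
tpm_annotated <- tpm_v11[, keep, drop = FALSE]
```

On the real data, `length(no_histology)` would be the number to compare against the 712 samples mentioned above.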

komalsrathi commented 4 months ago

Now, everything makes sense after subsetting the samples except the number of genes.

The liftover + collapsed + subsetted matrices have the following dimensions:

[1] 59287 10411

The current v14 TPM has the following dimensions:

[1] 54320 10411

I am not sure what filters were applied at the gene-level but maybe it is best to just subset to the genes in the current v14 TPM file. I'll upload to s3 and update once done.
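
A rough sketch of that subsetting step in R, on toy matrices. The object names and gene/sample labels are hypothetical, and it assumes every sample in the TPM matrix is also a column of the counts matrix; it is not necessarily how the uploaded file was produced.

```r
# Hypothetical toy matrices: genes as rows, samples as columns.
counts_lifted <- matrix(
  1:12, nrow = 4,
  dimnames = list(
    c("A1BG", "TP53", "ENSG00000000001", "OLDGENE"),
    c("S1", "S2", "S3")
  )
)
tpm_v14 <- matrix(
  runif(9), nrow = 3,
  dimnames = list(
    c("TP53", "A1BG", "ENSG00000000001"),
    c("S1", "S2", "S3")
  )
)

# Genes present in both matrices, in the row order of the TPM matrix.
shared_genes <- intersect(rownames(tpm_v14), rownames(counts_lifted))

# Subset and reorder the counts so rows and columns line up with the TPM matrix.
counts_subset <- counts_lifted[shared_genes, colnames(tpm_v14), drop = FALSE]

# Genes without a counterpart on the other side, worth reviewing before upload.
only_in_counts <- setdiff(rownames(counts_lifted), rownames(tpm_v14))
only_in_tpm <- setdiff(rownames(tpm_v14), rownames(counts_lifted))
```

Checking `only_in_tpm` matters here: if it is non-empty, the subsetted counts would still have fewer genes than the TPM file.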

komalsrathi commented 4 months ago

File uploaded to: s3://d3b-openaccess-us-east-1-prd-pbta/open-targets/v14/tcga-gene-counts-rsem-expected_count-collapsed.rds

md5sum data/v14/tcga-gene-counts-rsem-expected_count-collapsed.rds 
2eadb7d9e83533537f600cc9b192204a  data/v14/tcga-gene-counts-rsem-expected_count-collapsed.rds
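
For anyone downloading the file, a small sketch of the same checksum comparison from within R, using `tools::md5sum` from the standard `tools` package. The helper name `verify_md5` is hypothetical, and the demonstration runs on a temporary file rather than the real release file.

```r
# Return TRUE when the file's MD5 checksum matches the expected hex string.
# `verify_md5` is a hypothetical helper name used only for this illustration.
verify_md5 <- function(path, expected) {
  observed <- unname(tools::md5sum(path))
  identical(tolower(observed), tolower(expected))
}

# Self-contained demonstration on a temporary file containing the bytes "abc".
demo_file <- tempfile()
writeBin(charToRaw("abc"), demo_file)
demo_ok <- verify_md5(demo_file, "900150983cd24fb0d6963f7d28e17f72")
```

For the uploaded file, the call would be `verify_md5("data/v14/tcga-gene-counts-rsem-expected_count-collapsed.rds", "2eadb7d9e83533537f600cc9b192204a")`, assuming the file sits at that local path.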

jharenza commented 4 months ago

complete with #554