d3b-center / ticket-tracker-OPC

A repo to generate and track tickets for ped OT

Remap TCGA expression matrix to GENCODE v39 #521

Closed jharenza closed 1 year ago

jharenza commented 1 year ago

What data file(s) does this issue pertain to?

tcga-gene-expression-rsem-tpm-collapsed.rds currently located here

What release are you using?

pre-v12 file

Put your question or report your issue here.

TCGA expression is currently on GENCODE v36, and 23,242 gene symbols in the TCGA matrix are not in the v39 expression matrix, while 20,979 gene symbols in the v39 expression matrix are not in the TCGA matrix. We need to remap these ENSG IDs from v36 to v39. This can be done one of two ways:

  1. Current matrix gene symbol --> ENSG v36 --> ENSG v39 --> gene symbol
  2. Do this during the merge, before the symbol-mapping step: merge expression on ENSG IDs v36 --> ENSG v39 --> HUGO symbol
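
For illustration, option 1 could be sketched roughly as below. This is a minimal sketch and not the script that was eventually used: the function names and the two lookup tables are hypothetical, the IDs and symbols are made up, and it assumes the unversioned ENSG ID can serve as the join key between v36 and v39. A symbol that maps to more than one ENSG ID would need extra handling, which is one reason option 2 may be cleaner.

```python
def strip_version(ensg_id):
    """Drop a trailing version suffix: 'ENSGTOY0001.3' -> 'ENSGTOY0001'."""
    return ensg_id.split(".", 1)[0]


def remap_symbols(symbols, symbol_to_ensg_v36, ensg_to_symbol_v39):
    """Follow symbol -> ENSG (v36) -> ENSG (v39) -> symbol for each input symbol.

    Returns (remapped, unmapped): a dict of old symbol -> v39 symbol, and a
    list of symbols that could not be followed through the whole chain.
    """
    remapped = {}
    unmapped = []
    for symbol in symbols:
        ensg = symbol_to_ensg_v36.get(symbol)
        new_symbol = None
        if ensg is not None:
            # Assumption: the unversioned ENSG ID is the join key between releases.
            new_symbol = ensg_to_symbol_v39.get(strip_version(ensg))
        if new_symbol is None:
            unmapped.append(symbol)
        else:
            remapped[symbol] = new_symbol
    return remapped, unmapped


# Toy lookup tables with made-up IDs and symbols (not real GENCODE records).
v36 = {"OLD1": "ENSGTOY0001.3", "KEEP": "ENSGTOY0002.1"}
v39 = {"ENSGTOY0001": "NEW1", "ENSGTOY0002": "KEEP"}
remapped, unmapped = remap_symbols(["OLD1", "KEEP", "GONE"], v36, v39)
```

Symbols that end up in `unmapped` would still need a decision (drop, or keep under the old name).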

In addition, I am seeing 712 samples in the TCGA merged matrix that are not in the histologies file (this is OK - maybe @ewafula could not find clinical info), but there are 3 samples in the histologies file that are not in the merged matrix:

"TCGA-DK-A6AW-01A-11R-A30C-07" "TCGA-E7-A97Q-01A-11R-A38B-07" "TCGA-W5-AA2R-01A-11R-A41I-07"

@zhangb1 if we don't have data for these, can you let me know - we can remove from histologies - @ewafula can prepare a new TCGA file without them.
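
The sample counts above come down to two set differences between the matrix column names and the histologies sample IDs. A minimal sketch of that check, assuming both are already loaded as lists of strings (the function name and the placeholder IDs `SAMPLE-SHARED` and `SAMPLE-MATRIX-ONLY` are hypothetical):

```python
def compare_sample_ids(matrix_samples, histology_samples):
    """Return (only_in_matrix, only_in_histologies) as sorted lists."""
    matrix_set = set(matrix_samples)
    histology_set = set(histology_samples)
    only_in_matrix = sorted(matrix_set - histology_set)
    only_in_histologies = sorted(histology_set - matrix_set)
    return only_in_matrix, only_in_histologies


# Placeholder inputs; only the TCGA ID below is taken from this ticket.
matrix_cols = ["SAMPLE-SHARED", "SAMPLE-MATRIX-ONLY"]
histology_ids = ["SAMPLE-SHARED", "TCGA-DK-A6AW-01A-11R-A30C-07"]
only_in_matrix, only_in_histologies = compare_sample_ids(matrix_cols, histology_ids)
```

The second list is the one that matters here, since those IDs would be removed from histologies if no data can be found for them.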

Who will complete this task?

@migbro @zhangb1 can you take point on this to determine the best path forward?

cc @chinwallaa @taylordm

zhangb1 commented 1 year ago

I tried this GDC query:

https://portal.gdc.cancer.gov/repository?filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.data_type%22%2C%22value%22%3A%5B%22Gene%20Expression%20Quantification%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.experimental_strategy%22%2C%22value%22%3A%5B%22RNA-Seq%22%5D%7D%7D%5D%7D&searchTableTab=files

and I couldn't find those 3 samples. Not sure whether they were deleted.

jharenza commented 1 year ago

@ewafula can you remove these 3 from the TCGA histologies file and PR to dbt repo?

migbro commented 1 year ago

Damn, wrong ticket, I meant that for GTEx. I'll delete that comment to reduce confusion.

migbro commented 1 year ago

Currently working on a script that will lift over the gene symbols and then collapse on gene symbols. After that, someone else will have to sort/remove any entries that don't appear in our PBTA/KF sets.
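
As a rough illustration of the collapse and filter steps, using only base Python: the thread does not say which collapse rule the script applies, so keeping the row with the highest mean per symbol is an assumption here, and the function names and toy values are hypothetical.

```python
from statistics import mean


def collapse_by_symbol(rows):
    """Collapse duplicate symbols, keeping the row with the highest mean.

    rows is an iterable of (symbol, values) pairs. On a tie, the first
    row seen for that symbol is kept.
    """
    best = {}
    for symbol, values in rows:
        if symbol not in best or mean(values) > mean(best[symbol]):
            best[symbol] = list(values)
    return best


def keep_symbols(collapsed, allowed_symbols):
    """Keep only symbols in allowed_symbols, in sorted order."""
    allowed = set(allowed_symbols)
    return {s: collapsed[s] for s in sorted(collapsed) if s in allowed}


# Toy expression rows (made-up values): symbol "A" appears twice after liftover.
rows = [("A", [1.0, 2.0]), ("A", [3.0, 4.0]), ("B", [0.5, 0.5])]
collapsed = collapse_by_symbol(rows)
filtered = keep_symbols(collapsed, ["A"])
```

Summing duplicate rows instead of picking one would be another reasonable rule; which is right depends on how the other matrices were collapsed.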

jharenza commented 1 year ago

perfect @migbro

migbro commented 1 year ago

@ewafula @chinwallaa, ok, I have written a tool that does what I described above. It's a Python script that uses only base packages. Where should I put it? Should I wrap it in a CWL tool?

jharenza commented 1 year ago

@migbro I wonder if we should hold on that until further QC - we are releasing all methylation but there are some which suggest mis-ID / sample swaps (while some may be real biology suggesting a different diagnosis). Or perhaps take only those whose classification does indeed match the diagnosis? Would need semi-manual curation though... maybe discuss with Adam?

ewafula commented 1 year ago

@migbro, I think @jharenza cross-posted here. The above message is meant for bixu ticket #1752.

As for your question above, @zhangb1, who generates the OPC gene expression matrices, can provide more details. I am assuming the tool goes to Cavatica for @zhangb1's team to use when generating the v12 TCGA/GTEx expression matrices.

@zhangb1?

migbro commented 1 year ago

TPM task here: https://cavatica.sbgenomics.com/u/d3b-bixu-ops/open-target-tcga-rnaseq-counts/tasks/cd8a31d1-71d4-4098-b798-e2e2d01dea5f/

migbro commented 1 year ago

@ewafula @chinwallaa was the liftover successful? If so, I'd like to close this and the GTEx tickets.

ewafula commented 1 year ago

@migbro, yes. Already using results with downstream modules. Thank you 🙏

migbro commented 1 year ago

Success! Closing