broadinstitute / depmap_omics

What you need to process the Quarterly DepMap-Omics releases from Terra
https://depmap.org/portal/

Duplicate RNA-seq omics profiles for cell lines #167

mjsteinbaugh closed this issue 1 year ago

mjsteinbaugh commented 1 year ago

Hi DepMap team,

I'm trying to load transcript-level RNA-seq expression data from the OmicsExpressionTranscriptsTPMLogp1Profile.csv file, but I'm having trouble uniquely resolving the profile identifiers (e.g. "PR-lqUArB", "PR-pOBrMJ") to model identifiers (e.g. "ACH-000029") using the OmicsProfiles.csv file. In the 23q2 release, 16 cell lines each have two RNA profiles mapped to the same model.

Is there a rational way to pick between the duplicate profiles for the 16 cell lines?

Here's a working example (in R) with more details:

library(dplyr)
library(pipette)
df <-
    import(
        con = "https://figshare.com/ndownloader/files/40449635",
        format = "csv"
    ) |>
    filter(Datatype == "rna") |>
    arrange(ModelID)
dupes <- sort(df[["ModelID"]][duplicated(df[["ModelID"]])])
print(dupes)
##  [1] "ACH-000029" "ACH-000095" "ACH-000143" "ACH-000206" "ACH-000328"
##  [6] "ACH-000337" "ACH-000455" "ACH-000468" "ACH-000517" "ACH-000532"
## [11] "ACH-000556" "ACH-000597" "ACH-000700" "ACH-000931" "ACH-000975"
## [16] "ACH-001192"
df <- df[df[["ModelID"]] %in% dupes, ]
print(df)
##      ProfileID ModelConditionID    ModelID Datatype WESKit
## 28   PR-lqUArB   MC-000029-BMZc ACH-000029      rna   <NA>
## 29   PR-pOBrMJ   MC-000029-BMZc ACH-000029      rna   <NA>
## 94   PR-6E5fvI   MC-000095-UcYl ACH-000095      rna   <NA>
## 95   PR-9bHyjI   MC-000095-UcYl ACH-000095      rna   <NA>
## 143  PR-dlwhbG   MC-000143-xMKb ACH-000143      rna   <NA>
## 144  PR-eLOZCF   MC-000143-xMKb ACH-000143      rna   <NA>
## 207  PR-by8s63   MC-000206-Jmpg ACH-000206      rna   <NA>
## 208  PR-xissjH   MC-000206-Jmpg ACH-000206      rna   <NA>
## 328  PR-DjTYZp   MC-000328-gA4f ACH-000328      rna   <NA>
## 329  PR-S409MD   MC-000328-gA4f ACH-000328      rna   <NA>
## 338  PR-ZJC2Tm   MC-000337-VmHG ACH-000337      rna   <NA>
## 339  PR-zvd6KC   MC-000337-VmHG ACH-000337      rna   <NA>
## 456  PR-HCodtv   MC-000455-QvVM ACH-000455      rna   <NA>
## 457  PR-JWn3XA   MC-000455-QvVM ACH-000455      rna   <NA>
## 470  PR-8iDtve   MC-000468-c6hY ACH-000468      rna   <NA>
## 471  PR-qf7nCW   MC-000468-c6hY ACH-000468      rna   <NA>
## 518  PR-aOml9R   MC-000517-kcbL ACH-000517      rna   <NA>
## 519  PR-i9CVhO   MC-000517-kcbL ACH-000517      rna   <NA>
## 534  PR-Q8g8M0   MC-000532-NN9r ACH-000532      rna   <NA>
## 535  PR-t6ctGM   MC-000532-NN9r ACH-000532      rna   <NA>
## 559  PR-1hnFd4   MC-000556-YK2Z ACH-000556      rna   <NA>
## 560  PR-Iug0GM   MC-000556-YK2Z ACH-000556      rna   <NA>
## 601  PR-3ATZmJ   MC-000597-RDyO ACH-000597      rna   <NA>
## 602  PR-w00MaJ   MC-000597-RDyO ACH-000597      rna   <NA>
## 704  PR-hs3wNI   MC-000700-mndS ACH-000700      rna   <NA>
## 705  PR-uQ6qid   MC-000700-mndS ACH-000700      rna   <NA>
## 934  PR-CHQ9Av   MC-000931-3a7D ACH-000931      rna   <NA>
## 935  PR-k0J8JP   MC-000931-3a7D ACH-000931      rna   <NA>
## 979  PR-eCRyEu   MC-000975-PUD5 ACH-000975      rna   <NA>
## 980  PR-s28NRl   MC-000975-PUD5 ACH-000975      rna   <NA>
## 1056 PR-RJkk8B   MC-001192-OhhV ACH-001192      rna   <NA>
## 1057 PR-V4rEyG   MC-001192-OhhV ACH-001192      rna   <NA>

Best, Mike

5im1z commented 1 year ago

Hi Mike,

In OmicsDefaultModelProfiles.csv, you can find information on which PR-ids we select to represent a model when there are multiple profiles available for the same model.

Thanks, Simone
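
For readers landing here, a minimal sketch of how that lookup could be applied. It uses toy tables with made-up identifiers (the "PR-toy..." and "ACH-TOY..." values are not real DepMap IDs), and it assumes OmicsDefaultModelProfiles.csv has ModelID, ProfileID and ProfileType columns; check the header of the actual file before relying on those names.

```r
# Toy stand-in for OmicsProfiles.csv (all IDs are made up for illustration).
profiles <- data.frame(
    ProfileID = c("PR-toy001", "PR-toy002", "PR-toy003"),
    ModelID = c("ACH-TOY001", "ACH-TOY001", "ACH-TOY002"),
    Datatype = c("rna", "rna", "rna"),
    stringsAsFactors = FALSE
)

# Toy stand-in for OmicsDefaultModelProfiles.csv.
# Assumption: the real file has these three column names.
defaults <- data.frame(
    ModelID = c("ACH-TOY001", "ACH-TOY002"),
    ProfileID = c("PR-toy002", "PR-toy003"),
    ProfileType = c("rna", "rna"),
    stringsAsFactors = FALSE
)

# Keep only the RNA profile chosen as the default for each model.
rna_defaults <- defaults[defaults[["ProfileType"]] == "rna", ]
keep <- profiles[profiles[["ProfileID"]] %in% rna_defaults[["ProfileID"]], ]

# After filtering, every model should map to exactly one profile.
stopifnot(!anyDuplicated(keep[["ModelID"]]))

# Toy expression matrix keyed by ProfileID (values are arbitrary).
expr <- matrix(
    data = c(1.2, 0.8, 2.5, 0, 0.3, 1.1),
    nrow = 3,
    dimnames = list(profiles[["ProfileID"]], c("tx1", "tx2"))
)

# Subset to the default profiles, then relabel the rows by ModelID.
expr <- expr[keep[["ProfileID"]], , drop = FALSE]
rownames(expr) <- keep[["ModelID"]]
print(expr)
```

With the real files, `profiles` and `defaults` would instead be read from the release CSVs, and `expr` from the profile-level expression file.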

mjsteinbaugh commented 1 year ago

Oh amazing, thanks Simone!