d3b-center / OpenPedCan-analysis

The analysis repository for the Open Pediatric Cancer Project
https://d3b-center.github.io/OpenPedCan-analysis/

v15 data release - update TCGA and GTEX count and TPM matrices #551

Closed jharenza closed 4 months ago

jharenza commented 4 months ago

Which new datasets are being added with this release?

What is the sample breakdown (number of WGS, WXS, RNA-Seq, Panel, Methylation, other)?

Same as v14

What module(s) generated any new files to include in the release? Has that module been added to the analysis/README.md, and to CI?

NA

Are you aware of any modules impacted by the file(s) change(s)? Describe if the file name is changed.

No

What data file(s) are added/updated/removed in this release?

The GTEx and TCGA count and TPM matrices will be updated.

[Pre-release files]

[Commit files]

[Bed files and sample mapping]

[File descriptions and notes]

Any additional notes to add for discussion?

After discussions between @komalsrathi, Diskin Lab, @migbro, and @zhangb1, we found a bug in the collapse-rnaseq step that runs after the GTEx and TCGA liftover: some genes are missing from the collapsed count and TPM matrices. For example, here are the rows matching "CD99" in the pre-collapse counts file for TCGA:

> tcga_counts_not_collapsed %>%
+   filter(grepl("CD99", gene_name)) %>%
+   select(gene_name, `TCGA-02-0047-01A-01R-1849-01`, `TCGA-02-0055-01A-01R-1849-01`)
# A tibble: 5 × 3
  gene_name `TCGA-02-0047-01A-01R-1849-01` `TCGA-02-0055-01A-01R-1849-01`
  <chr>                              <dbl>                          <dbl>
1 CD99L2                              6159                           5116
2 CD99P1                                 0                              0
3 CD99                                   0                              0
4 CD99P1                                96                            317
5 CD99                               23661                          36155

After the collapse, however, only CD99L2 remains; CD99 and CD99P1 are dropped even though each has a row with non-zero counts.
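
A check along the following lines could list every gene that has signal before the collapse but is absent afterwards. It is only a sketch in base R: the pre-collapse table is rebuilt from the five rows shown above, and `tcga_counts_collapsed` is a hypothetical stand-in that reproduces the reported symptom, not the real collapsed file.

```r
# Toy pre-collapse table, values copied from the console output above.
tcga_counts_not_collapsed <- data.frame(
  gene_name = c("CD99L2", "CD99P1", "CD99", "CD99P1", "CD99"),
  `TCGA-02-0047-01A-01R-1849-01` = c(6159, 0, 0, 96, 23661),
  `TCGA-02-0055-01A-01R-1849-01` = c(5116, 0, 0, 317, 36155),
  check.names = FALSE
)

# Hypothetical collapsed table (assumption): only CD99L2 survived.
tcga_counts_collapsed <-
  tcga_counts_not_collapsed[tcga_counts_not_collapsed$gene_name == "CD99L2", ]

# Genes with at least one non-zero value in any pre-collapse row.
sample_cols <- setdiff(names(tcga_counts_not_collapsed), "gene_name")
has_signal <- rowSums(tcga_counts_not_collapsed[, sample_cols, drop = FALSE]) > 0
expressed_genes <- unique(tcga_counts_not_collapsed$gene_name[has_signal])

# Expressed genes that no longer appear after the collapse.
missing_genes <- setdiff(expressed_genes, tcga_counts_collapsed$gene_name)
print(missing_genes)
```

On this toy input the check reports CD99P1 and CD99, matching the genes described as lost.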

We would like to update these four files (the GTEx and TCGA count and TPM matrices) so that they contain every gene with non-zero values.
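
To illustrate the intended outcome, here is a minimal base R sketch of one way to collapse duplicated gene symbols without losing expressed genes. The rule used (keep the row with the highest mean per symbol, then drop symbols whose best row is all zero) is an assumption made for illustration, and `collapse_by_max_mean` is a hypothetical helper, not the collapse-rnaseq module's code.

```r
# Toy input, values copied from the pre-collapse TCGA example above.
counts <- data.frame(
  gene_name = c("CD99L2", "CD99P1", "CD99", "CD99P1", "CD99"),
  `TCGA-02-0047-01A-01R-1849-01` = c(6159, 0, 0, 96, 23661),
  `TCGA-02-0055-01A-01R-1849-01` = c(5116, 0, 0, 317, 36155),
  check.names = FALSE
)

# Hypothetical helper: one row per gene symbol, chosen by highest row mean.
collapse_by_max_mean <- function(df, gene_col = "gene_name") {
  sample_cols <- setdiff(names(df), gene_col)
  row_mean <- rowSums(df[, sample_cols, drop = FALSE]) / length(sample_cols)

  # Sort rows by descending mean so the first hit per symbol is its best row.
  idx <- order(-row_mean)
  ordered <- df[idx, , drop = FALSE]
  ordered_mean <- row_mean[idx]

  # Keep the first row per symbol, and only if it carries any signal.
  keep <- !duplicated(ordered[[gene_col]]) & ordered_mean > 0
  collapsed <- ordered[keep, , drop = FALSE]
  rownames(collapsed) <- NULL
  collapsed
}

collapsed <- collapse_by_max_mean(counts)
print(collapsed)
```

With this rule, CD99 and CD99P1 are each represented by their non-zero row and CD99L2 is kept unchanged, so no expressed gene disappears from the toy matrix.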

jharenza commented 4 months ago

Completed with #554.