Molecular alteration CSV for Zenodo upload

sjspielman commented 1 year ago

Part of #1692

This PR begins (code needs help/feedback during review!) the process of getting a molecular alterations csv. At this point, I decided to file for feedback since my brain keeps getting twisted thinking of the right way to join back with DNA and RNA ids for the ambiguous samples.

The new code involves:

A script tables/tabulate-molecular-alterations.R to generate this CSV file
A script tables/util/functions-molecular-alterations.R with some extremely context-specific functions, as indicated in comments in that file.

The tables/tabulate-molecular-alterations.R script takes the following steps until we get to:

Process maf, cnv, fusion data
- mirroring the approach in https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/627ec427ad0a8d9d913e614c9db50546c56d8283/analyses/oncoprint-landscape/01-map-to-sample_id.R.
Combine those processed df's into a big alteration df
Filter to genes present in the manuscript oncoplots, and create a presence/absence (with "Y"/"N") data frame for whether each gene/alteration is present in each of the 1074 tumors.
Map back to biospecimen ids for export 🥴

Step 4 is either fairly straightforward, or very not!

I have created export_identifiable_ids which is an export-ready wide data frame for non-ambiugous ids that have 1:1 DNA:RNA.
But, I still need a solid strategy for dealing with the non 1:1 mappings. There's two types of problematic mappings:
- Where there are >1 RNA or DNA experiments, so we can't directly line up a DNA with RNA. This is a 1:many situation, like we discussed.
- Where there are >1 of a given modality - eg 2 RNAs, or even a case with 6 DNAs!. This is, well, just a plain "many."

One potential approach that I was thinking of is to just add these sample_ids and alterations into the main export df (called export_identifiable_ids at the moment), but keep their biospecimen id values as NA . Then, we could have a separate file to indicate what those ids might be, with columns:

sample_id
Kids_First_Biospecimen_ID
experimental_strategy I don't think this approach will make a downstream user's life easier, but I also think they're (we're?) just in for a hard time anyways with these samples...

sjspielman commented 1 year ago

Requested @jaclyn-taroni for review here mostly for purposes of strategizing ambiguous ids as needed!

sjspielman commented 1 year ago

This closes #1726.

sjspielman commented 1 year ago

Hm I'd just like to note, for the record, that this branch name, which I personally named, is....not how my name is spelled.

sjspielman commented 1 year ago

I'm confused by the difference between ambiguous and multi-mapping. From your description they both sound like multimapping, just of degree?

Yes, you've got it! Both are ambiguous situations, but I should have been clear to say that one flavor is ambiguity for mapping RNA-DNA (1:many situation), and the other flavor is, for example, which DNA id (if there are several) is being referred to. The latter is not necessary 1:many.

Either way, I think making the DNA/RNA columns be semicolon-separated lists of ids is the most straightforward, and accomplishing that with an appropriate group_by/summarize seems the simplest approach. But I could be missing something major here!

I don't disagree at all, but I have in my notes from yesterday's chat about this that we decided to avoid that strategy and instead have a separate column to indicate if the given single pair of RNA/DNA ids is a 1:1 or 1:many situation. Maybe I'll go for a why-not-both.gif situation - have semi-colon separated ids and a T/F column to indicate "ambiguous mapping".

sjspielman commented 1 year ago

Maybe I'll go for a why-not-both.gif situation - have semi-colon separated ids and a T/F column to indicate "ambiguous mapping".

As a move towards next steps, I went ahead and did this in https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1725/commits/42d7ee84feb110fdced906871471be3d57a50f7a. The output CSV is manageable at 8MB without compression, but can compress if we want a tiny file!

sjspielman commented 1 year ago

@jashapiro while working to continue cleaning this process up, I came across this neat situation which I did not anticipate, but glad I caught it:

The two RNA samples have "different" sexes - Male and NA. So, if we have these ids on the same row, then we have a conflict with this metadata variable. On one hand, we could defer to the non-NA value here and proceed with Male, but also I am wary about changing metadata here. But back to the first hand, we could document here that we use the non-NA sex if this conflict occurs.

And on a third hand, it seems that there are only 2 sample id's like this! For such a limited scenario, maybe just using the non-NA and explaining in README is indeed the way to go.

tumor_sample_ids_df %>% 
  dplyr::select(sample_id, germline_sex_estimate) %>% 
  dplyr::distinct() %>%
  dplyr::count(sample_id) %>%
  dplyr::filter(n>1)
# A tibble: 2 x 2
  sample_id     n
  <chr>     <int>
1 7316-161      2
2 7316-85       2

sjspielman commented 1 year ago

This has now been overhauled and is ready for more 👀 ! Initial maf, fusion, & cnv wrangling has been heavily simplified for tasks at hand. The final output CSV has columns as described in this comment: https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1725#pullrequestreview-1363810734. Importantly, this final export ends up with one row per biospecimen ID which makes the ambiguous ID situation mapping moot. I wondered for a while if I was missing something major for this outcome to occur, but it seems reasonable to me...am I missing something?

jaclyn-taroni commented 1 year ago

I was not expecting the DNA and RNA alterations in separate rows. (To be fair, I didn't say anything about IDs in https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1725#pullrequestreview-1363810734.) For 1:1 mappings between assays, I would expect one row per sample_id.

sjspielman commented 1 year ago

For 1:1 mappings between assays, I would expect one row per sample_id.

Going back to update code accordingly to have 1 row where possible, one item occurred to me as a potential reason to maybe not do that that I'd like to hear thoughts on! The review itself asks for much more metadata to be part of this export, including whether certain experimental modalities are available (column for has_RNAseq, has_WGS, etc.). These kinds of columns will end up doubling down on some ambiguity - e.g. sure WGS might be available, but which biospecimen is it? Having 1 row for each biospecimen locks down this particular metadata item, and I think it gives more clarity as to which biospecimen is each type of experiment.

Either way, I'm working now towards 1-row-per either way, and this is something to think about!

jaclyn-taroni commented 1 year ago

@jashapiro I'm making edits to this and will tag you when it is ready for a look

jaclyn-taroni commented 1 year ago

Okay, we now have a CSV file with the following columns:

`sample_id`	`Kids_First_Biospecimen_ID_DNA`	`Kids_First_Biospecimen_ID_RNA`	`multiple_assays_within_type`	`sample_type`	`composition`	`germline_sex_estimate`	`broad_histology`	`cancer_group`	`ACVR1`	...	`ZIC1`

Where:

(Almost) each row is a unique sample_id-composition pair – more on the "almost" in a second.
Kids_First_Biospecimen_ID_DNA are all DNA assay biospecimen IDs associated with the sample_id-composition pair in that row, separated by ;.
Kids_First_Biospecimen_ID_RNA are all RNA assay biospecimen IDs associated with the sample_id-composition pair in that row, separated by ;.
multiple_assays_within_type indicates whether there is more than one DNA or more than one RNA assay for that sample_id-composition pair.
A gene column will have all alterations found in at least one biospecimen for each assay type (RNA and DNA). For multiple alterations, they are separated by ;. If no alterations were detected given the available assays, it says "None."

You can see how I dealt with the issue mentioned in https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1725#issuecomment-1489204244 here: https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1725/files#diff-7eed238487c7c49c705f8ab849980cf0183d136fd84054fc833cdbe656cc14e7R365-R375

One sample_id-composition pair has multiple cancer groups / molecular subtype, resulting in 1075 rows:

   > histologies_df %>% filter(sample_id == "7316-3230") %>% select(sample_id, Kids_First_Biospecimen_ID, composition, broad_histology,cancer_group, molecular_subtype)
# A tibble: 4 × 6
  sample_id Kids_First_Biospecimen_ID composition  broad_histology                               cancer_group                  molecular_subtype          
  <chr>     <chr>                     <chr>        <chr>                                         <chr>                         <chr>                      
1 7316-3230 BS_5968GBGT               Solid Tissue Diffuse astrocytic and oligodendroglial tumor Diffuse midline glioma        DMG, H3 K28, TP53 loss     
2 7316-3230 BS_AF5D41PD               Solid Tissue Diffuse astrocytic and oligodendroglial tumor Diffuse midline glioma        DMG, H3 K28, TP53 loss     
3 7316-3230 BS_EE73VE7V               Solid Tissue Diffuse astrocytic and oligodendroglial tumor High-grade glioma astrocytoma HGG, H3 wildtype, TP53 loss
4 7316-3230 BS_HYKV2TH9               Solid Tissue Diffuse astrocytic and oligodendroglial tumor Diffuse midline glioma        DMG, H3 K28, TP53 loss

The instance with HGG, H3 wildtype, TP53 loss reports a different primary site. Not really sure what to do about this issue here. cc: @jharenza

jaclyn-taroni commented 1 year ago

@jashapiro this is ready for a look. There an issue with one sample I noted here: https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1725#issuecomment-1492444139

jaclyn-taroni commented 1 year ago

Shooting from the hip – I suspect the missing sample IDs have something to do with the copy number alteration data because I believe all of those samples are assayed with WXS. We only have CNV data for WGS samples, if I recall correctly.

jaclyn-taroni commented 1 year ago

Now having looked, I'm reasonably sure it's the handling of NA germline sex estimate for WXS samples.

jaclyn-taroni commented 1 year ago

I made some changes in https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1725/commits/d13c5086ff6a8e85e0e6b8c82fa0f69e75f79466. Please take another look @sjspielman!

jharenza commented 1 year ago

One sample_id-composition pair has multiple cancer groups / molecular subtype, resulting in 1075 rows:

histologies_df %>% filter(sample_id == "7316-3230") %>% select(sample_id, Kids_First_Biospecimen_ID, composition, broad_histology,cancer_group, molecular_subtype)

A tibble: 4 × 6

sample_id Kids_First_Biospecimen_ID composition broad_histology cancer_group molecular_subtype
1 7316-3230 BS_5968GBGT Solid Tissue Diffuse astrocytic and oligodendroglial tumor Diffuse midline glioma DMG, H3 K28, TP53 loss 2 7316-3230 BS_AF5D41PD Solid Tissue Diffuse astrocytic and oligodendroglial tumor Diffuse midline glioma DMG, H3 K28, TP53 loss 3 7316-3230 BS_EE73VE7V Solid Tissue Diffuse astrocytic and oligodendroglial tumor High-grade glioma astrocytoma HGG, H3 wildtype, TP53 loss 4 7316-3230 BS_HYKV2TH9 Solid Tissue Diffuse astrocytic and oligodendroglial tumor Diffuse midline glioma DMG, H3 K28, TP53 loss The instance with HGG, H3 wildtype, TP53 loss reports a different primary site. Not really sure what to do about this issue here. cc: @jharenza

This was the one sample in which VarDict was the only algorithm which called H3K28M - it was definitely real, and prompted us to use clinical reports of H3 status (below), but this sample did not have a post-mortem clinical report and thus, it was decided we would not change this one to match the other samples taken at the same time (I think). I think it would be fine to merge all of the DNA for this sample into one field and use the combination of molecular alterations (eg if one sample has ABC and one has BCD, then the resulting alterations would be ABCD).

https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/cf6328ebd7c3bdffa18b4bd6b8d856713a8414aa/analyses/molecular-subtyping-pathology/03-incorporate-clinical-feedback.Rmd#L35-L36

sjspielman commented 1 year ago

Here is something we need to solidify - Pictured are the rows whose given sample_id appears in more than one row in the final_alteration_df. A single sample can be in multiple rows if there is a different composition associated, which seems fair to me. But, have a look at the multiple_assays_within_type column, and let's consider the two cases:

7316-1746: Has TRUE in this column for both associated rows, even though one of the compositions does not have more than 1 assay
7316-1769: Has FALSE in this column for both associated rows.

So, how do we want to consider if there is >1 assay? Is this at the level of sample_id, or joint sample_id and composition.

sjspielman commented 1 year ago

Also we need to have a specific look at one of those samples "7316-3230" which has, excitingly, different cancer groups, and therefore ends up as an "incorrectly duplicated" row.

From histologies:

Not really sure what to make of this one in general...

jaclyn-taroni commented 1 year ago

Also we need to have a specific look at one of those samples "7316-3230" which has, excitingly, different cancer groups, and therefore ends up as an "incorrectly duplicated" row.

From histologies:

Not really sure what to make of this one in general...

Are you seeing this after the changes in https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1725/commits/b0f9d2881ef41954bf87644179d7ded4dc068a04?

jaclyn-taroni commented 1 year ago

Edit: quote reply is having trouble 🤔

So, how do we want to consider if there is >1 assay? Is this at the level of sample_id, or joint sample_id and composition

Thanks for catching this! Should be fixed in https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1725/commits/ed6f28fbb11b089db138aabe51a6c3a605108dfd

sjspielman commented 1 year ago

Ah, I had missed https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1725/commits/b0f9d2881ef41954bf87644179d7ded4dc068a04! :tada: it's not there, success :)

sjspielman commented 1 year ago

I can't approve this as it is "my PR", but I think we're there, at least as far as I can see!

AlexsLemonade / OpenPBTA-analysis

Molecular alteration CSV for Zenodo upload #1725

A tibble: 4 × 6