Closed sjspielman closed 1 year ago
Requested @jaclyn-taroni for review here mostly for purposes of strategizing ambiguous ids as needed!
This closes #1726.
Hm I'd just like to note, for the record, that this branch name, which I personally named, is....not how my name is spelled.
I'm confused by the difference between ambiguous and multi-mapping. From your description they both sound like multimapping, just of degree?
Yes, you've got it! Both are ambiguous situations, but I should have been clear to say that one flavor is ambiguity for mapping RNA-DNA (1:many situation), and the other flavor is, for example, which DNA id (if there are several) is being referred to. The latter is not necessary 1:many.
Either way, I think making the DNA/RNA columns be semicolon-separated lists of ids is the most straightforward, and accomplishing that with an appropriate group_by/summarize seems the simplest approach. But I could be missing something major here!
I don't disagree at all, but I have in my notes from yesterday's chat about this that we decided to avoid that strategy and instead have a separate column to indicate if the given single pair of RNA/DNA ids is a 1:1 or 1:many situation.
Maybe I'll go for a why-not-both.gif
situation - have semi-colon separated ids and a T/F column to indicate "ambiguous mapping".
Maybe I'll go for a
why-not-both.gif
situation - have semi-colon separated ids and a T/F column to indicate "ambiguous mapping".
As a move towards next steps, I went ahead and did this in https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1725/commits/42d7ee84feb110fdced906871471be3d57a50f7a. The output CSV is manageable at 8MB without compression, but can compress if we want a tiny file!
@jashapiro while working to continue cleaning this process up, I came across this neat situation which I did not anticipate, but glad I caught it:
The two RNA samples have "different" sexes - Male
and NA
. So, if we have these ids on the same row, then we have a conflict with this metadata variable. On one hand, we could defer to the non-NA
value here and proceed with Male
, but also I am wary about changing metadata here. But back to the first hand, we could document here that we use the non-NA
sex if this conflict occurs.
And on a third hand, it seems that there are only 2 sample id's like this! For such a limited scenario, maybe just using the non-NA
and explaining in README
is indeed the way to go.
tumor_sample_ids_df %>%
dplyr::select(sample_id, germline_sex_estimate) %>%
dplyr::distinct() %>%
dplyr::count(sample_id) %>%
dplyr::filter(n>1)
# A tibble: 2 x 2
sample_id n
<chr> <int>
1 7316-161 2
2 7316-85 2
This has now been overhauled and is ready for more 👀 ! Initial maf, fusion, & cnv wrangling has been heavily simplified for tasks at hand. The final output CSV has columns as described in this comment: https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1725#pullrequestreview-1363810734. Importantly, this final export ends up with one row per biospecimen ID which makes the ambiguous ID situation mapping moot. I wondered for a while if I was missing something major for this outcome to occur, but it seems reasonable to me...am I missing something?
I was not expecting the DNA and RNA alterations in separate rows. (To be fair, I didn't say anything about IDs in https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1725#pullrequestreview-1363810734.) For 1:1 mappings between assays, I would expect one row per sample_id
.
For 1:1 mappings between assays, I would expect one row per
sample_id
.
Going back to update code accordingly to have 1 row where possible, one item occurred to me as a potential reason to maybe not do that that I'd like to hear thoughts on! The review itself asks for much more metadata to be part of this export, including whether certain experimental modalities are available (column for has_RNAseq
, has_WGS
, etc.). These kinds of columns will end up doubling down on some ambiguity - e.g. sure WGS might be available, but which biospecimen is it? Having 1 row for each biospecimen locks down this particular metadata item, and I think it gives more clarity as to which biospecimen is each type of experiment.
Either way, I'm working now towards 1-row-per either way, and this is something to think about!
@jashapiro I'm making edits to this and will tag you when it is ready for a look
Okay, we now have a CSV file with the following columns:
sample_id |
Kids_First_Biospecimen_ID_DNA |
Kids_First_Biospecimen_ID_RNA |
multiple_assays_within_type |
sample_type |
composition |
germline_sex_estimate |
broad_histology |
cancer_group |
ACVR1 |
... | ZIC1 |
---|
Where:
sample_id
-composition
pair – more on the "almost" in a second.Kids_First_Biospecimen_ID_DNA
are all DNA assay biospecimen IDs associated with the sample_id
-composition
pair in that row, separated by ;
.Kids_First_Biospecimen_ID_RNA
are all RNA assay biospecimen IDs associated with the sample_id
-composition
pair in that row, separated by ;
.multiple_assays_within_type
indicates whether there is more than one DNA or more than one RNA assay for that sample_id
-composition
pair.;
. If no alterations were detected given the available assays, it says "None."You can see how I dealt with the issue mentioned in https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1725#issuecomment-1489204244 here: https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1725/files#diff-7eed238487c7c49c705f8ab849980cf0183d136fd84054fc833cdbe656cc14e7R365-R375
One sample_id
-composition
pair has multiple cancer groups / molecular subtype, resulting in 1075 rows:
> histologies_df %>% filter(sample_id == "7316-3230") %>% select(sample_id, Kids_First_Biospecimen_ID, composition, broad_histology,cancer_group, molecular_subtype)
# A tibble: 4 × 6
sample_id Kids_First_Biospecimen_ID composition broad_histology cancer_group molecular_subtype
<chr> <chr> <chr> <chr> <chr> <chr>
1 7316-3230 BS_5968GBGT Solid Tissue Diffuse astrocytic and oligodendroglial tumor Diffuse midline glioma DMG, H3 K28, TP53 loss
2 7316-3230 BS_AF5D41PD Solid Tissue Diffuse astrocytic and oligodendroglial tumor Diffuse midline glioma DMG, H3 K28, TP53 loss
3 7316-3230 BS_EE73VE7V Solid Tissue Diffuse astrocytic and oligodendroglial tumor High-grade glioma astrocytoma HGG, H3 wildtype, TP53 loss
4 7316-3230 BS_HYKV2TH9 Solid Tissue Diffuse astrocytic and oligodendroglial tumor Diffuse midline glioma DMG, H3 K28, TP53 loss
The instance with HGG, H3 wildtype, TP53 loss
reports a different primary site. Not really sure what to do about this issue here. cc: @jharenza
@jashapiro this is ready for a look. There an issue with one sample I noted here: https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1725#issuecomment-1492444139
Shooting from the hip – I suspect the missing sample IDs have something to do with the copy number alteration data because I believe all of those samples are assayed with WXS. We only have CNV data for WGS samples, if I recall correctly.
Now having looked, I'm reasonably sure it's the handling of NA
germline sex estimate for WXS samples.
I made some changes in https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1725/commits/d13c5086ff6a8e85e0e6b8c82fa0f69e75f79466. Please take another look @sjspielman!
One sample_id-composition pair has multiple cancer groups / molecular subtype, resulting in 1075 rows:
histologies_df %>% filter(sample_id == "7316-3230") %>% select(sample_id, Kids_First_Biospecimen_ID, composition, broad_histology,cancer_group, molecular_subtype)
A tibble: 4 × 6
sample_id Kids_First_Biospecimen_ID composition broad_histology cancer_group molecular_subtype
1 7316-3230 BS_5968GBGT Solid Tissue Diffuse astrocytic and oligodendroglial tumor Diffuse midline glioma DMG, H3 K28, TP53 loss 2 7316-3230 BS_AF5D41PD Solid Tissue Diffuse astrocytic and oligodendroglial tumor Diffuse midline glioma DMG, H3 K28, TP53 loss 3 7316-3230 BS_EE73VE7V Solid Tissue Diffuse astrocytic and oligodendroglial tumor High-grade glioma astrocytoma HGG, H3 wildtype, TP53 loss 4 7316-3230 BS_HYKV2TH9 Solid Tissue Diffuse astrocytic and oligodendroglial tumor Diffuse midline glioma DMG, H3 K28, TP53 loss The instance with HGG, H3 wildtype, TP53 loss reports a different primary site. Not really sure what to do about this issue here. cc: @jharenza
This was the one sample in which VarDict was the only algorithm which called H3K28M - it was definitely real, and prompted us to use clinical reports of H3 status (below), but this sample did not have a post-mortem clinical report and thus, it was decided we would not change this one to match the other samples taken at the same time (I think). I think it would be fine to merge all of the DNA for this sample into one field and use the combination of molecular alterations (eg if one sample has ABC and one has BCD, then the resulting alterations would be ABCD).
Here is something we need to solidify - Pictured are the rows whose given sample_id
appears in more than one row in the final_alteration_df
. A single sample can be in multiple rows if there is a different composition associated, which seems fair to me. But, have a look at the multiple_assays_within_type
column, and let's consider the two cases:
7316-1746
: Has TRUE
in this column for both associated rows, even though one of the compositions does not have more than 1 assay7316-1769
: Has FALSE
in this column for both associated rows. So, how do we want to consider if there is >1 assay? Is this at the level of sample_id
, or joint sample_id
and composition
.
Also we need to have a specific look at one of those samples "7316-3230"
which has, excitingly, different cancer groups, and therefore ends up as an "incorrectly duplicated" row.
From histologies
:
Not really sure what to make of this one in general...
Also we need to have a specific look at one of those samples
"7316-3230"
which has, excitingly, different cancer groups, and therefore ends up as an "incorrectly duplicated" row.From
histologies
:Not really sure what to make of this one in general...
Are you seeing this after the changes in https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1725/commits/b0f9d2881ef41954bf87644179d7ded4dc068a04?
Edit: quote reply is having trouble 🤔
So, how do we want to consider if there is >1 assay? Is this at the level of sample_id, or joint sample_id and composition
Thanks for catching this! Should be fixed in https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1725/commits/ed6f28fbb11b089db138aabe51a6c3a605108dfd
Ah, I had missed https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1725/commits/b0f9d2881ef41954bf87644179d7ded4dc068a04! :tada: it's not there, success :)
I can't approve this as it is "my PR", but I think we're there, at least as far as I can see!
Part of #1692
This PR begins (code needs help/feedback during review!) the process of getting a molecular alterations csv. At this point, I decided to file for feedback since my brain keeps getting twisted thinking of the right way to join back with DNA and RNA ids for the ambiguous samples.
The new code involves:
tables/tabulate-molecular-alterations.R
to generate this CSV filetables/util/functions-molecular-alterations.R
with some extremely context-specific functions, as indicated in comments in that file.The
tables/tabulate-molecular-alterations.R
script takes the following steps until we get to:Step 4 is either fairly straightforward, or very not!
export_identifiable_ids
which is an export-ready wide data frame for non-ambiugous ids that have 1:1 DNA:RNA.One potential approach that I was thinking of is to just add these
sample_id
s and alterations into the main export df (calledexport_identifiable_ids
at the moment), but keep their biospecimen id values asNA
. Then, we could have a separate file to indicate what those ids might be, with columns:sample_id
Kids_First_Biospecimen_ID
experimental_strategy
I don't think this approach will make a downstream user's life easier, but I also think they're (we're?) just in for a hard time anyways with these samples...