AlexsLemonade / OpenPBTA-analysis

The analysis repository for the Open Pediatric Brain Tumor Atlas Project

PR 2 of 2: Visualizing CNS mutational signature exposures #1227

Closed sjspielman closed 2 years ago

sjspielman commented 2 years ago

Purpose/implementation Section

What scientific question is your analysis addressing?

This is the second PR, with #1226, for updating the mutational-signatures module.

What was your approach?

A notebook was created to make and export (currently in PDF) three plots, and hopefully two of those can be used for main text.

What GitHub issue does your pull request address?

1220

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Do we feel at least two of these could go in the main text, or should further figure strategies be explored (e.g., codon mutations)?

Is there anything that you want to discuss further?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes

Results

What types of results are included (e.g., table, figure)?

Three figures, both in the Rmd and exported to PDF. There is also a small table in the notebook giving some metadata about the top 10 mutated samples, which may be used in main text depending on which figures we incorporate.

What is your summary of the results?

N/A

Reproducibility Checklist

Documentation Checklist

jaclyn-taroni commented 2 years ago

Before I review this, a general question: Is there a reason not to use some of the functions in analyses/mutational-signatures/util/mut_sig_functions.R like bubble_matrix_plot() for example?

sjspielman commented 2 years ago

The existing functions expect data formatted quite differently from how I had been setting it up. The bubble plot function consumes a specifically structured data frame from another function in that utils file, which I did not use, so I ended up writing my own code inspired by some aspects of utils.

Of course if preferred, I definitely can integrate the existing functions by revisiting some of my initial data processing, which will necessarily also involve updating those functions.

jaclyn-taroni commented 2 years ago

Going to summarize my takeaways from our Slack conversation earlier today:

With that being said, I'm probably going to wait to review this. If we didn't have the same takeaways, let me know!

sjspielman commented 2 years ago

Going to convert this to a draft pending feedback solicitation!

sjspielman commented 2 years ago

I have removed the barplot looking at exposures across short histology groups, and replaced with a new contender - a boxplot of exposure distributions across relevant Fig 3 cancer groups. For this figure, exposures are visualized for samples with non-zero exposures. The percentage shown above each individual boxplot is the percent of samples in that cancer group with non-zero weights for that signature. For example, 43% of craniopharyngiomas have signature 11.

Thoughts on this option?

Edit: I am aware the color mapping is buggy.

exposures_boxplot.pdf

jaclyn-taroni commented 2 years ago

@sjspielman - if 43% of craniopharyngiomas have non-zero values for signature 11, we'd expect then to plot the signature weight for n = (0.43 x craniopharyngioma samples), correct? If so, would we not benefit from showing the individual points given the low sample size?
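The arithmetic in this question can be made concrete with a small sketch. The module itself is written in R, so the Python below is only an illustration; the function name `nonzero_summary`, the record layout, and the toy values are all hypothetical and not taken from the notebook.

```python
from collections import defaultdict


def nonzero_summary(records):
    """Count samples with non-zero exposure per (cancer group, signature).

    `records` is an iterable of (cancer_group, signature, exposure) tuples,
    one per sample and signature. This layout is an assumption for the sketch.
    """
    totals = defaultdict(int)
    nonzero = defaultdict(int)
    for group, signature, exposure in records:
        key = (group, signature)
        totals[key] += 1
        if exposure > 0:
            nonzero[key] += 1
    return {
        key: {
            "n_total": totals[key],
            "n_nonzero": nonzero[key],
            "fraction_nonzero": nonzero[key] / totals[key],
        }
        for key in totals
    }


# Toy exposures (made up): 7 samples in one group, 3 of them non-zero.
toy = [("Craniopharyngioma", "11", w) for w in (0.0, 0.2, 0.0, 0.5, 0.0, 0.0, 0.1)]
summary = nonzero_summary(toy)[("Craniopharyngioma", "11")]
print(summary["n_nonzero"], summary["n_total"])
```

In this toy case only three points would feed the boxplot, which is the small-n situation the question raises.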

jharenza commented 2 years ago

Hi @sjspielman! Thanks so much for working on this 👍. I have a few suggestions for the figures.

For the bubble_plot.pdf, would it be possible to make the medium and large circles much bigger to be able to see clearer differences in the proportions? I do have a bit of a hard time visually interpreting 0.6 vs 0.8.

There are two ways to label the signatures, which can be seen here: https://signal.mutationalsignatures.com/explore/studyTissueType/1-6. I have labeled figures with the RefSig name, and not CNS_A, etc. Do we prefer one naming scheme to another?

I agree with the RefSig name here, but do you have any info on what MMR2 (vs MMR1) is and what N2 vs N6 are? They were not a part of the original RefSig set and this doesn't have an explanation.

I tend to like the stacked barplots within each cancer group for easier visual consumption, as in samples_barplot.pdf. Can we add a figure which plots these faceted by cancer group? It may be too busy, or go in the supplement, but may also point out some differences within cancer groups we can discuss.

In general, it may be helpful to order the axes by signature in numeric order 1 through N, then your MMR2 and N2, so they are in an easy-to-find order. I think I like the boxplots better than the bubble plot since the circles are hard to interpret as of now, so perhaps we do a group summary as the main takeaway (boxplot) and individual cancer summary. For the boxplot figure, can you add (N = ) to the facet label?

sjspielman commented 2 years ago

For the bubble_plot.pdf, would it be possible to make the medium and large circles much bigger to be able to see clearer differences in the proportions? I do have a bit of a hard time visually interpreting 0.6 vs 0.8.

Yes, this is a tricky aspect. Re-sizing the bubbles is something that I played around a lot with, and the setting you see here is about the most differentiated I could make. We also probably want to update the code here more generally, since I am currently not using existing functions to make the bubble plot but would like to clean that up once we solidify an overall set of plots we'd like to use in the first place.

I agree with the RefSig name here, but do you have any info on what MMR2 (vs MMR1) is and what N2 vs N6 are? They were not a part of the original RefSig set and this doesn't have an explanation.

These are signatures that were identified as part of the original RefSig set, but they may not be directly referenced within the manuscript. However, they are in the supplementary tables here

I tend to like the stacked barplots within each cancer group for easier visual consumption, as in samples_barplot.pdf. Can we add a figure which plots these faceted by cancer group? It may be too busy, or go in the supplement, but may also point out some differences within cancer groups we can discuss.

Yes, I can work on something like this as another option for the paper.

In general, it may be helpful to order the axes by signature in numeric order 1 through N, then your MMR2 and N2, so they are in an easy-to-find order.

The current order is based on "other" signature names. These signatures have two names within the database: those I put in the figures, and alphabetical A-H. So they are currently arranged in A-H order. If this isn't particularly meaningful, then we can definitely move those to the end.

I think I like the boxplots better than the bubble plot since the circles are hard to interpret as of now, so perhaps we do a group summary as the main takeaway (boxplot) and individual cancer summary.

I tend to agree.

For the boxplot figure, can you add (N = ) to the facet label? 👍

sjspielman commented 2 years ago

Replying to @jaclyn-taroni's comment..

if 43% of craniopharyngiomas have non-zero values for signature 11, we'd expect then to plot the signature weight for n = (0.43 x craniopharyngioma samples), correct? If so, would we not benefit from showing the individual points given the low sample size?

First, are we cool that I removed those 0's in the first place? Removing the "rug of 0's" definitely improves the look of the viz but could be misleading. Hence, I added the fraction as a possible middle ground, although with @jharenza's comment about adding N= to facet labels, I could also change these fractions to actual numbers.

Second, are you thinking of a sina/jitter overlaid on a boxplot (whose outliers are hidden)? I can also play around with a sina/jitter with an overlaid stat_summary (the default mean+/-se pointrange geom seems reasonable here).

sjspielman commented 2 years ago

To compare some different ideas, I updated the boxplot to instead be a sina plot with an overlaid IQR box (no whiskers), show N in the facet labels, and show all points instead of only the non-zero ones. The x-axis order wasn't updated (yet?) but can be as needed. Any thoughts on this "overall" version? https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/2117ec351a41df550ac14583bd33e4613e0c1613/analyses/mutational-signatures/plots/cns/exposures_boxplot.pdf

sjspielman commented 2 years ago

I've made a few more plots (draft concepts!! definitely not prime-time ready!!) to bring into discussion. Any and all thoughts on pursuing these strategies are welcome!

  1. A jitter plot of the mutation counts per Mb across samples.
  2. A stacked barplot of mutation counts per Mb across cancer groups. I do not think this approach works well, so I tried the next approach...
  3. Two versions of non-stacked plots of the median mutation counts per Mb across cancer groups, either showing all samples or only exposed samples. These plots also have error bars for IQR, but they are quite tightly bounded so error bars may not be useful here.
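The "all samples" versus "only exposed samples" summaries in the third option can be sketched as follows. This is an illustration in Python rather than the module's R code. It assumes "exposed" means a count greater than zero, and the function names and toy counts are hypothetical.

```python
import statistics


def median_iqr(values):
    """Return (median, q1, q3) for a list of numbers, or None if it is empty."""
    if not values:
        return None
    if len(values) == 1:
        return (values[0], values[0], values[0])
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (statistics.median(values), q1, q3)


def summarize(counts_per_mb, exposed_only=False):
    """Summarize mutation counts per Mb, optionally dropping zero counts."""
    if exposed_only:
        kept = [v for v in counts_per_mb if v > 0]
    else:
        kept = list(counts_per_mb)
    return median_iqr(kept)


toy_counts = [0, 0, 1, 2, 3, 4, 5]  # made-up values for one group and signature
all_samples = summarize(toy_counts)
exposed = summarize(toy_counts, exposed_only=True)
```

Dropping the zeros shifts both the median and the quartiles upward, which is why the two versions of the plot can tell different stories for the same cancer group.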

jaclyn-taroni commented 2 years ago

I like the sina plot with the IQR box best (https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/2117ec351a41df550ac14583bd33e4613e0c1613/analyses/mutational-signatures/plots/cns/exposures_boxplot.pdf). Showing the 0s seems like the move to me.

I do understand the impulse for the stacked barplot. I am a little worried that we're getting a bit too "far away" from the data, though.

sjspielman commented 2 years ago

I like the sina plot with the IQR box best (https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/2117ec351a41df550ac14583bd33e4613e0c1613/analyses/mutational-signatures/plots/cns/exposures_boxplot.pdf). Showing the 0s seems like the move to me.

I tend to agree here, and with this display I like the zeros. My one question about it at this point is whether it's appropriately consistent with the overall plotting style for PBTA as described - "For 2+ group comparisons, we will use violin or boxplots with jitter."

I do understand the impulse for the stacked barplot. I am a little worried that we're getting a bit too "far away" from the data, though.

For this particular data, I don't think stacked works well. But, we do want a figure in there showing mutation per mb or similar to accompany the sina/IQR figure about weights. The sina/IQR does a good job of ballparking proportion exposed, which the bubble plot does not do well (too hard to distinguish circle sizes, even under a variety of settings). We might consider the sina/IQR and a barplot of counts then for only exposed, aka the second one here?

jharenza commented 2 years ago

I like the sina plot with the IQR box best (https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/2117ec351a41df550ac14583bd33e4613e0c1613/analyses/mutational-signatures/plots/cns/exposures_boxplot.pdf). Showing the 0s seems like the move to me.

I also agree with ☝️

I do understand the impulse for the stacked barplot. I am a little worried that we're getting a bit too "far away" from the data, though.

For this, I was thinking the stacked barplots would be a per sample plot faceted by cancer group, rather than lumping, which I agree, does not help with telling the story. For example, S3A in this paper. My main impetus for this was to see if there are any patterns within a cancer group, or perhaps mutually exclusive signatures which we can put into the supplement and describe. The sina plot doesn't give that level of granularity, but if we don't see anything interesting, that would be fine.

Another thought is to do a quick overall correlation plot of the signature weights to determine whether, in brain tumors, any specific signatures correlate, whether or not that correlation was previously known. For example, signatures 2 and 13 (APOBEC) often correlate positively in the same samples (same paper, figure S5E), but we do not have those as part of the CNS signatures, so any correlation we do see may be novel.
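A minimal sketch of the pairwise correlation idea, written in Python for illustration only (the module is in R). It uses plain Pearson correlation, which is an assumption since the thread does not name a method; the function names and toy weights are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length lists; None if either is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


def pairwise_correlations(weights):
    """`weights` maps signature name -> per-sample weights in the same sample order."""
    return {
        (a, b): pearson(weights[a], weights[b])
        for a, b in combinations(sorted(weights), 2)
    }


toy_weights = {  # made-up weights for four samples
    "1": [0.8, 0.6, 0.4, 0.2],
    "3": [0.1, 0.2, 0.3, 0.4],
    "MMR2": [0.1, 0.2, 0.3, 0.4],
}
correlations = pairwise_correlations(toy_weights)
```

If the weights within each sample sum to 1, some negative correlation between signatures is expected by construction, so positive correlations are likely the more informative ones to follow up.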

jharenza commented 2 years ago

The current order is based on "other" signature names. These signatures have two names within the database: those I put in the figures, and alphabetical A-H. So they are currently arranged in A-H order. If this isn't particularly meaningful, then we can definitely move those to the end.

@sjspielman - yes, I think ordering by traditional numeric signatures (1, 3, 8, etc) would be more meaningful, and maybe a tilted axis text would help :)
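The ordering suggested here (numeric signatures first, in numeric order, then named ones such as MMR2) can be sketched with a sort key. This is an illustrative Python snippet, not the module's R code; the bare labels used below are assumed for the example and may not match the labels in the figures.

```python
def signature_sort_key(name):
    """Sort numeric signature names numerically, then all other names alphabetically."""
    if name.isdigit():
        return (0, int(name), "")
    return (1, 0, name)


labels = ["MMR2", "11", "1", "N6", "8", "3", "18", "19"]  # assumed example labels
ordered = sorted(labels, key=signature_sort_key)
print(ordered)
```

A plain alphabetical sort would place "11" before "3", which is what the integer conversion avoids.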

jaclyn-taroni commented 2 years ago

Going to synthesize the above to what I think we should do:

sjspielman commented 2 years ago

Here's a first go of the sample barplot as discussed. Since this viz is very color-heavy, I used a colorblind scale instead of Simpsons. I also removed the specific specimen IDs since they are just noise in this figure. If there are some samples we want to highlight further for any reason, I would think that goes into a separate figure.

Something I am wondering is whether this needs some ordering of the samples within facets, e.g., ordering them by a certain signature exposure?

jharenza commented 2 years ago

Here's a first go of the sample barplot as discussed. Since this viz is very color-heavy, I used a colorblind scale instead of Simpsons. I also removed the specific specimen IDs since they are just noise in this figure. If there are some samples we want to highlight further for any reason, I would think that goes into a separate figure.

Something I am wondering is whether this needs some ordering of the samples within facets, e.g., ordering them by a certain signature exposure?

Wow, you're so quick! This looks really great and you can see some obvious differences across cancer groups, which make sense (MMR in DMG/HGG, including only MMR in a few samples - we should double check those are the hypermutated samples!). I wonder if we should just order descending by signature 1, which is a universal signature: the more of signature 1 and the less of others, the closer it is to a "normal" sample. Maybe ordering by MMR or sig 3 (BRCA) would be more interesting, since I see that while MMR in HGG/DMG is more universal, it is not in all samples in other groups. This could possibly associate with subtype or phase of therapy (e.g., progression/relapse having more mutations and possibly more MMR?).

sjspielman commented 2 years ago

Sounds great, I'll look into 1/3/MMR options for sorting and check about hypermutated samples in this notebook.

sjspielman commented 2 years ago

I've updated the samples barplot to order by signature 1. Among all the options, this one looked the best in part because signature 1 occurs more frequently overall.

I also explored some aspects of how signatures and mutation burdens may relate to tumor descriptors and subtypes. I didn't save any of those figures specifically because nothing jumped out at me very strongly, but you can see them in the previewed notebook with full context if we think there's more to potentially explore here.
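For reference, ordering samples within each cancer group by descending signature 1 weight can be sketched as below. This is a Python illustration only (the notebook is R), and the dictionary keys and toy samples are hypothetical.

```python
def order_samples(samples, signature="1"):
    """Sort samples by cancer group, then by descending weight of `signature`.

    Each sample is a dict with hypothetical keys "sample_id", "cancer_group",
    and "weights" (signature name -> weight). Missing signatures count as 0.
    """
    return sorted(
        samples,
        key=lambda s: (s["cancer_group"], -s["weights"].get(signature, 0.0)),
    )


toy_samples = [  # made-up samples
    {"sample_id": "S1", "cancer_group": "A", "weights": {"1": 0.2, "MMR2": 0.8}},
    {"sample_id": "S2", "cancer_group": "A", "weights": {"1": 0.9, "3": 0.1}},
    {"sample_id": "S3", "cancer_group": "B", "weights": {"3": 1.0}},
    {"sample_id": "S4", "cancer_group": "B", "weights": {"1": 0.5, "3": 0.5}},
]
ordered_ids = [s["sample_id"] for s in order_samples(toy_samples)]
```

Passing `signature="MMR2"` or `signature="3"` gives the alternative orderings that were considered. A signature that is zero in most samples leaves most of the order to ties, which may be part of why signature 1 looked best.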

jharenza commented 2 years ago

I've updated the samples barplot to order by signature 1. Among all the options, this one looked the best in part because signature 1 occurs more frequently overall.

Thanks! Yes, this makes sense and looks good. It is interesting to me that some samples have other signatures which are so strong they then attenuate Signature 1.

I also explored some aspects of how signatures and mutation burdens may relate to tumor descriptors and subtypes. I didn't save any of those figures specifically because nothing jumped out at me very strongly, but you can see them in the previewed notebook with full context if we think there's more to potentially explore here.

What I was thinking about when I mentioned this was if we look specifically within a cancer group from your plot which are now arranged by Sig 1, as samples lose Sig 1, are those more progressive/relapse/post mortem or a specific subtype (I might only explore MB subtypes and/or HGAT H3 mutant vs not mutant here for simplicity)? I would hypothesize those which lose Sig 1 and gain others might be more enriched for progressive/relapse/post mortem because of possibly more mutations occurring in those samples. Not sure how much we want to explore here, though.

including only MMR in a few samples - we should double check those are the hypermutated samples!

For this, can we do a simple check that all samples with MMR2 signature weight == 1.0 have a high (hyper- or ultra-hypermutant) TMB?
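The check described here could look roughly like the following. It is a Python sketch for illustration, not the notebook's R code. The dictionary keys are hypothetical, and the default cutoff of 10 mut/Mb for "hypermutant" is an assumption that should be replaced with whatever threshold the project uses.

```python
import math


def mmr2_only_not_hypermutant(samples, tmb_threshold=10.0):
    """Return IDs of samples whose MMR2 weight is 1.0 but whose TMB is below threshold.

    `samples` is a list of dicts with hypothetical keys "sample_id",
    "mmr2_weight", and "tmb" (mutations per Mb). The default threshold is an
    assumption for this sketch, not a value taken from the analysis.
    """
    return [
        s["sample_id"]
        for s in samples
        if math.isclose(s["mmr2_weight"], 1.0, abs_tol=1e-9)
        and s["tmb"] < tmb_threshold
    ]


toy = [  # made-up values
    {"sample_id": "S1", "mmr2_weight": 1.0, "tmb": 250.0},
    {"sample_id": "S2", "mmr2_weight": 1.0, "tmb": 3.2},
    {"sample_id": "S3", "mmr2_weight": 0.4, "tmb": 1.1},
]
flagged = mmr2_only_not_hypermutant(toy)
print(flagged)
```

An empty result would mean every MMR2-only sample clears the cutoff. The comparison uses a tolerance rather than `==` because weights stored as floats may not be exactly 1.0.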

sjspielman commented 2 years ago

What I was thinking about when I mentioned this was if we look specifically within a cancer group from your plot which are now arranged by Sig 1, as samples lose Sig 1, are those more progressive/relapse/post mortem or a specific subtype (I might only explore MB subtypes and/or HGAT H3 mutant vs not mutant here for simplicity)? I would hypothesize those which lose Sig 1 and gain others might be more enriched for progressive/relapse/post mortem because of possibly more mutations occurring in those samples. Not sure how much we want to explore here, though.

Ah, ok. So this is effectively asking whether the proportion of signature 1 exposure (of all exposures) for a given sample is related to subtypes etc, yes?

For this, can we do a simple check that all samples with MMR2 signature weight == 1.0 have a high (hyper- or ultra-hypermutant) TMB?

Can do!

jharenza commented 2 years ago

Ah, ok. So this is effectively asking whether the proportion of signature 1 exposure (of all exposures) for a given sample is related to subtypes etc, yes?

yes!

sjspielman commented 2 years ago

The notebook has now been massively cleaned up and is ready for another look @jaclyn-taroni !

Note, the README will be finalized with results once we agree on them.

jharenza commented 2 years ago

@sjspielman can you make your exposures_presence_barplot and sina plots 5 rows x 2 columns? This will fit more perfectly in Figure 3 for whichever we use.

jharenza commented 2 years ago

  • Exploratory analyses without exported viz/tables because no strong trends were observed:

    • Explore signature 1 presence/absence and potential correlates
    • Explore MMR2-dominant samples and potential correlates

Thanks for adding these! I took a look and just had a few comments:

  1. When I ran your 07 script in the Docker RStudio, I got an error that ggforce is required but not installed.
  2. It is interesting and surprising that the ultra-hypermutated tumors (>100 mut/Mb) don't have more of an MMR signature, but perhaps they have some other pattern in common. This could be worth checking, since there are only 4 of these samples (TMB > 100 mut/Mb in tmb_coding).
  3. The signature 1 presence/absence across phases of therapy is really cool to see, even if we have low Ns for non-initial tumors. These analyses are hard because it is unclear whether to group the entire PBTA cohort and look at the proportion of Sig 1 by phase of therapy, or to break it down by cancer group, subtype, etc. I did a little more digging and saw that we do also see this trend in LGG, EPN, and meningioma. Can you add one more figure here to plot the entire PBTA cohort? Not sure yet how to handle, but I think it is worth mentioning in the text.

sjspielman commented 2 years ago

Regarding ggforce, this package is added to the Dockerfile in my branch, so for now you can only run the code if you check out my branch. Once this gets merged into master, ggforce will be in the Docker image more generally.

I'll add to the notebook also:

jaclyn-taroni commented 2 years ago

@sjspielman can you push a version of the notebook with the output rendered please? I don't think all the plots are saved as PDFs and I'd like to take a look at them as I review 👀

sjspielman commented 2 years ago

I've made some of the changes, except I can't get the samples barplot vertical. It becomes a very dizzying "optical illusion" with all the tight bars for high N samples if the plot gets narrower than this.

oh no that one is fine as is, bc it can go in the supplement

sjspielman commented 2 years ago

can you push a version of the notebook with the output rendered please? I don't think all the plots are saved as PDFs and I'd like to take a look at them as I review

The most recent push should have everything rendered in Rmd. I haven't exported the exploratory plots, but can quickly do those exports now, stay tuned for another commit in a few min.

jaclyn-taroni commented 2 years ago

@jharenza the discussion here is getting long (there are a bunch of hidden items!) which is usually a sign that we should get this in and then file issues with what else we'd like to see.

sjspielman commented 2 years ago

Noting CI failed because of the CI dataset. I will add a param to this file to only run that particular plot if not CI.