Idea: Create co-occurrence plots per cancer group, assemble into multipanel figure

jaclyn-taroni commented 3 years ago

Context & idea

Right now the output of interaction-plots looks like this: https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/4240cc64ffcce2607717be63ab32acb81cbd299b/analyses/interaction-plots/plots/combined_top50.png

The most important point I want to make about that plot is that all samples are combined in this panel and it uses the top 50 genes (with some FLAGS filtering, if I recall correctly). There was originally an idea to use a different (more curated) gene list but plot all samples together (#1001). My idea with this issue is to go in another direction entirely: split up the interaction plots by cancer_group.

We ended up including cancer_group in part because of #917. I'll quote from the initial post on that issue:

In #915, the interaction plots module is being updated to use broad_histology because integrated_diagnosis is less complete in v18 and harmonized_diagnosis has many different values. I don't think this is quite right.

The question is: What is the right disease label/grouping to use for the interaction plots module for the co-occurrence information to be useful? I suspect that the "right" grouping might come from dropping harmonized_diagnosis values with small sample sizes and combining others.

One concern about using gene lists with the interaction plots is that we'd end up replicating a lot of the same information that's in the oncoprints (https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/1001#issuecomment-819547278).

Splitting up by cancer_group will allow us to add information over and above the oncoprints we expect to include in the main text – namely, we can include molecular_subtype (or harmonized_diagnosis if deemed more appropriate) as I'll show in my sketch below – regardless of whether we use a "top n" or gene list approach.

Then we could assemble the bar plot and tile plot pairs for each cancer_group into one multipanel figure, which would itself likely end up as a single panel in a larger multipanel figure.
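To make the per-group split concrete, here is a minimal sketch in Python of what "split by cancer_group, then compute co-occurrence" could look like. It is not the module's actual code. The sample records, the gene names, the `MIN_SAMPLES` cutoff, and both helper functions are hypothetical, and the one-sided Fisher's exact test is an assumed choice of statistic for illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import comb

# Hypothetical input: one (sample_id, cancer_group, mutated genes) record per sample.
samples = [
    ("S1", "Group A", {"GENE1", "GENE2"}),
    ("S2", "Group A", {"GENE1", "GENE2"}),
    ("S3", "Group A", {"GENE1"}),
    ("S4", "Group A", set()),
    ("S5", "Group B", {"GENE1"}),
]

MIN_SAMPLES = 4  # assumed cutoff for illustration, not a value from this thread


def split_by_group(records, min_samples):
    """Collect mutated-gene sets per cancer_group, dropping groups below min_samples."""
    groups = defaultdict(list)
    for _sample_id, cancer_group, genes in records:
        groups[cancer_group].append(genes)
    return {g: s for g, s in groups.items() if len(s) >= min_samples}


def cooccurrence(gene_sets):
    """Return {(gene_a, gene_b): (n_both, n_a, n_b, p_value)} for one group.

    p_value is a one-sided Fisher's exact test for co-occurrence,
    i.e. the hypergeometric probability of an overlap >= n_both.
    """
    n = len(gene_sets)
    genes = sorted(set().union(*gene_sets))
    results = {}
    for a, b in combinations(genes, 2):
        n_a = sum(a in s for s in gene_sets)
        n_b = sum(b in s for s in gene_sets)
        n_both = sum(a in s and b in s for s in gene_sets)
        tail = sum(
            comb(n_b, k) * comb(n - n_b, n_a - k)
            for k in range(n_both, min(n_a, n_b) + 1)
        )
        results[(a, b)] = (n_both, n_a, n_b, tail / comb(n, n_a))
    return results


per_group = {
    group: cooccurrence(gene_sets)
    for group, gene_sets in split_by_group(samples, MIN_SAMPLES).items()
}
```

With the toy data above, "Group B" has a single sample and is dropped before any statistics are computed. Each remaining entry in `per_group` would feed one bar plot and tile plot pair.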

Sketch of idea

Big thing to note is the use of cross-hatching to indicate whatever narrower category we'd like to use (e.g., molecular_subtype). I think we can do this with ggpattern.

Next steps

Before I invest time in this, I thought I'd get input from others.

jharenza commented 3 years ago

I don't think we have a high enough N for this, since we don't even have a high enough N for the broad histology groups. I wish! :) That is why I stuck with the plot of all cancers together: we do see expected co-occurrences in that plot that we can describe.

jaclyn-taroni commented 3 years ago

I was thinking about Low-grade glioma astrocytoma, Medulloblastoma, High-grade glioma astrocytoma, and maybe Ependymoma and Diffuse midline glioma (based on the numbers here: https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/1174#issuecomment-915372913), but you're right that the other plots in analyses/interaction-plots/plots/ do look bleak re: N. (I neglected to take a look at those before filing.)

Okay, happy to close this. I am intrigued by ggpattern and the possibility of using it to add molecular_subtype info somewhere.
