UMAP figures added

runjin326 commented 2 years ago

Purpose/implementation Section

What scientific question is your analysis addressing?

This PR generates UMAP plots for figure S4 of the paper.

What was your approach?

For annotating samples: 1) HGG - I kept the molecular subtypes of HGG and DMG samples that are NOT To be classified as is, and the rest (including To be classified with an HGG or DMG prefix, as well as all other tumors) were coded as Other CNS tumor; 2) LGG - I removed the LGG or GNG or GNT prefix from the molecular subtype and used the molecular subtypes that are not To be classified as is; 3) MB and EPN - I kept molecular subtyped samples that are NOT To be classified as is, and coded the rest as Other CNS tumor.

For generating figures: 1) Molecular subtypes other than Other CNS tumor are colored, and Other CNS tumor is grey; 2) The figures are output to the plots folder with file names specifying which cancer group is plotted.
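The recoding and coloring rules above can be sketched roughly as follows. This is an illustration in Python, not the PR's R code; it assumes subtype strings of the form "PREFIX, subtype" (for example "LGG, BRAF V600E"), and the function names and the grey hex value are hypothetical.

```python
OTHER = "Other CNS tumor"
TO_BE_CLASSIFIED = "To be classified"
GREY = "#BEBEBE"  # assumed grey for the background group


def recode_subtype(molecular_subtype, in_cancer_group):
    """Return the label to plot for one sample."""
    # Samples outside the cancer group, without a subtype, or with any
    # "To be classified" component are lumped into the background group.
    if not in_cancer_group or not molecular_subtype:
        return OTHER
    if TO_BE_CLASSIFIED in molecular_subtype:
        return OTHER
    return molecular_subtype


def strip_prefix(molecular_subtype, prefixes=("LGG", "GNG", "GNT")):
    """Drop a leading 'PREFIX, ' (assumed format) from an LGG-type subtype."""
    for prefix in prefixes:
        lead = prefix + ", "
        if molecular_subtype.startswith(lead):
            return molecular_subtype[len(lead):]
    return molecular_subtype


def assign_colors(labels, palette):
    """Map each plotted label to a color; the background group is grey."""
    colored = sorted(set(labels) - {OTHER})
    if len(colored) > len(palette):
        raise ValueError("palette has too few colors for the subtypes")
    colors = dict(zip(colored, palette))
    colors[OTHER] = GREY
    return colors
```

With this rule, a combined label such as "To be classified, TP53 activated" is also recoded to Other CNS tumor, which is the behavior asked about in the discussion points of this PR.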

What GitHub issue does your pull request address?

https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/1198

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Check to see whether the molecular subtypes to plot make sense for each cancer group.

Is there anything that you want to discuss further?

1) For recoding samples to Other CNS tumor, some samples have a combination of to-be-classified with other information (e.g., To be classified, TP53 activated) - currently, I recode any molecular subtypes containing To be classified to be Other CNS tumor. Is this the right approach?

2) For LGG, since I removed the prefixes and lumped the molecular information together - do we want to add LGG to all of them to be clearer?

3) A general question - currently, I put the analysis in a separate folder in the analysis folder (I have not written up README.md yet since we might want to move it around). Do we want it to be in a separate folder? If so, I will write up a README. If not, feel free to suggest a better place to store the code.

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes.

Results

What types of results are included (e.g., table, figure)?

Figure - plots directory:

umap-epn-subtypes.pdf
umap-hgg-subtypes.pdf
umap-lgg-subtypes.pdf
umap-mb-subtypes.pdf

What is your summary of the results?

N/A

Reproducibility Checklist

[x] The dependencies required to run the code in this pull request have been added to the project Dockerfile.
[ ] This analysis has been added to continuous integration.

Documentation Checklist

[ ] This analysis module has a README and it is up to date.
[ ] This analysis is recorded in the table in analyses/README.md and the entry is up to date.
[x] The analytical code is documented and contains comments.

jaclyn-taroni commented 2 years ago

Hi @runjin326 and @jharenza, I took a look at this and wanted to add some general comments. Before we get into any specific changes we might want to make, we should discuss these ideas!

- To me, this seems like figure creation explicitly for the purpose of making publication ready plots. If you agree, I’d encourage you to add a script to figures/scripts (calling it something like supp-subtype-umap.R perhaps) instead of adding a new analysis module. That way we can keep everything used only for that purpose organized there and add it to the figure generation shell script figures/generate-figures.sh. I’ll also add that script and function names should reflect that we’re using UMAP here, rather than t-SNE, if we’re going to specify/describe the method in those places.
- In the interest of keeping things consistent/not duplicating functionality, can we consider: are there ways to use the existing functions for making these plots in the transcriptomic-dimension-reduction module? If they don’t currently work for this purpose, is there a way to alter them to make them more general (without it being a considerable undertaking to update other parts of the code)? We do something like this here for the main display item using the UMAP plot.
- General comments:
  - If we’re able to based on the number of subtypes we’d like to represent in a figure, it’d be great to use the Okabe-Ito palette from colorblindr rather than the two palettes from Colorbrewer.
- More specific comments about the figures with many subtypes:
  - When DMG and HGG samples share molecular alterations, are there ways to keep the color the same to limit the number of colors in the palette but use shape to distinguish between DMG and HGG?
  - There are a lot of different categories in the LGAT plots such that using color palette alone is going to be difficult and some of the colors might be hard to distinguish. Are there molecular alterations that we should prioritize including? For example, if only one sample has an alteration, picking out an individual sample on that scatter plot will be hard unless we want to use text on the plot itself somehow. If we limit the number of alterations we represent using features of all points (i.e., fill, shape, outline), it might result in a visualization that’s easier to interpret.
- I don’t think we should be lumping specimens without molecular subtypes in the cancer group under consideration into Other CNS Tumor. We could use different language than “To be classified” if we’d like. (Maybe I didn’t follow that logic correctly.)

Looking forward to your thoughts! Thank you!

runjin326 commented 2 years ago

Hi @jaclyn-taroni, thanks so much for the comments and recommendations. I will do the following: 1) Move the script to figures/scripts 2) Use the function plot_dimension_reduction to generate the visualization 3) Rename the script to supp-subtype-umap.R to indicate we are using UMAP 4) Use the Okabe-Ito palette from colorblindr if necessary (although it looks like the function itself should be able to take care of the colors) 5) For HGG, use color to distinguish molecular alterations but shape to distinguish DMG and HGG

I think @jharenza's suggestion on the two remaining questions will be valuable: 1) What to do with LGAT for better interpretation 2) What to do with To be classified in particular cancer group - should we just remove them or rename them as To be classified

jharenza commented 2 years ago

> I don’t think we should be lumping specimens without molecular subtypes in the cancer group under consideration into Other CNS Tumor. We could use different language than “To be classified” if we’d like. (Maybe I didn’t follow that logic correctly.)

> What to do with To be classified in particular cancer group - should we just remove them or rename them as To be classified

@jaclyn-taroni do you mean that within a broad histology, you would want to see To be classified separate from Other CNS tumor? That makes sense - we could perhaps use a dark grey for To be classified and the light grey as is for Other CNS tumor - thoughts?

> When DMG and HGG samples share molecular alterations, are there ways to keep the color the same to limit the number of colors in the palette but use shape to distinguish between DMG and HGG?

Since DMG and HGG would only share the wildtype designation (K28 defines DMG) I was thinking that maybe we could instead combine the TP53 subtypes (get rid of activated/loss), which would remove two more groups. I would also be OK with us removing HGG/DMG in general to make it cleaner.

> What to do with LGAT for better interpretation

For this, perhaps we can do:

- remove germline/somatic from NF1 (gets rid of 3 groups)
- can you make another shape for CDKN2A/B instead of lumping it with the subtype? (gets rid of 3 groups) Then, maybe I can further assess.

jaclyn-taroni commented 2 years ago

General

> @jaclyn-taroni do you mean that within a broad histology, you would want to see To be classified separate from Other CNS tumor? That makes sense - we could perhaps use a dark grey for To be classified and the light grey as is for Other CNS tumor - thoughts?

Yep, that's what I mean. This plan sounds good!

HGAT

> Since DMG and HGG would only share the wildtype designation (K28 defines DMG) I was thinking that maybe we could instead combine the TP53 subtypes (get rid of activated/loss), which would remove two more groups. I would also be OK with us removing HGG/DMG in general to make it cleaner.

I like where this is headed! But there may be a way to include the TP53 subtypes still.

Since H3 K28 is a defining lesion for DMG, we might consider coloring points based on H3 status (and I guess IDH) and explicitly stating in the figure legend that H3 K28 means DMG and all other samples are HGG. So the colors would represent:

- H3 wild type
- H3 G35
- H3 K28
- IDH
- To be classified (dark grey)
- Other CNS (lighter grey)

You could then represent TP53 status with shape.
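The proposed HGAT encoding (color from H3/IDH status, shape from TP53 status) can be sketched as a function that splits one subtype string into two plotting groups. This is a Python illustration under assumptions: the comma-separated string format (for example "DMG, H3 K28, TP53 loss"), the spelling "H3 wildtype" in the data, the function name, and the fallback labels are all hypothetical.

```python
# Color groups named in the proposal, checked in this order.
H3_GROUPS = ("H3 K28", "H3 G35", "H3 wildtype", "IDH")


def hgat_aesthetics(molecular_subtype, in_hgat=True):
    """Return (color_group, shape_group) for one sample."""
    if not in_hgat or not molecular_subtype:
        # Samples outside HGAT form the light grey background.
        return ("Other CNS", "not applicable")

    # Shape group: TP53 status, when the subtype string carries one.
    if "TP53 activated" in molecular_subtype:
        tp53 = "TP53 activated"
    elif "TP53 loss" in molecular_subtype:
        tp53 = "TP53 loss"
    else:
        tp53 = "TP53 not annotated"  # hypothetical fallback label

    # Color group: unclassified samples get their own (dark grey) group.
    if "To be classified" in molecular_subtype:
        return ("To be classified", tp53)
    for group in H3_GROUPS:
        if group in molecular_subtype:
            return (group, tp53)
    return ("To be classified", tp53)
```

Because H3 K28 defines DMG, the DMG/HGG prefix is not needed as a separate aesthetic here; the figure legend would carry that statement instead.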

LGAT

> - remove germline/somatic from NF1 (gets rid of 3 groups)
> - can you make another shape for CDKN2A/B instead of lumping it with the subtype? (gets rid of 3 groups) Then, maybe I can further assess..

Agree that we should make those tweaks to the LGAT visualization and then go from there!

Next steps

@runjin326 if this makes sense on your end/for your workflow, maybe we could iterate on the visualizations in this notebook and then once we're in agreement have you make the figures/scripts changes? I do not have strong feelings about that plan, just wanted to offer that idea.

runjin326 commented 2 years ago

@jharenza and @jaclyn-taroni - thanks for the suggestions. I will make modifications accordingly. Unfortunately I have already moved the scripts to figures/scripts and I am outputting the figures to figures/supp for assessment. I will ping you both when the changes are made.

jharenza commented 2 years ago

> You could then represent TP53 status with shape.

Perfect

runjin326 commented 2 years ago

@jaclyn-taroni and @jharenza , I have now made changes to the figures as suggested. In addition to what we discussed above, I also removed MB and EPN to be consistent with LGAT and HGAT.

jaclyn-taroni commented 2 years ago

Thanks for these changes @runjin326! A couple general comments that we should talk through/address before we get into the code.

The color palette for the EPN subtypes can be improved in my opinion. Specifically, I think ST YAP1 and PF A might be challenging to distinguish for readers with deuteranopia (checked with Color Oracle).

We can probably just use the Okabe-Ito palette in almost every case because each panel will have its own legend, rather than sampling from 15+ colors? Picking color palettes has been pretty challenging for this project and we don't always hit the mark, but I think repeating these palettes for small numbers of groups (e.g., 4) is okay.
The LGAT plot still has 14 groups we’re trying to signify with color and it looks like there are some where there are very few samples (e.g., H3). We should find a way to drop some of the labels we’re using color to represent. I’m coming back around to an idea I mentioned earlier: for categories with very few samples, can we use text to label the points instead? Cc: @jharenza

runjin326 commented 2 years ago

@jaclyn-taroni - thanks for the feedback! I have made 2 changes: 1) I am now using the Okabe-Ito palette. Since the existing function already encodes how its palette is used, and I do not want to modify it (and potentially break other modules that use it), I manually added the hex codes and still use sample() to select colors from the palette. 2) For LGAT, I group subtypes with fewer than 10 samples into Other LGAT subtypes, color all of them black, and add text to indicate which subtype each point belongs to.
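The grouping of rare LGAT subtypes can be sketched as below. This is a Python illustration rather than the R code in the PR; the function name and the `keep` parameter (labels that should never be lumped, such as the grey background groups) are hypothetical.

```python
from collections import Counter


def lump_rare(labels, min_samples=10, other_label="Other LGAT subtypes", keep=()):
    """Replace labels seen fewer than min_samples times with other_label."""
    counts = Counter(labels)
    return [
        label if counts[label] >= min_samples or label in keep else other_label
        for label in labels
    ]
```

Counting happens once on the full label vector, so the threshold applies to the number of samples per subtype in the plotted data, not per panel.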

Let me know whether the figures look good to you now!

jaclyn-taroni commented 2 years ago

Thanks for trying that @runjin326! I discussed this with a few folks at the CCDL because I wasn't sure what the best path forward was. I'm going to summarize my takeaways below.

I don't think the labels are going to work out. It kind of implies that those points are most important. But I do think keeping the groups with fewer than 10 samples together is a good idea in principle. We should also sort all the "other" categories such that they are plotted first. We'd like the remaining (non-"other") points to be on "top" of the plot so they are more visible, which means they should come last in the data.

However, the bigger issue is: What is the message?

To me, it seems like the most interesting patterns might have to do with the BRAF alterations, RTK, and wildtype. If that's true, I think we could just highlight those groups cc: @jharenza

Another option we could take to retain more groups is to facet based on subtype.

Re: the palette - BRAF V600E and RTK are currently hard to tell apart. I'd recommend selecting specific indices of the vector of hex codes, rather than sampling, so we have more control over that.
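Selecting palette entries by index instead of sampling could look like the sketch below (Python, for illustration only). The hex codes are the eight standard Okabe-Ito colors; the function name and the particular indices chosen in the usage example are hypothetical.

```python
# Okabe-Ito palette: black, orange, sky blue, bluish green,
# yellow, blue, vermillion, reddish purple.
OKABE_ITO = [
    "#000000", "#E69F00", "#56B4E9", "#009E73",
    "#F0E442", "#0072B2", "#D55E00", "#CC79A7",
]


def pick_colors(groups, indices):
    """Pair each group with a fixed palette entry (no random sampling)."""
    if len(groups) != len(indices):
        raise ValueError("need exactly one palette index per group")
    return {group: OKABE_ITO[i] for group, i in zip(groups, indices)}


# Example: keep two groups that were hard to tell apart on distinct hues.
colors = pick_colors(["BRAF V600E", "RTK"], [1, 5])
```

Fixed indices make the mapping reproducible across runs and let reviewers adjust one group's color without reshuffling the others.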

runjin326 commented 2 years ago

@jaclyn-taroni - thanks so much for the feedback. I have made the following changes: 1) Removed text from the LGG figures 2) Selected hex codes by index rather than sampling them 3) Kept the other lgat tumors group in the LGAT figure for all subtypes that have <10 samples 4) Plotted Other CNS Tumor first, then To be classified, and then the rest for better visualization.
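The plotting order in item 4 amounts to a stable sort that puts the background groups first, so that later rows are drawn on top. A minimal Python sketch, assuming each sample is a dict with a "label" field (the field name and function name are hypothetical):

```python
def plotting_order(rows, label_key="label",
                   background=("Other CNS tumor", "To be classified")):
    """Sort rows so background groups are drawn first (underneath)."""
    rank = {label: i for i, label in enumerate(background)}
    # All other labels share the highest rank; sorted() is stable,
    # so their original relative order is preserved.
    return sorted(rows, key=lambda row: rank.get(row[label_key], len(background)))


rows = [
    {"id": "s1", "label": "BRAF V600E"},
    {"id": "s2", "label": "To be classified"},
    {"id": "s3", "label": "Other CNS tumor"},
    {"id": "s4", "label": "RTK"},
]
ordered = plotting_order(rows)
```

In plotting libraries that draw points in row order, the colored subtypes then sit on top of both grey groups.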

What I have not done is restrict the LGAT figure to highlighting only the BRAF alterations, RTK, and wildtype - feedback from @jharenza would be great on this. If we do group the rest together, then we would need to explain in the methods why we grouped them that way.

Facet is another option and I can try that out if desired.

runjin326 commented 2 years ago

@jharenza, thanks so much for reviewing this! I have now changed the shape for samples that are not altered in CDKN status in LGG to a circle. This should be ready for merge.

sjspielman commented 2 years ago

Hi all, I made a few comments in the code where a few minor items can be cleaned up, and then it looks good to go! Importantly, it does look like the branch needs to be updated to master, so let's make sure this branch is up-to-date before the merge.

runjin326 commented 2 years ago

> Hi all, I made a few comments in the code where a few minor items can be cleaned up, and then it looks good to go! Importantly, it does look like the branch needs to be updated to master, so let's make sure this branch is up-to-date before the merge.

Thanks so much for reviewing this! I have now merged the most up-to-date master into this branch. Additionally, I added the hex codes and data release version as variables up-front in the code. We do not really have a run bash script so that I can't really add them as input variables (and I think it might not be necessary either since we might not need to run all the scripts in the folder over and over again). As for the data release, it was originally tied specifically to release v21 to avoid confusion (I believe this is because the paper will largely be based on the v21 release and the figures are for the paper).

Let me know if there is anything else you want me to modify :)

sjspielman commented 2 years ago

> We do not really have a run bash script so that I can't really add them as input variables (and I think it might not be necessary either since we might not need to run all the scripts in the folder over and over again)

Makes sense! The updates look good to me, so I'll go ahead and approve.

sjspielman commented 2 years ago

This all looks ready to go, merging in!

AlexsLemonade / OpenPBTA-analysis

UMAP figures added #1213
