Closed runjin326 closed 2 years ago
Hi @runjin326 and @jharenza, I took a look at this and wanted to add some general comments. Before we get into any specific changes we might want to make, we should discuss these ideas!
- [...] `figures/scripts` (calling it something like `supp-subtype-umap.R` perhaps) instead of adding a new analysis module. That way we can keep everything used only for that purpose organized there and add it to the figure generation shell script `figures/generate-figures.sh`. I’ll also add that script and function names should reflect that we’re using UMAP here, rather than t-SNE, if we’re going to specify/describe the method in those places.
- [...] `transcriptomic-dimension-reduction` module? If they don’t currently work for this purpose, is there a way to alter them to make them more general (without it being a considerable undertaking to update other parts of the code)? We do something like this here for the main display item using the UMAP plot.
- [...] `colorblindr` rather than the two palettes from Colorbrewer.
- I don’t think we should be lumping specimens without molecular subtypes in the cancer group under consideration into `Other CNS Tumor`. We could use different language than “To be classified” if we’d like. (Maybe I didn’t follow that logic correctly.)

Looking forward to your thoughts! Thank you!
Hi @jaclyn-taroni, thanks so much for the comments and recommendations. I will do the following:
1) Move the script to `figures/scripts`
2) Use the function `plot_dimension_reduction` to generate the visualization
3) Rename the script to `supp-subtype-umap.R` to indicate we are using UMAP
4) Use the Okabe-Ito palette from `colorblindr` if necessary (although it looks like the function itself should be able to take care of the colors)
5) For HGG, use color to distinguish molecular alterations but shape to distinguish DMG and HGG
I think @jharenza's suggestions on the two remaining questions will be valuable:
1) What to do with LGAT for better interpretation
2) What to do with `To be classified` in a particular cancer group - should we just remove them or rename them as `To be classified`
> I don’t think we should be lumping specimens without molecular subtypes in the cancer group under consideration into `Other CNS Tumor`. We could use different language than “To be classified” if we’d like. (Maybe I didn’t follow that logic correctly.)

> What to do with `To be classified` in a particular cancer group - should we just remove them or rename them as `To be classified`
@jaclyn-taroni do you mean that within a broad histology, you would want to see `To be classified` separate from `Other CNS tumor`? That makes sense - we could perhaps use a dark grey for `To be classified` and the light grey as is for `Other CNS tumor` - thoughts?
> When DMG and HGG samples share molecular alterations, are there ways to keep the color the same to limit the number of colors in the palette but use shape to distinguish between DMG and HGG?
Since DMG and HGG would only share the wildtype designation (K28 defines DMG) I was thinking that maybe we could instead combine the TP53 subtypes (get rid of activated/loss), which would remove two more groups. I would also be OK with us removing HGG/DMG in general to make it cleaner.
> What to do with LGAT for better interpretation
For this, perhaps we can do:
> @jaclyn-taroni do you mean that within a broad histology, you would want to see `To be classified` separate from `Other CNS tumor`? That makes sense - we could perhaps use a dark grey for `To be classified` and the light grey as is for `Other CNS tumor` - thoughts?
Yep, that's what I mean. This plan sounds good!
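To make the agreed labeling concrete, here is a minimal sketch of the rule. It is written in Python for illustration only (the figure script itself is R), and the function name, argument names, and the two grey hex values are assumptions; the thread only specifies "dark grey" and "light grey".

```python
OTHER_LABEL = "Other CNS tumor"
UNCLASSIFIED_LABEL = "To be classified"

# Hypothetical grey values: the discussion only says "dark grey" and "light grey".
GREY_COLORS = {
    UNCLASSIFIED_LABEL: "#4D4D4D",  # dark grey
    OTHER_LABEL: "#D3D3D3",         # light grey
}


def plot_label(sample_group, subtype, focus_group):
    """Return the legend label for one sample when plotting `focus_group`.

    Samples outside the group of interest become "Other CNS tumor".
    Samples inside it without a usable molecular subtype are kept separate
    as "To be classified". Everything else keeps its subtype.
    """
    if sample_group != focus_group:
        return OTHER_LABEL
    if subtype is None or UNCLASSIFIED_LABEL in subtype:
        return UNCLASSIFIED_LABEL
    return subtype
```

With this rule, a subtype such as `To be classified, TP53 activated` inside the group of interest is still shown as `To be classified` rather than being folded into `Other CNS tumor`.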
> Since DMG and HGG would only share the wildtype designation (K28 defines DMG) I was thinking that maybe we could instead combine the TP53 subtypes (get rid of activated/loss), which would remove two more groups. I would also be OK with us removing HGG/DMG in general to make it cleaner.
I like where this is headed! But there may be a way to include the TP53 subtypes still.
Since H3 K28 is a defining lesion for DMG, we might consider coloring points based on H3 status (and I guess IDH) and explicitly stating that H3 K28 means DMG, all other samples are HGG in the figure legend. So the colors would represent:
You could then represent TP53 status with shape.
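One way to picture the color/shape split is to break each subtype string into two plotting variables. The sketch below is Python for illustration only and assumes the subtype is a comma-separated string whose first field is the `DMG`/`HGG` prefix (for example `DMG, H3 K28, TP53 loss`); that format, the function name, and the fallback label for a missing TP53 field are all assumptions rather than something stated in this thread.

```python
def split_hgg_subtype(subtype):
    """Split a subtype string into (color_group, shape_group).

    Assumes a comma-separated format such as "DMG, H3 K28, TP53 loss":
    the first field is the DMG/HGG prefix, a field starting with "TP53"
    carries TP53 status, and whatever remains (H3 or IDH status) drives color.
    """
    parts = [p.strip() for p in subtype.split(",")]
    tp53 = [p for p in parts if p.startswith("TP53")]
    status = [p for p in parts[1:] if not p.startswith("TP53")]
    color_group = ", ".join(status)
    # Hypothetical fallback label for subtypes with no TP53 field.
    shape_group = tp53[0] if tp53 else "TP53 status not reported"
    return color_group, shape_group
```

Because the DMG/HGG prefix is dropped, the legend would need to state that H3 K28 means DMG and all other samples are HGG, as suggested above.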
- remove germline/somatic from NF1 (gets rid of 3 groups)
- can you make another shape for CDKN2A/B instead of lumping it with the subtype? (gets rid of 3 groups) Then, maybe I can further assess..
Agree that we should make those tweaks to the LGAT visualization and then go from there!
@runjin326 if this makes sense on your end/for your workflow, maybe we could iterate on the visualizations in this notebook and then once we're in agreement have you make the `figures/scripts` changes? I do not have strong feelings about that plan, just wanted to offer that idea.
@jharenza and @jaclyn-taroni - thanks for the suggestions. I will make modifications accordingly. Unfortunately I have already moved the scripts to `figures/scripts` and I am outputting the figures to `figures/supp` for assessment. I will ping you both when the changes are made.
> You could then represent TP53 status with shape.
Perfect
@jaclyn-taroni and @jharenza, I have now made changes to the figures as suggested. In addition to what we discussed above, I also removed `MB` and `EPN` to be consistent with LGAT and HGAT.
Thanks for these changes @runjin326! A couple general comments that we should talk through/address before we get into the code.
The color palette for the EPN subtypes can be improved in my opinion. Specifically, I think ST YAP1 and PF A might be challenging to distinguish for readers with deuteranopia (checked with Color Oracle).
We can probably just use the Okabe-Ito palette in almost every case because each panel will have its own legend, rather than sampling from 15+ colors? Picking color palettes has been pretty challenging for this project and we don't always hit the mark, but I think repeating these palettes for small numbers of groups (e.g., 4) is okay.
The LGAT plot still has 14 groups we’re trying to signify with color and it looks like there are some where there are very few samples (e.g., H3). We should find a way to drop some of the labels we’re using color to represent. I’m coming back around to an idea I mentioned earlier: for categories with very few samples, can we use text to label the points instead? Cc: @jharenza
@jaclyn-taroni - thanks for the feedback! I have made 2 changes:
1) I am now using the Okabe-Ito palette - since the function itself already encodes how the palette is used and I do not want to change the function (and potentially break other modules that use it), I manually add the hex codes and still use `sample()` to select colors from the palette.
2) For LGAT, I group the subtypes with fewer than 10 samples into `Other LGAT subtypes`, color all of them black, and add text to indicate which subtype each point belongs to.
Let me know whether the figures look good to you now!
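The grouping rule in item 2 can be sketched as follows. This is an illustrative Python version, not the R code in the PR; the function name and the `min_count` parameter are assumptions, while the threshold of 10 and the `Other LGAT subtypes` label come from the comment above.

```python
from collections import Counter


def lump_rare_subtypes(subtypes, min_count=10, other_label="Other LGAT subtypes"):
    """Replace every subtype that has fewer than `min_count` samples with `other_label`."""
    counts = Counter(subtypes)
    return [s if counts[s] >= min_count else other_label for s in subtypes]
```

Counting happens once over the full list, so a subtype is either kept everywhere or lumped everywhere.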
Thanks for trying that @runjin326! I discussed this with a few folks at the CCDL because I wasn't sure what the best path forward was. I'm going to summarize my takeaways below.
I don't think the labels are going to work out. It kind of implies that those points are most important. But I do think keeping the groups with fewer than 10 samples together is a good idea in principle. We should also sort all the "other" categories such that they are plotted first. We'd like the remaining (non-"other") points to be on "top" of the plot so they are more visible, which means they should come last in the data.
However, the bigger issue is: What is the message?
To me, it seems like the most interesting patterns might have to do with the BRAF alterations, RTK, and wildtype. If that's true, I think we could just highlight those groups cc: @jharenza
Another option we could take to retain more groups is to facet based on subtype.
Re: the palette – BRAF V600E and RTK are currently hard to tell apart. I'd recommend selecting specific indices of the vector of hex codes, rather than sampling, so we have more control over that.
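Both suggestions (draw the "other" categories first, and pick palette entries by index instead of sampling) can be sketched in a few lines. This is illustrative Python rather than the R used in the PR; the function names and the example indices are assumptions, and the palette is listed in one common ordering of the eight Okabe-Ito colors with black last.

```python
# Okabe-Ito hex codes: orange, sky blue, bluish green, yellow,
# blue, vermillion, reddish purple, black.
OKABE_ITO = [
    "#E69F00", "#56B4E9", "#009E73", "#F0E442",
    "#0072B2", "#D55E00", "#CC79A7", "#000000",
]


def order_for_plotting(labels, background):
    """Stable sort that puts `background` labels first, in the order given.

    Points drawn first end up underneath, so the groups of interest,
    which come last, stay visible on top.
    """
    def rank(label):
        return background.index(label) if label in background else len(background)

    return sorted(labels, key=rank)


def pick_colors(groups, indices):
    """Map each group to a fixed palette entry chosen by index (no random sampling)."""
    if len(groups) != len(indices):
        raise ValueError("need exactly one palette index per group")
    return {group: OKABE_ITO[i] for group, i in zip(groups, indices)}
```

Choosing indices explicitly, for example orange and blue for two groups that must be easy to tell apart, keeps the assignment reproducible and lets hard-to-distinguish pairs be avoided by hand.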
@jaclyn-taroni - thanks so much for the feedback. I have made the following changes:
1) Removed text from the LGG figures
2) Selected the hex codes by index rather than sampling them
3) Kept the `other lgat tumors` group in the LGAT figure for all subtypes that have <10 samples
4) Plotted `Other CNS Tumor` first, then `To be classified`, and then the rest for better visualization.
What I have not done is change the LGAT figure to highlight only the BRAF alterations, RTK, and wildtype - feedback from @jharenza would be great on this. If we do group them together, then we would need to explain in the methodology why we group them like that.
Facet is another option and I can try that out if desired.
@jharenza, thanks so much for reviewing this! I have now changed `not altered` in the CDKN status for LGG to a circle. It should be ready to merge.
Hi all, I made a few comments in the code where a few minor items can be cleaned up, and then it looks good to go! Importantly, it does look like the branch needs to be updated to master, so let's make sure this branch is up-to-date before the merge.
Thanks so much for reviewing this! I have now merged the most up-to-date master into this branch. Additionally, I added the hex codes and data release version as variables up front in the code. We do not really have a bash run script, so I can't add them as input variables (and I think it might not be necessary either, since we might not need to run all the scripts in the folder over and over again). As for the data release, it was originally tied specifically to release v21 to avoid confusion (I believe this is because the paper will be based largely on the v21 release and the figures are for the paper).
Let me know if there is anything else you want me to modify :)
> We do not really have a run bash script so that I can't really add them as input variables (and I think it might not be necessary either since we might not need to run all the scripts in the folder over and over again)
Makes sense! The updates look good to me, so I'll go ahead and approve.
This all looks ready to go, merging in!
Purpose/implementation Section
What scientific question is your analysis addressing?
This PR generates UMAP plots for figure S4 of the paper.
What was your approach?
For annotating samples: 1) HGG - I kept the HGG and DMG molecularly subtyped samples that are NOT `To be classified` as is, and the rest (including `To be classified` with HGG and DMG prefix as well as all other tumors) were coded as `Other CNS tumor`; 2) LGG - I removed the `LGG` or `GNG` or `GNT` prefix in the molecular subtype and just use the molecular subtypes that are not `To be classified` as is. 3) For MB and EPN - molecular subtyped samples that are NOT `To be classified` as is, and the rest as `Other CNS tumor`.
For generating figures: 1) The molecular subtypes that are not `Other CNS tumor` would have colors and `Other CNS tumor` would be grey 2) They are output to the `plots` folder with the names specifying which cancer group we are plotting.
What GitHub issue does your pull request address?
https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/1198
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Which areas should receive a particularly close look?
Check to see whether the molecular subtypes to plot make sense for each cancer group.
Is there anything that you want to discuss further?
1) For recoding samples to `Other CNS tumor`, some samples have a combination of to-be-classified with other information (e.g., `To be classified, TP53 activated`) - currently, I recode any molecular subtypes containing `To be classified` to be `Other CNS tumor`. Is this the right approach?
2) For LGG, since I removed prefixes and lumped molecular information together - do we want to add `LGG` to all of them to be clearer?
3) A general question - currently, I put the analysis in a separate folder in the `analysis` folder (have not written up README.md yet since we might want to move it around). Do we want it to be in a separate folder? If so, I would write up a README. If not, feel free to suggest a better place to store the code.
Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
Yes.
Results
What types of results are included (e.g., table, figure)?
Figure - `plots` directory
What is your summary of the results?
N/A
Reproducibility Checklist
Documentation Checklist
- [...] `README` and it is up to date.
- [...] `analyses/README.md` and the entry is up to date.