Closed kgaonkar6 closed 3 years ago
I'm pretty sure something is off with the UMAP figure -- if you look at how many Choroid plexus tumors there are in the UMAP figure, you can see it is many more than you would expect based on the other panels. (I think those bright green samples are in fact embryonal tumors based on earlier results.)
Is it because the transcriptomic-dimension-reduction module also needs to be rerun maybe? https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/d3a7edf8b3153792bce4c985865cabd0adf12747/figures/generate-figures.sh#L97
> The legend shows display_group because immune_deconv and gsva still use those terms; should we update those to cancer_group as well?
I think this comment (https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1171#issuecomment-911966313) answers this question: it means that the legend I used while reviewing the plot isn't the legend for that plot.
But there is another potential problem: display_group and cancer_group have the same palette but encode different things?
Thanks for the review! display_group (derived from broad_histology) and cancer_group have different palettes. However, the legend section in the plot under review displays the display_group palette, which was used for all plots, as in master. Right now only the gsva and immune annotation rows use display_group.
I recreated the UMAP file with its own legend to clarify the colors in the UMAP. It seems to me that the colors look right. For example, LGG is the dark red color and accounts for the majority of the data, and bright green is medullo, a subset of embryonal tumors.
> The legend shows display_group because immune_deconv and gsva still use those terms; should we update those to cancer_group as well?
Here I meant: should we update the annotation rows in gsva and immune to cancer_group, so that we can keep just one legend in the plot and don't have to keep multiple palettes in this figure?
"Same palette" was not quite the right wording. This is off topic from this PR, but worth figuring out as we're putting together figures.
A question I have is: Will display_group be used in any figures? My understanding is yes, both cancer_group and display_group are expected to be used because they have both been retained in figures/mapping-histology-labels.Rmd.
If the display_group color for Choroid plexus tumor is similar to the cancer_group color for Medulloblastoma or #ff0000 means Subependymal Giant Cell Astrocytoma sometimes and Tumor of cranial and paraspinal nerves other times, because the colors have been assigned randomly, that could be confusing for folks.
> "Same palette" was not quite the right wording. This is off topic from this PR, but worth figuring out as we're putting together figures.
> A question I have is: Will display_group be used in any figures? My understanding is yes, both cancer_group and display_group are expected to be used because they have both been retained in figures/mapping-histology-labels.Rmd.
Yes, you are right; currently the requirement is to use both display_group and cancer_group in figures.
> If the display_group color for Choroid plexus tumor is similar to the cancer_group color for Medulloblastoma or #ff0000 means Subependymal Giant Cell Astrocytoma sometimes and Tumor of cranial and paraspinal nerves other times, because the colors have been assigned randomly, that could be confusing for folks.
Yeah, I agree. Maybe we can create a palette such that each display_group and cancer_group has a different color?
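One way to check for the clash described above is to compare the two palettes directly. The following is a minimal Python sketch, not the project's code: the function name `find_color_collisions` and every hex value except `#ff0000` are made up for illustration, and it only catches exact hex matches (colors that are merely similar would need a perceptual distance measure).

```python
from itertools import product


def find_color_collisions(display_palette, cancer_palette):
    """Return (display_label, cancer_label, hex) triples where both
    palettes use the same hex code for two different labels."""
    collisions = []
    for (d_label, d_hex), (c_label, c_hex) in product(
        display_palette.items(), cancer_palette.items()
    ):
        # Compare case-insensitively so "#FF0000" and "#ff0000" match.
        if d_hex.lower() == c_hex.lower() and d_label != c_label:
            collisions.append((d_label, c_label, d_hex.lower()))
    return sorted(collisions)


# Hypothetical palettes; the hex assignments are assumptions for the example.
display_palette = {
    "Choroid plexus tumor": "#35978f",
    "Tumor of cranial and paraspinal nerves": "#ff0000",
}
cancer_palette = {
    "Medulloblastoma": "#2ca02c",
    "Subependymal Giant Cell Astrocytoma": "#FF0000",
}

collisions = find_color_collisions(display_palette, cancer_palette)
for display_label, cancer_label, hex_code in collisions:
    print(f"{hex_code}: '{display_label}' (display_group) vs '{cancer_label}' (cancer_group)")
```

A check like this could run whenever the palettes are regenerated, so a reused color is caught before it reaches a figure.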
I thought more about this over the weekend and will file a discussion issue soon (today), but a "preview": in the draft of the interaction plot in Google slides (Fig 3), we use the cancer_group colors for 10 groups and then lump all other groups together using a gray color and the label Other. It seems like we could do that throughout, effectively replacing display_group with the "abbreviated" version of cancer_group in main display items. So in that same figure, the panel that is probably D (?) which is a bar plot, you could only use colors for the 10 groups (that I assume were selected based on sample size) and keep all other groups gray.
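The "abbreviated" cancer_group idea can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the project's implementation: it assumes the kept groups are chosen by sample count, and the function name `abbreviate_groups`, the gray hex code, and the example palette values are all hypothetical.

```python
from collections import Counter

OTHER_LABEL = "Other"
OTHER_HEX = "#b5b5b5"  # hypothetical gray; the actual value is not specified


def abbreviate_groups(sample_groups, palette, n_keep=10):
    """Keep the n_keep most frequent groups and relabel the rest as Other.

    Returns the per-sample labels and the reduced label -> hex palette.
    """
    counts = Counter(sample_groups)
    kept = {group for group, _ in counts.most_common(n_keep)}
    labels = [g if g in kept else OTHER_LABEL for g in sample_groups]
    reduced_palette = {g: palette[g] for g in kept}
    reduced_palette[OTHER_LABEL] = OTHER_HEX
    return labels, reduced_palette


# Toy data: one group label per sample; palette values are made up.
samples = ["LGG"] * 3 + ["Medulloblastoma"] * 2 + ["Choroid plexus tumor"]
palette = {
    "LGG": "#8b0000",
    "Medulloblastoma": "#2ca02c",
    "Choroid plexus tumor": "#35978f",
}

labels, reduced_palette = abbreviate_groups(samples, palette, n_keep=2)
print(labels)
print(reduced_palette)
```

Because the full cancer_group label stays available in the input, plots that need the actual group can still use it while main display items use the reduced palette.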
The only thing about lumping others into Other is that we will want the actual cancer group on the oncoprint (many are cut out due to having zero mutations, but there are some Ns of 1 here) and ideally the sample distribution plot (though this one may be less important and we can perhaps have a supplemental table of Ns). Having so many colors is hard because we have so many groups. I forgot, for PPTC I just used the TCGA palette, but we have many more groups here, which makes it hard to have distinguishable and aesthetically pleasing colors.
> The only thing about lumping others into Other is that we will want the actual cancer group on the oncoprint (many are cut out due to having zero mutations, but there are some Ns of 1 here) and ideally the sample distribution plot (though this one may be less important and we can perhaps have a supplemental table of Ns). Having so many colors is hard because we have so many groups.
Yes, looking at the oncoprint and sample distribution plots too and will try to come up with a holistic plan.
I think the best course of action for main display items may be to drop Other in the oncoprint when below a certain N in the interest of using the lowest number of cancer_group labels feasible and therefore doing the best that we can wrt having colors in the palette be distinct and accessible (the latter will most likely be difficult with the number we end up with).
We should rely on labels as heavily as possible (e.g., in box plots or bar plots) and then use the color palette as a way to make it easy for readers to scan & gather information across figures.
All of that being said - a scatter plot with a 58 color palette and alpha < 1 is going to be problematic.
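A rough illustration of why alpha < 1 makes this worse: drawing a color at partial opacity over a white background pulls it toward white, so any two colors end up closer together. The Python sketch below uses plain Euclidean distance in RGB, which is only a crude stand-in for perceived difference, and it ignores overlapping points; the two hex codes and the alpha value are arbitrary examples.

```python
def hex_to_rgb(hex_code):
    """Convert '#rrggbb' to an (r, g, b) tuple of ints in 0..255."""
    hex_code = hex_code.lstrip("#")
    return tuple(int(hex_code[i:i + 2], 16) for i in (0, 2, 4))


def blend_over_white(rgb, alpha):
    """Composite a color with the given opacity over a white background."""
    return tuple(alpha * channel + (1 - alpha) * 255 for channel in rgb)


def rgb_distance(a, b):
    """Euclidean distance between two RGB triples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


color_a = hex_to_rgb("#ff0000")
color_b = hex_to_rgb("#00ff00")
alpha = 0.3

opaque_distance = rgb_distance(color_a, color_b)
blended_distance = rgb_distance(
    blend_over_white(color_a, alpha), blend_over_white(color_b, alpha)
)
print(f"opaque: {opaque_distance:.1f}, at alpha={alpha}: {blended_distance:.1f}")
```

Under this simple model the distance between any two colors shrinks by exactly the factor alpha, which leaves even less room to separate 58 colors.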
Closing this PR in light of the fact that we will have updated palettes (#1176) and we will need to change how this figure is put together based on what is in Figure 4 in the Google Slides right now.
Purpose/implementation Section
What scientific question is your analysis addressing?
The request was to update the UMAP in the main figures to use cancer_group (gsva and immune_deconv plot updates were not required as of the time this PR was submitted).
What was your approach?
I updated figures/scripts/transcriptomic-overview.R to use cancer_group and the related cancer_group_hex_codes [here](https://github.com/kgaonkar6/OpenPBTA-analysis/blob/da4b171e9e7bfbf63d478c35b25af948f2e1a0f8/figures/scripts/transcriptomic-overview.R#L120-L132).
But I needed to rerun:
collapse-rnaseq: https://github.com/kgaonkar6/OpenPBTA-analysis/blob/da4b171e9e7bfbf63d478c35b25af948f2e1a0f8/figures/generate-figures.sh#L100
immune_deconv: https://github.com/kgaonkar6/OpenPBTA-analysis/blob/da4b171e9e7bfbf63d478c35b25af948f2e1a0f8/figures/generate-figures.sh#L114
to be able to rerun Rscript --vanilla scripts/transcriptomic-overview.R with the requested cancer_group update.
What GitHub issue does your pull request address?
#1144
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Which areas should receive a particularly close look?
Is there anything that you want to discuss further?
The legend shows display_group because immune_deconv and gsva still use those terms; should we update those to cancer_group as well?
Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
yes
Results
What types of results are included (e.g., table, figure)?
figure
What is your summary of the results?
Updated umap color palette
Reproducibility Checklist
Documentation Checklist
README and it is up to date.
analyses/README.md and the entry is up to date.