AlexsLemonade / OpenPBTA-analysis

The analysis repository for the Open Pediatric Brain Tumor Atlas Project

#1144 part 4 Rerun transcriptomics fig generation with cancer_group for UMAP #1171

Closed kgaonkar6 closed 3 years ago

kgaonkar6 commented 3 years ago

Purpose/implementation Section

What scientific question is your analysis addressing?

The request was to update the UMAP in the main figures to use cancer_group (updates to the gsva and immune_deconv plots were not required at the time this PR was submitted).

What was your approach?

I updated figures/scripts/transcriptomic-overview.R to use cancer_group and the related cancer_group_hex_codes here: https://github.com/kgaonkar6/OpenPBTA-analysis/blob/da4b171e9e7bfbf63d478c35b25af948f2e1a0f8/figures/scripts/transcriptomic-overview.R#L120-L132

However, I first needed to rerun collapse-rnaseq: https://github.com/kgaonkar6/OpenPBTA-analysis/blob/da4b171e9e7bfbf63d478c35b25af948f2e1a0f8/figures/generate-figures.sh#L100

and immune_deconv: https://github.com/kgaonkar6/OpenPBTA-analysis/blob/da4b171e9e7bfbf63d478c35b25af948f2e1a0f8/figures/generate-figures.sh#L114

so that Rscript --vanilla scripts/transcriptomic-overview.R could be rerun with the requested cancer_group update.

What GitHub issue does your pull request address?

1144

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Is there anything that you want to discuss further?

The legend shows display_group because the immune_deconv and gsva plots still use those terms; should we update those to cancer_group as well?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

yes

Results

What types of results are included (e.g., table, figure)?

figure

What is your summary of the results?

Updated umap color palette

Reproducibility Checklist

Documentation Checklist

jaclyn-taroni commented 3 years ago

I'm pretty sure something is off with the UMAP figure -- if you look at how many Choroid plexus tumors there are in the UMAP figure, you can see it is many more than you would expect based on the other panels. (I think those bright green samples are in fact embryonal tumors based on earlier results.)

Is it because the transcriptomic-dimension-reduction module also needs to be rerun maybe? https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/d3a7edf8b3153792bce4c985865cabd0adf12747/figures/generate-figures.sh#L97

jaclyn-taroni commented 3 years ago

The legend shows display_group because the immune_deconv and gsva plots still use those terms; should we update those to cancer_group as well?

I think this comment (https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1171#issuecomment-911966313) answers that question: it means the legend I used when reviewing the plot isn't the legend for that plot.

But there is another potential problem – display_group and cancer_group have the same palette but encode different things?

kgaonkar6 commented 3 years ago

Thanks for the review! display_group (derived from broad_histology) and cancer_group have different palettes. However, the legend in the plot under review displays the display_group palette, which is used for all plots in master. Right now only the gsva and immune annotation rows use display_group.

I recreated the UMAP file with its own legend to clarify the colors in the UMAP. The colors look right to me: for example, LGG is the dark red color that accounts for the majority of the data, and the bright green is medulloblastoma, a subset of embryonal tumors. (Attached image: temp_umap)

The legend shows display_group because the immune_deconv and gsva plots still use those terms; should we update those to cancer_group as well?

Here I meant: should we update the annotation rows in gsva and immune to cancer_group, so that we can keep just one legend in the plot and don't have to maintain multiple palettes in this figure?

jaclyn-taroni commented 3 years ago

"Same palette" was not quite the right wording. This is off topic from this PR, but worth figuring out as we're putting together figures.

A question I have is: Will display_group be used in any figures? My understanding is yes, both cancer_group and display_group are expected to be used because they have both been retained in figures/mapping-histology-labels.Rmd.

If the display_group color for Choroid plexus tumor is similar to the cancer_group color for Medulloblastoma, or if #ff0000 means Subependymal Giant Cell Astrocytoma sometimes and Tumor of cranial and paraspinal nerves other times because the colors have been assigned randomly, that could be confusing for folks.
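One way to make the concern above concrete is to compare the two label-to-hex mappings directly. The following is a minimal sketch in Python (the repository's own scripts are in R), assuming each palette can be read into a plain dictionary; the function name and the example hex codes are hypothetical and are not taken from the project's palette files. It only detects identical hex codes, not colors that are merely similar.

```python
def find_color_collisions(display_palette, cancer_palette):
    """Return hex codes that the two palettes assign to different labels.

    Both arguments are hypothetical dicts mapping a group label to a hex code.
    Only exact matches (case-insensitive) are reported; visually similar
    colors would need a perceptual distance measure instead.
    """
    labels_by_hex = {}
    for label, hex_code in display_palette.items():
        labels_by_hex.setdefault(hex_code.lower(), set()).add(label)

    collisions = {}
    for label, hex_code in cancer_palette.items():
        key = hex_code.lower()
        # Labels in the other palette that share this color but differ in name
        other_labels = labels_by_hex.get(key, set()) - {label}
        if other_labels:
            collisions.setdefault(key, set()).update(other_labels | {label})
    return collisions


# Illustrative values only; these are not the project's actual colors.
display_palette = {
    "Choroid plexus tumor": "#00FF00",
    "Low-grade astrocytic tumor": "#8b0000",
}
cancer_palette = {
    "Medulloblastoma": "#00ff00",
    "Low-grade astrocytic tumor": "#8B0000",
}
collisions = find_color_collisions(display_palette, cancer_palette)
```

A label that has the same color in both palettes is not reported, since that reuse is consistent rather than confusing.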

kgaonkar6 commented 3 years ago

"Same palette" was not quite the right wording. This is off topic from this PR, but worth figuring out as we're putting together figures.

A question I have is: Will display_group be used in any figures? My understanding is yes, both cancer_group and display_group are expected to be used because they have both been retained in figures/mapping-histology-labels.Rmd.

Yes, you are right; currently the requirement is to use both display_group and cancer_group in figures.

If the display_group color for Choroid plexus tumor is similar to the cancer_group color for Medulloblastoma, or if #ff0000 means Subependymal Giant Cell Astrocytoma sometimes and Tumor of cranial and paraspinal nerves other times because the colors have been assigned randomly, that could be confusing for folks.

Yeah, I agree. Maybe we can create a palette such that each display_group and cancer_group has a different color?

jaclyn-taroni commented 3 years ago

I thought more about this over the weekend and will file a discussion issue soon (today), but a "preview" – in the draft of the interaction plot in Google slides (Fig 3), we use the cancer_group colors for 10 groups and then lump all other groups together using a gray color and the label Other. It seems like we could do that throughout, effectively replacing display_group with the "abbreviated" version of cancer_group in main display items. So in that same figure, the panel that is probably D (?) which is a bar plot, you could only use colors for the 10 groups (that I assume were selected based on sample size) and keep all other groups gray.
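The lumping idea described above can be sketched as follows. This is an illustration in Python rather than the repository's R code, and it assumes the groups to keep are chosen by sample count, as the comment guesses. The function name, the gray hex code, and the example labels and colors are all hypothetical.

```python
from collections import Counter

OTHER_LABEL = "Other"
OTHER_COLOR = "#808080"  # assumed gray; the actual choice is not specified in the thread


def lump_cancer_groups(labels, palette, n_keep=10):
    """Keep the n_keep most frequent groups and relabel the rest as Other.

    labels  -- one cancer_group label per sample
    palette -- hypothetical dict mapping each cancer_group label to a hex code
    Returns the relabeled list and a reduced palette that includes Other.
    """
    counts = Counter(labels)
    kept = [group for group, _ in counts.most_common(n_keep)]
    kept_set = set(kept)

    lumped = [g if g in kept_set else OTHER_LABEL for g in labels]
    lumped_palette = {g: palette[g] for g in kept}
    lumped_palette[OTHER_LABEL] = OTHER_COLOR
    return lumped, lumped_palette


# Illustrative data only, with n_keep=2 to keep the example small.
labels = ["LGG"] * 3 + ["Medulloblastoma"] * 2 + ["Ependymoma"]
palette = {"LGG": "#8b0000", "Medulloblastoma": "#00ff00", "Ependymoma": "#0000ff"}
lumped, lumped_palette = lump_cancer_groups(labels, palette, n_keep=2)
```

Keeping the original colors for the retained groups means a reader sees the same color for the same group in every panel, while everything else shares one neutral color.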

jharenza commented 3 years ago

The only thing about lumping others into Other is that we will want the actual cancer group on the oncoprint (many are cut out due to having zero mutations, but there are some Ns of 1 here) and ideally the sample distribution plot (though this one may be less important and we can perhaps have a supplemental table of Ns). Having so many colors is hard because we have so many groups. For PPTC, I just used the TCGA palette, but we have many more groups here, which makes it harder to get distinguishable and aesthetically pleasing colors.

jaclyn-taroni commented 3 years ago

The only thing about lumping others into Other is that we will want the actual cancer group on the oncoprint (many are cut out due to having zero mutations, but there are some Ns of 1 here) and ideally the sample distribution plot (though this one may be less important and we can perhaps have a supplemental table of Ns). Having so many colors is hard because we have so many groups.

Yes, looking at the oncoprint and sample distribution plots too and will try to come up with a holistic plan.

I think the best course of action for main display items may be to drop Other in the oncoprint when below a certain N, in the interest of using the lowest number of cancer_group labels feasible and therefore doing the best that we can wrt having colors in the palette be distinct and accessible (the latter will most likely be difficult with the number we end up with).

We should rely on labels as heavily as possible (e.g., in box plots or bar plots) and then use the color palette as a way to make it easy for readers to scan & gather information across figures.

jaclyn-taroni commented 3 years ago

All of that being said - a scatter plot with a 58 color palette and alpha < 1 is going to be problematic.
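To illustrate why partial transparency compounds the problem, here is a small sketch of standard "over" alpha compositing in Python. It assumes opaque RGB backgrounds and a single alpha per point; the helper names and the example colors are hypothetical and are not the project's palette values.

```python
def hex_to_rgb(hex_code):
    """Convert a hex code such as '#ff0000' to an (r, g, b) tuple of ints."""
    h = hex_code.lstrip("#")
    return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))


def blend_over(foreground, background, alpha):
    """Composite a foreground color over an opaque background.

    Uses out = alpha * foreground + (1 - alpha) * background per channel.
    """
    return tuple(
        round(alpha * f + (1 - alpha) * b)
        for f, b in zip(foreground, background)
    )


white = (255, 255, 255)
red = hex_to_rgb("#ff0000")
blue = hex_to_rgb("#0000ff")

# A single half-transparent red point on a white background is already
# not the color shown in the legend.
lone_red = blend_over(red, white, 0.5)

# Where points overlap, the displayed color depends on the drawing order.
red_on_blue = blend_over(red, blend_over(blue, white, 0.5), 0.5)
blue_on_red = blend_over(blue, blend_over(red, white, 0.5), 0.5)
```

With dozens of palette entries, blended colors like these can land close to a legend color that belongs to an unrelated group, which is one reason to rely on labels where possible.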

jaclyn-taroni commented 3 years ago

Closing this PR in light of the fact that we will have updated palettes (#1176) and we will need to change how this figure is put together based on what is in Figure 4 in the Google Slides right now.