Closed jaclyn-taroni closed 3 years ago
I'll also note another, possibly more straightforward way to limit the palette would be to only consider cancer_group where N >= 10 (18 colors).
I like this idea, more general and we will capture most of the data without prior histology based selection.
I know I said I would file a new issue about how to do this, but thinking ahead a tad: Functionally, we would have an individual hex code for each of 18 cancer_group values (or 15, depending on the outcome of discussion) and then one (gray) hex code for all other cancer_group values, and that would become the display_group. So the notebook for creating display_group would get updated to use cancer_group instead of broad_histology to create display_group.
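A minimal sketch of that lumping step, in base R so it has no package dependencies. The toy data frame, the `min_n` name, and the gray hex code for "Other" are assumptions made for illustration; the two non-gray hex codes are taken from the palette tables later in this thread, and the real notebook may be structured quite differently.

```r
# Hypothetical toy data standing in for the histologies table
histologies_df <- data.frame(
  sample_id = paste0("S", 1:25),
  cancer_group = c(rep("Medulloblastoma", 12), rep("Ependymoma", 10),
                   "Chordoma", "Chordoma", NA),
  stringsAsFactors = FALSE
)

min_n <- 10  # cutoff discussed above (assumed name)

# Count samples per cancer_group; table() drops NA by default
group_counts <- table(histologies_df$cancer_group)
keep_groups <- names(group_counts)[group_counts >= min_n]

# Groups at or above the cutoff keep their name;
# everything else (including NA) becomes "Other"
histologies_df$display_group <- ifelse(
  histologies_df$cancer_group %in% keep_groups,
  histologies_df$cancer_group,
  "Other"
)

# Named palette; the gray used for "Other" is a placeholder
display_palette <- c(
  Medulloblastoma = "#9426fb",
  Ependymoma = "#2200ff",
  Other = "#b5b5b5"
)
histologies_df$hex_code <- unname(display_palette[histologies_df$display_group])
```

Keeping the palette as a named vector means every figure can look colors up by display_group rather than relying on row order.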
I think I have to check on which cancer_group are >=10, but I think for all figures other than the oncoprint, this may be fine. For the oncoprint, we should still annotate as the specific cancer. However, I think we also want to keep the broad_histology for some plots, e.g., the transcriptomic overview, so all points have a color, and the GSEA plot, which I was thinking could have two annotations: one for broad, one more detailed. But this also depends on whether the >=10 gets more detailed or if it is just the same samples being colored.
Here's the list of cancer_group values with n > 10. I'm using release-v21-20210820 pbta-histologies.tsv, and I've then checked for overlap with display_group (derived from broad_histology) below:
# cancer_group n>10 not in display_group
> pbta_hist %>% select(Kids_First_Participant_ID,cancer_group) %>%
unique() %>%
group_by(cancer_group) %>%
tally() %>% filter(n>10) %>%
filter(!cancer_group %in% histologies_color_key_df$display_group)
# A tibble: 14 × 2
cancer_group n
<chr> <int>
1 Atypical Teratoid Rhabdoid Tumor 28
2 Choroid plexus papilloma 14
3 CNS Embryonal tumor 13
4 Craniopharyngioma 38
5 Diffuse midline glioma 55
6 Dysembryoplastic neuroepithelial tumor 26
7 Ependymoma 86
8 Ganglioglioma 46
9 High-grade glioma astrocytoma 84
10 Low-grade glioma astrocytoma 229
11 Medulloblastoma 118
12 Neurofibroma Plexiform 19
13 Schwannoma 16
14 NA 815
# cancer_group n>10 in display_group
> pbta_hist %>% select(Kids_First_Participant_ID,cancer_group) %>%
unique() %>%
group_by(cancer_group) %>%
tally() %>% filter(n>10) %>%
filter(cancer_group %in% histologies_color_key_df$display_group)
# A tibble: 1 × 2
cancer_group n
<chr> <int>
1 Meningioma 27
I think I have to check on which cancer_group are >=10, but I think for all figures other than the oncoprint, this may be fine. For the oncoprint, we should still annotate as the specific cancer.
If the specific cancer does not meet this criterion, we could include the oncoprint in the supplemental material instead and possibly even split up by cancer_group, rather than broad_histology, which would allow us to avoid having to worry about colors for specific cancer groups with N < 10.
However, I think we also want to keep the broad_histology for some plots - eg the transcriptomic overview, so all points have a color, and the GSEA plot, I was thinking could have two annotations - one for broad, one more detailed. But, this also depends on whether the >=10 gets more detailed or if it is just the same samples being colored.
Depending on what you find, I'd consider: How would we have two separate palettes without the colors overlapping (or being very close)? I am concerned about potentially causing confusion in the main text. It is also worth considering:
cancer_group (or even broad_histology if done with care) so we are not so reliant on color (like the OncoPrint example above ☝🏻)?
cancer_group on the UMAP plot?
A few semi-random thoughts on this:
The notebook to assign colors was originally designed for the case where we did not know what the final groups were going to be, and names and numbers were changing. The fact that it assigned colors semi-randomly was a function of that constraint. At this stage, we should not be doing random assignment, and we will likely want to have a separate table(s) with cancer_group, display_group and the assigned colors.
cancer_group is fully nested within broad_histology (each cancer group is within a single broad histology), this can be one table, and this will also allow us to see more easily how the two assignments interact. If the oncoprint is the only place where we are using all of the cancer groups, then we can avoid a lot of interpretation trouble by ensuring that the order of the cancer groups in the legend is the same as the order in the figures (with ordered factors, by count of each type?)
broad_histology, then we might not want to keep these two color palettes related by hue, as that will make distinguishing within a group in this figure harder.
Finally, I would just reiterate that we should distinguish where colors are required for visualization, and where they are an aid. In the former case (scatter plots, e.g. UMAP), we can't have more than about 15 colors and maintain interpretability (really more like 8-10). In the latter case (bar plots, etc.), we should expect labels to do the heavy lifting, as @jaclyn-taroni said, with colors to make scanning across figures a bit easier for those cases that are distinct.
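The "ordered factors, by count of each type" suggestion could look roughly like the sketch below. The toy vector and its counts are made up for illustration and do not come from the histologies file.

```r
# Hypothetical toy vector of cancer_group labels (counts are invented)
cancer_group <- c(
  rep("Low-grade glioma astrocytoma", 5),
  rep("Medulloblastoma", 3),
  rep("Ependymoma", 2),
  "Schwannoma"
)

# Sort the groups from most to least frequent
group_counts <- sort(table(cancer_group), decreasing = TRUE)

# Use that order as the factor levels so that legends and axes
# built from this factor list the groups in the same order
cancer_group_fct <- factor(
  cancer_group,
  levels = names(group_counts),
  ordered = TRUE
)

levels(cancer_group_fct)
```

Plotting libraries such as ggplot2 take legend order from the factor levels, so setting the levels once in a shared table should keep the order consistent across figures.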
For everyone's reference, re: the relationship between broad_histology and cancer_group (using v21)
histologies_df %>%
filter(sample_type == "Tumor") %>%
select(sample_id, broad_histology, cancer_group) %>%
distinct() %>%
group_by(broad_histology, cancer_group) %>%
tally()
broad_histology cancer_group n
Benign tumor Adenoma 4
Benign tumor Atypical choroid plexus papilloma 2
Benign tumor Choroid plexus papilloma 14
Benign tumor NA 19
Chordoma Chordoma 6
Choroid plexus tumor Choroid plexus carcinoma 4
Choroid plexus tumor Choroid plexus cyst 1
Diffuse astrocytic and oligodendroglial tumor Diffuse intrinsic pontine glioma 10
Diffuse astrocytic and oligodendroglial tumor Diffuse midline glioma 82
Diffuse astrocytic and oligodendroglial tumor High-grade glioma astrocytoma 103
Diffuse astrocytic and oligodendroglial tumor Oligodendroglioma 2
Embryonal tumor Atypical Teratoid Rhabdoid Tumor 32
Embryonal tumor CNS Embryonal tumor 13
Embryonal tumor CNS neuroblastoma 3
Embryonal tumor Embryonal tumor with multilayer rosettes 7
Embryonal tumor Ganglioneuroblastoma 3
Embryonal tumor Medulloblastoma 127
Embryonal tumor Neuroblastoma 2
Ependymal tumor Ependymoma 97
Germ cell tumor Germinoma 4
Germ cell tumor Germinoma-Teratoma 1
Germ cell tumor Teratoma 10
Histiocytic tumor Juvenile xanthogranuloma 2
Histiocytic tumor Langerhans Cell histiocytosis 4
Histiocytic tumor Rosai-Dorfman disease 1
Low-grade astrocytic tumor Diffuse fibrillary astrocytoma 1
Low-grade astrocytic tumor Ganglioglioma 50
Low-grade astrocytic tumor Low-grade glioma astrocytoma 248
Low-grade astrocytic tumor Pilocytic astrocytoma 3
Low-grade astrocytic tumor Pleomorphic xanthoastrocytoma 2
Low-grade astrocytic tumor Subependymal Giant Cell Astrocytoma 4
Lymphoma CNS Burkitt's lymphoma 1
Melanocytic tumor Melanocytic tumor 1
Meningioma Meningioma 32
Mesenchymal non-meningothelial tumor Cavernoma 2
Mesenchymal non-meningothelial tumor Ewing sarcoma 11
Mesenchymal non-meningothelial tumor Fibromyxoid lesion 1
Mesenchymal non-meningothelial tumor Hemangioblastoma 3
Mesenchymal non-meningothelial tumor Myofibroblastoma 1
Mesenchymal non-meningothelial tumor Rhabdomyosarcoma 2
Mesenchymal non-meningothelial tumor Sarcoma 6
Metastatic tumors Metastatic secondary tumors 5
Metastatic tumors Metastatic secondary tumors-Neuroblastoma 3
Neuronal and mixed neuronal-glial tumor Desmoplastic infantile astrocytoma and ganglioglioma 3
Neuronal and mixed neuronal-glial tumor Diffuse leptomeningeal glioneuronal tumor 1
Neuronal and mixed neuronal-glial tumor Dysembryoplastic neuroepithelial tumor 26
Neuronal and mixed neuronal-glial tumor Dysplasia Gliosis-Glial-neuronal tumor NOS 1
Neuronal and mixed neuronal-glial tumor Glial-neuronal tumor NOS 9
Neuronal and mixed neuronal-glial tumor Neurocytoma 3
Neuronal and mixed neuronal-glial tumor Rosette-forming glioneuronal tumor 1
Non-CNS tumor Myxoid spindle cell tumor 1
Non-tumor Arteriovenous malformation 1
Non-tumor Epilepsy 1
Non-tumor Reactive connective tissue 1
Other tumor Ganglioneuroma 1
Pre-cancerous lesion NA 14
Tumor of cranial and paraspinal nerves Malignant peripheral nerve sheath tumor 4
Tumor of cranial and paraspinal nerves Neurofibroma Plexiform 23
Tumor of cranial and paraspinal nerves Schwannoma 19
Tumor of pineal region Pineoblastoma 4
Tumors of sellar region Craniopharyngioma 38
Edit: then filtering by n >= 10 (the rows below include groups with exactly 10 samples):
broad_histology cancer_group n
<chr> <chr> <int>
1 Benign tumor Choroid plexus papilloma 14
2 Benign tumor NA 19
3 Diffuse astrocytic and oligodendroglial tumor Diffuse intrinsic pontine glioma 10
4 Diffuse astrocytic and oligodendroglial tumor Diffuse midline glioma 82
5 Diffuse astrocytic and oligodendroglial tumor High-grade glioma astrocytoma 103
6 Embryonal tumor Atypical Teratoid Rhabdoid Tumor 32
7 Embryonal tumor CNS Embryonal tumor 13
8 Embryonal tumor Medulloblastoma 127
9 Ependymal tumor Ependymoma 97
10 Germ cell tumor Teratoma 10
11 Low-grade astrocytic tumor Ganglioglioma 50
12 Low-grade astrocytic tumor Low-grade glioma astrocytoma 248
13 Meningioma Meningioma 32
14 Mesenchymal non-meningothelial tumor Ewing sarcoma 11
15 Neuronal and mixed neuronal-glial tumor Dysembryoplastic neuroepithelial tumor 26
16 Pre-cancerous lesion NA 14
17 Tumor of cranial and paraspinal nerves Neurofibroma Plexiform 23
18 Tumor of cranial and paraspinal nerves Schwannoma 19
19 Tumors of sellar region Craniopharyngioma 38
So that is 19 using that methodology, not 18, but it would be 17 if the rows where cancer_group is NA were dropped.
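For anyone wanting to reproduce the "with NA vs. without NA" count, here is a rough base R sketch on toy data. The data frame below is invented and much smaller than the real table, so it yields 3 and 2 rather than 19 and 17; only the counting logic is meant to carry over.

```r
# Hypothetical toy data: one row per sample
toy_df <- data.frame(
  sample_id = paste0("S", 1:32),
  broad_histology = c(rep("Benign tumor", 20), rep("Meningioma", 10),
                      rep("Chordoma", 2)),
  cancer_group = c(rep(NA, 10), rep("Choroid plexus papilloma", 10),
                   rep("Meningioma", 10), rep("Chordoma", 2)),
  stringsAsFactors = FALSE
)

# Cross-tabulate, keeping NA as its own cancer_group category
pair_counts <- as.data.frame(
  table(
    broad_histology = toy_df$broad_histology,
    cancer_group = toy_df$cancer_group,
    useNA = "ifany"
  ),
  responseName = "n",
  stringsAsFactors = FALSE
)

# Keep only pairs that meet the cutoff
pair_counts <- pair_counts[pair_counts$n >= 10, ]

n_with_na <- nrow(pair_counts)
n_without_na <- sum(!is.na(pair_counts$cancer_group))
```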
Okay, the individual colors here need work to be sure, but I'm going to post to convey the point. We could create a broad_histology palette used for things like the UMAP plot where it is comprised of only broad_histology where N >= 10 in at least 1 cancer_group within the broad_histology to get an 11 color palette like so:
broad_histology | HSV value | hex code |
---|---|---|
Benign tumor | hsv(336°,100%,35%) | #590024 |
Diffuse astrocytic and oligodendroglial tumor | hsv(312°,50%,100%) | #ff80e5 |
Embryonal tumor | hsv(272°,100%,25%) | #220040 |
Ependymal tumor | hsv(248°,100%,100%) | #2200ff |
Germ cell tumor | hsv(208°,100%,85%) | #0074d9 |
Low-grade astrocytic tumor | hsv(208°,25%,75%) | #8fa9bf |
Meningioma | hsv(168°,75%,70%) | #2db398 |
Mesenchymal non-meningothelial tumor | hsv(80°,100%,75%) | #7fbf00 |
Neuronal and mixed neuronal-glial tumor | hsv(48°,100%,20%) | #332900 |
Tumor of cranial and paraspinal nerves | hsv(40°,100%,100%) | #ffaa00 |
Tumors of sellar region | hsv(16°,75%,70%) | #b2502d |
And then from there we could adjust hue/saturation per @jashapiro's comment, to obtain a cancer_group palette of 17 colors where each of these cancer_group N >= 10:
broad_histology | cancer_group | HSV value | hex code |
---|---|---|---|
Benign tumor | Choroid plexus papilloma | hsv(337°, 49%, 35%) | #592d3e |
Diffuse astrocytic and oligodendroglial tumor | Diffuse intrinsic pontine glioma | hsv(312°, 20%, 100%) | #ffccf5 |
Diffuse astrocytic and oligodendroglial tumor | Diffuse midline glioma | hsv(312°, 75%, 100%) | #ff40d9 |
Diffuse astrocytic and oligodendroglial tumor | High-grade glioma astrocytoma | hsv(312°, 100%, 75%) | #bf0099 |
Embryonal tumor | Atypical Teratoid Rhabdoid Tumor | hsv(272°, 90%, 52%) | #4d0d85 |
Embryonal tumor | CNS Embryonal tumor | hsv(272°, 67%, 68%) | #7739ad |
Embryonal tumor | Medulloblastoma | hsv(271°, 25%, 45%) | #655673 |
Ependymal tumor | Ependymoma | hsv(248°,100%,100%) | #2200ff |
Germ cell tumor | Teratoma | hsv(208°, 75%, 85%) | #368dd9 |
Low-grade astrocytic tumor | Ganglioglioma | hsv(208°, 50%, 75%) | #6093bf |
Low-grade astrocytic tumor | Low-grade glioma astrocytoma | hsv(208°, 100%, 75%) | #0066bf |
Meningioma | Meningioma | hsv(168°,75%,70%) | #2db398 |
Mesenchymal non-meningothelial tumor | Ewing sarcoma | hsv(80°, 50%, 75%) | #9fbf60 |
Neuronal and mixed neuronal-glial tumor | Dysembryoplastic neuroepithelial tumor | hsv(48°, 49%, 20%) | #332e1a |
Tumor of cranial and paraspinal nerves | Neurofibroma Plexiform | hsv(40°, 75%, 90%) | #e6ac39 |
Tumor of cranial and paraspinal nerves | Schwannoma | hsv(48°, 98%, 56%) | #8f7303 |
Tumors of sellar region | Craniopharyngioma | hsv(16°, 100%, 70%) | #b33000 |
(Note: Some of the broad_histology to cancer_group are 1-to-1 mappings and we could consider using the same hex code between the palettes.)
And for cancer_group or broad_histology labels that don't make the cutoff based on sample size, we should devise ways to break those plots out individually as needed.
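To find the 1-to-1 mappings mentioned in the note above, one option is to count how many cancer_group rows each broad_histology has in the palette table. This is only a sketch on a four-row subset of the table above; `palette_df` and `share_hex` are hypothetical names.

```r
# Subset of the cancer_group palette table above (hypothetical object name)
palette_df <- data.frame(
  broad_histology = c("Ependymal tumor", "Meningioma",
                      "Embryonal tumor", "Embryonal tumor"),
  cancer_group = c("Ependymoma", "Meningioma",
                   "Medulloblastoma", "CNS Embryonal tumor"),
  stringsAsFactors = FALSE
)

# Number of cancer groups listed under each broad histology
groups_per_histology <- table(palette_df$broad_histology)

# Broad histologies with exactly one cancer group are 1-to-1 mappings
one_to_one <- names(groups_per_histology)[groups_per_histology == 1]

# Flag rows where the same hex code could be reused in both palettes
palette_df$share_hex <- palette_df$broad_histology %in% one_to_one
```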
This seems like a good solution!
Sounds like a great plan!
Wondering if the blue hues might be a little misleading to read, since similar hues are used for Ependymal tumor, Germ cell tumor, and the LGG groups?
broad_histology | cancer_group | HSV value | hex code |
---|---|---|---|
Ependymal tumor | Ependymoma | hsv(248°,100%,100%) | #2200ff |
Germ cell tumor | Teratoma | hsv(208°, 75%, 85%) | #368dd9 |
Low-grade astrocytic tumor | Ganglioglioma | hsv(208°, 50%, 75%) | #6093bf |
Low-grade astrocytic tumor | Low-grade glioma astrocytoma | hsv(208°, 100%, 75%) | #0066bf |
(Note: Some of the broad_histology to cancer_group are 1-to-1 mappings and we could consider using the same hex code between the palettes.) I would lean toward doing this.
A couple of color choices that I would worry about:
Embryonal tumor/Medulloblastoma: #655673 reads very grey to me, so it might not be super visible with an "other" class. Maybe go more intense with something like H271 S96 L57 #9426fb
Neuronal and mixed neuronal-glial tumor: #332900 seems very dark, almost black on my screen. How about H48 S66 L25 #685815 there, and
broad_histology | cancer_group | HSV value | hex code |
---|---|---|---|
Neuronal and mixed neuronal-glial tumor | Dysembryoplastic neuroepithelial tumor | hsv(48°, 80%, 41%) | #685815 |
Tumor of cranial and paraspinal nerves | Neurofibroma Plexiform | hsv(40°, 75%, 90%) | #e6ac39 |
Tumor of cranial and paraspinal nerves | Schwannoma | hsv(40°, 100%, 67%) | #ab7200 |
Agree that the range of blues seems a bit compressed...
Colors are hard.
Attempted to make the blues situation a little bit better...
broad_histology | HSV value | hex code |
---|---|---|
Benign tumor | hsv(336°,100%,35%) | #590024 |
Diffuse astrocytic and oligodendroglial tumor | hsv(312°,50%,100%) | #ff80e5 |
Embryonal tumor | hsv(272°,100%,25%) | #220040 |
Ependymal tumor | hsv(248°,100%,100%) | #2200ff |
Germ cell tumor | hsv(208°,100%,85%) | #0074d9 |
Low-grade astrocytic tumor | hsv(240°,25%,75%) | #8f8fbf |
Meningioma | hsv(168°,75%,70%) | #2db398 |
Mesenchymal non-meningothelial tumor | hsv(80°,100%,75%) | #7fbf00 |
Neuronal and mixed neuronal-glial tumor | hsv(48°, 80%, 41%) | #685815 |
Tumor of cranial and paraspinal nerves | hsv(40°,100%,100%) | #ffaa00 |
Tumors of sellar region | hsv(16°,75%,70%) | #b2502d |
broad_histology | cancer_group | HSV value | hex code |
---|---|---|---|
Benign tumor | Choroid plexus papilloma | hsv(337°, 49%, 35%) | #592d3e |
Diffuse astrocytic and oligodendroglial tumor | Diffuse intrinsic pontine glioma | hsv(312°, 20%, 100%) | #ffccf5 |
Diffuse astrocytic and oligodendroglial tumor | Diffuse midline glioma | hsv(312°, 75%, 100%) | #ff40d9 |
Diffuse astrocytic and oligodendroglial tumor | High-grade glioma astrocytoma | hsv(312°, 100%, 75%) | #bf0099 |
Embryonal tumor | Atypical Teratoid Rhabdoid Tumor | hsv(272°, 90%, 52%) | #4d0d85 |
Embryonal tumor | CNS Embryonal tumor | hsv(272°, 67%, 68%) | #7739ad |
Embryonal tumor | Medulloblastoma | hsv(271°, 85%, 98%) | #9426fb |
Ependymal tumor | Ependymoma | hsv(248°,100%,100%) | #2200ff |
Germ cell tumor | Teratoma | hsv(208°, 98%, 100%) | #058aff |
Low-grade astrocytic tumor | Ganglioglioma | hsv(240°, 45%, 100%) | #8c8cff |
Low-grade astrocytic tumor | Low-grade glioma astrocytoma | hsv(240°, 100%, 50%) | #000080 |
Meningioma | Meningioma | hsv(168°,75%,70%) | #2db398 |
Mesenchymal non-meningothelial tumor | Ewing sarcoma | hsv(80°, 50%, 75%) | #9fbf60 |
Neuronal and mixed neuronal-glial tumor | Dysembryoplastic neuroepithelial tumor | hsv(48°, 99%, 38%) | #614e01 |
Tumor of cranial and paraspinal nerves | Neurofibroma Plexiform | hsv(40°, 75%, 90%) | #e6ac39 |
Tumor of cranial and paraspinal nerves | Schwannoma | hsv(40°, 100%, 67%) | #ab7200 |
Tumors of sellar region | Craniopharyngioma | hsv(16°, 100%, 70%) | #b33000 |
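Because the HSV values and hex codes in these tables are edited by hand, it may be worth checking that they agree. The sketch below converts a hex code back to rounded HSV using `col2rgb()` and `rgb2hsv()` from grDevices (part of base R); `hex_to_hsv` is a hypothetical helper name, and the two hex codes are taken from the table above.

```r
# Hypothetical helper: convert a hex code to rounded HSV
# (hue in degrees, saturation and value in percent)
hex_to_hsv <- function(hex) {
  # col2rgb() gives a 3 x 1 matrix of 0-255 values;
  # rgb2hsv() rescales it to h, s, v in [0, 1]
  hsv_values <- grDevices::rgb2hsv(grDevices::col2rgb(hex))
  rounded <- round(as.numeric(hsv_values[, 1]) * c(360, 100, 100))
  stats::setNames(rounded, c("h", "s", "v"))
}

# Low-grade glioma astrocytoma, listed as hsv(240°, 100%, 50%)
lgg_hsv <- hex_to_hsv("#000080")

# Schwannoma, listed as hsv(40°, 100%, 67%)
schwannoma_hsv <- hex_to_hsv("#ab7200")
```

Running a check like this over the whole table would flag any row where HSL values were pasted into the HSV column, since the two color models share the same hue but differ in the other two components.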
Closing in favor of #1174 - thanks all!
Problem
The display_group palette was originally designed as a higher-level grouping that would allow us to show multiple histologies in the same figure. It was intended to be about 10-15 colors, but I think it is currently set at 18.
We now have cancer_group as well, which is currently 58 colors. cancer_group is more narrow than broad_histology (https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/917#issuecomment-881720556 & #1128). Unfortunately, we will not be able to construct a 58 color palette without there being some difficulties distinguishing between different labels even before taking designing for accessibility into account.
It is also the case that there are some hex codes shared between the display_group and cancer_group, but since they are randomly assigned, a color might be indicative of a different, unrelated (e.g., samples are non-overlapping) label between figures https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1171#issuecomment-912752284.
Idea
We need to identify some minimal set of groups that can be assigned a hex code, most likely a _subset of cancer_group_. This idea is inspired by the current Fig 3A in the draft figures @jharenza has in Google Slides, where the interaction plot groups many individual cancer_group values into Other.
The minimal cancer_group color palette will be intended to allow readers to scan & connect information about groups across main display figures. Within panels such as box plots and bar plots, we should rely very heavily on text labels.
A constraint we have is that we will need to include cancer_group in the Oncoprint figures https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1171#issuecomment-914283630. We may also want to consider if analyzing within cancer_group is appropriate in some cases (e.g., interaction plots, which is how this started in #917).
Proposed next steps
In the other CNS panel of the oncoprint figure (currently Fig 2 in the slides), we could drop any cancer group with fewer than 5 samples with mutations for the main display item. By my count, we'd then have to create a 15 color palette (3 in the LGAT panel, 4 in the embryonal panel, 3 in the diffuse astrocytic and oligodendroglial tumor panel and 5 in other [current fig on GitHub]). That then becomes our minimal number of hex codes (15) and we can try to optimize for distinct colors within broad histology.
When we plot more than those 15 cancer groups, we would only color the 15 groups included in the oncoprint and all other cancer groups would remain a gray color used for Other. So for example, plots that would follow this convention include:
If folks agree with this idea, we'll overhaul the palette generation and documentation in figures accordingly (and I'll replace this issue with a new one geared towards how to make those changes). Tagging @jharenza @jashapiro and @kgaonkar6