AlexsLemonade / OpenPBTA-analysis

The analysis repository for the Open Pediatric Brain Tumor Atlas Project

Rework cancer group / display group palettes to identify a minimal number of colors for multi-group display #1174

Closed jaclyn-taroni closed 3 years ago

jaclyn-taroni commented 3 years ago

Problem

The display_group palette was originally designed as a higher-level grouping that would allow us to show multiple histologies in the same figure. It was intended to be about 10-15 colors, but I think it is currently set at 18.

We now have cancer_group as well, which currently has 58 colors. cancer_group is narrower than broad_histology (https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/917#issuecomment-881720556 & #1128). Unfortunately, we will not be able to construct a 58-color palette in which every label is easy to distinguish, even before taking accessibility into account.

Some hex codes are also shared between display_group and cancer_group. Because they are randomly assigned, the same color can indicate a different, unrelated label (e.g., one with non-overlapping samples) from one figure to the next https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1171#issuecomment-912752284.

Idea

We need to identify some minimal set of groups that can be assigned a hex code, most likely a subset of cancer_group. This idea is inspired by the current Fig 3A in the draft figures @jharenza has in Google Slides, where the interaction plot groups many individual cancer_group values into Other.

The minimal cancer_group color palette will be intended to allow readers to scan & connect information about groups across main display figures. Within panels such as box plots and bar plots, we should rely very heavily on text labels.

A constraint we have is that we will need to include cancer_group in the Oncoprint figures https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1171#issuecomment-914283630. We may also want to consider if analyzing within cancer_group is appropriate in some cases (e.g., interaction plots, which is how this started in #917).

Proposed next steps

In the other CNS panel of the oncoprint figure (currently Fig 2 in the slides), we could drop any cancer group with fewer than 5 samples with mutations for the main display item. By my count, we'd then have to create a 15 color palette (3 in the LGAT panel, 4 in the embryonal panel, 3 in the diffuse astrocytic and oligodendroglial tumor panel and 5 in other [current fig on GitHub]). That then becomes our minimal number of hex codes (15) and we can try to optimize for distinct colors within broad histology.

When we plot more than those 15 cancer groups, we would only color the 15 groups included in the oncoprint and all other cancer groups would remain a gray color used for Other. So for example, plots that would follow this convention include:

If folks agree with this idea, we'll overhaul the palette generation and documentation in figures accordingly (and I'll replace this issue with a new one geared towards how to make those changes). Tagging @jharenza @jashapiro and @kgaonkar6

jaclyn-taroni commented 3 years ago

I'll also note another, possibly more straightforward way to limit the palette would be to only consider cancer_group where N >= 10 (18 colors).

kgaonkar6 commented 3 years ago

> I'll also note another, possibly more straightforward way to limit the palette would be to only consider cancer_group where N >= 10 (18 colors).

I like this idea: it is more general, and we will capture most of the data without prior histology-based selection.

jaclyn-taroni commented 3 years ago

I know I said I would file a new issue about how to do this, but thinking ahead a tad: functionally, if we were to have an individual hex code for 18 cancer_group values (or 15, depending on the outcome of discussion) and then one (gray) hex code for all other cancer_group values, that would become the display_group. So the notebook for creating display_group would get updated to use cancer_group instead of broad_histology.
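
A minimal sketch of that collapsing step, in base R. The data frame, its column names, and the colors here are all hypothetical stand-ins (the real input would be pbta-histologies.tsv, and the real hex codes are still under discussion); it only illustrates the idea of keeping groups with N >= 10 and sending everything else, including NA, to a single gray "Other".

```r
# Hypothetical toy data standing in for pbta-histologies.tsv
hist_df <- data.frame(
  participant_id = paste0("PT_", 1:25),
  cancer_group = c(rep("Medulloblastoma", 12), rep("Ependymoma", 10),
                   rep("Chordoma", 2), NA),
  stringsAsFactors = FALSE
)

min_n <- 10  # threshold discussed in this thread

# Count participants per cancer_group; table() drops NA by default
group_counts <- table(hist_df$cancer_group)
keep_groups <- names(group_counts)[group_counts >= min_n]

# Groups below the threshold, and NA, collapse into "Other"
hist_df$display_group <- ifelse(
  hist_df$cancer_group %in% keep_groups,
  hist_df$cancer_group,
  "Other"
)

# One hex code per retained group plus a single gray for "Other".
# rainbow() and the gray are placeholders, not the palette chosen here.
palette <- c(
  setNames(rainbow(length(keep_groups)), keep_groups),
  Other = "#b5b5b5"
)
hist_df$hex_code <- unname(palette[hist_df$display_group])

table(hist_df$display_group)
```

In this toy example only Ependymoma and Medulloblastoma keep their own color; the two Chordoma rows and the NA row share the gray.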

jharenza commented 3 years ago

I think I have to check on which cancer_group are >=10, but I think for all figures other than the oncoprint, this may be fine. For the oncoprint, we should still annotate as the specific cancer. However, I think we also want to keep the broad_histology for some plots - eg the transcriptomic overview, so all points have a color, and the GSEA plot, I was thinking could have two annotations - one for broad, one more detailed. But, this also depends on whether the >=10 gets more detailed or if it is just the same samples being colored.

kgaonkar6 commented 3 years ago

Here's the list of cancer_group values with n > 10. I'm using pbta-histologies.tsv from release-v21-20210820, and I've checked for overlap with display_group (derived from broad_histology) below:

# cancer_group n>10 not in display_group
> pbta_hist %>% select(Kids_First_Participant_ID,cancer_group) %>% 
                               unique() %>% 
                               group_by(cancer_group) %>% 
                               tally() %>% filter(n>10) %>% 
                               filter(!cancer_group %in% histologies_color_key_df$display_group)
# A tibble: 14 × 2
   cancer_group                               n
   <chr>                                  <int>
 1 Atypical Teratoid Rhabdoid Tumor          28
 2 Choroid plexus papilloma                  14
 3 CNS Embryonal tumor                       13
 4 Craniopharyngioma                         38
 5 Diffuse midline glioma                    55
 6 Dysembryoplastic neuroepithelial tumor    26
 7 Ependymoma                                86
 8 Ganglioglioma                             46
 9 High-grade glioma astrocytoma             84
10 Low-grade glioma astrocytoma             229
11 Medulloblastoma                          118
12 Neurofibroma Plexiform                    19
13 Schwannoma                                16
14 NA                                       815

# cancer_group n>10 in display_group
> pbta_hist %>% select(Kids_First_Participant_ID,cancer_group) %>% 
                              unique() %>% 
                              group_by(cancer_group) %>%
                              tally() %>% filter(n>10) %>% 
                              filter(cancer_group %in% histologies_color_key_df$display_group)
# A tibble: 1 × 2
  cancer_group     n
  <chr>        <int>
1 Meningioma      27
jaclyn-taroni commented 3 years ago

> I think I have to check on which cancer_group are >=10, but I think for all figures other than the oncoprint, this may be fine. For the oncoprint, we should still annotate as the specific cancer.

If the specific cancer does not meet this criterion, we could include the oncoprint in the supplemental material instead and possibly even split it up by cancer_group, rather than broad_histology, which would allow us to avoid having to worry about colors for specific cancer groups with N < 10.

> However, I think we also want to keep the broad_histology for some plots - eg the transcriptomic overview, so all points have a color, and the GSEA plot, I was thinking could have two annotations - one for broad, one more detailed. But, this also depends on whether the >=10 gets more detailed or if it is just the same samples being colored.

Depending on what you find, I'd consider: How would we have two separate palettes without the colors overlapping (or being very close)? I am concerned about potentially causing confusion in the main text. It is also worth considering:

jashapiro commented 3 years ago

A few semi-random thoughts on this:

jaclyn-taroni commented 3 years ago

For everyone's reference, re: the relationship between broad_histology and cancer_group (using v21)

histologies_df %>% 
  filter(sample_type == "Tumor") %>% 
  select(sample_id, broad_histology, cancer_group) %>% 
  distinct() %>% 
  group_by(broad_histology, cancer_group) %>% 
  tally()
broad_histology cancer_group    n
Benign tumor    Adenoma 4
Benign tumor    Atypical choroid plexus papilloma   2
Benign tumor    Choroid plexus papilloma    14
Benign tumor    NA  19
Chordoma    Chordoma    6
Choroid plexus tumor    Choroid plexus carcinoma    4
Choroid plexus tumor    Choroid plexus cyst 1
Diffuse astrocytic and oligodendroglial tumor   Diffuse intrinsic pontine glioma    10
Diffuse astrocytic and oligodendroglial tumor   Diffuse midline glioma  82
Diffuse astrocytic and oligodendroglial tumor   High-grade glioma astrocytoma   103
Diffuse astrocytic and oligodendroglial tumor   Oligodendroglioma   2
Embryonal tumor Atypical Teratoid Rhabdoid Tumor    32
Embryonal tumor CNS Embryonal tumor 13
Embryonal tumor CNS neuroblastoma   3
Embryonal tumor Embryonal tumor with multilayer rosettes    7
Embryonal tumor Ganglioneuroblastoma    3
Embryonal tumor Medulloblastoma 127
Embryonal tumor Neuroblastoma   2
Ependymal tumor Ependymoma  97
Germ cell tumor Germinoma   4
Germ cell tumor Germinoma-Teratoma  1
Germ cell tumor Teratoma    10
Histiocytic tumor   Juvenile xanthogranuloma    2
Histiocytic tumor   Langerhans Cell histiocytosis   4
Histiocytic tumor   Rosai-Dorfman disease   1
Low-grade astrocytic tumor  Diffuse fibrillary astrocytoma  1
Low-grade astrocytic tumor  Ganglioglioma   50
Low-grade astrocytic tumor  Low-grade glioma astrocytoma    248
Low-grade astrocytic tumor  Pilocytic astrocytoma   3
Low-grade astrocytic tumor  Pleomorphic xanthoastrocytoma   2
Low-grade astrocytic tumor  Subependymal Giant Cell Astrocytoma 4
Lymphoma    CNS Burkitt's lymphoma  1
Melanocytic tumor   Melanocytic tumor   1
Meningioma  Meningioma  32
Mesenchymal non-meningothelial tumor    Cavernoma   2
Mesenchymal non-meningothelial tumor    Ewing sarcoma   11
Mesenchymal non-meningothelial tumor    Fibromyxoid lesion  1
Mesenchymal non-meningothelial tumor    Hemangioblastoma    3
Mesenchymal non-meningothelial tumor    Myofibroblastoma    1
Mesenchymal non-meningothelial tumor    Rhabdomyosarcoma    2
Mesenchymal non-meningothelial tumor    Sarcoma 6
Metastatic tumors   Metastatic secondary tumors 5
Metastatic tumors   Metastatic secondary tumors-Neuroblastoma   3
Neuronal and mixed neuronal-glial tumor Desmoplastic infantile astrocytoma and ganglioglioma    3
Neuronal and mixed neuronal-glial tumor Diffuse leptomeningeal glioneuronal tumor   1
Neuronal and mixed neuronal-glial tumor Dysembryoplastic neuroepithelial tumor  26
Neuronal and mixed neuronal-glial tumor Dysplasia Gliosis-Glial-neuronal tumor NOS  1
Neuronal and mixed neuronal-glial tumor Glial-neuronal tumor NOS    9
Neuronal and mixed neuronal-glial tumor Neurocytoma 3
Neuronal and mixed neuronal-glial tumor Rosette-forming glioneuronal tumor  1
Non-CNS tumor   Myxoid spindle cell tumor   1
Non-tumor   Arteriovenous malformation  1
Non-tumor   Epilepsy    1
Non-tumor   Reactive connective tissue  1
Other tumor Ganglioneuroma  1
Pre-cancerous lesion    NA  14
Tumor of cranial and paraspinal nerves  Malignant peripheral nerve sheath tumor 4
Tumor of cranial and paraspinal nerves  Neurofibroma Plexiform  23
Tumor of cranial and paraspinal nerves  Schwannoma  19
Tumor of pineal region  Pineoblastoma   4
Tumors of sellar region Craniopharyngioma   38

Edit: then filtering by n >= 10:

   broad_histology                               cancer_group                               n
   <chr>                                         <chr>                                  <int>
 1 Benign tumor                                  Choroid plexus papilloma                  14
 2 Benign tumor                                  NA                                        19
 3 Diffuse astrocytic and oligodendroglial tumor Diffuse intrinsic pontine glioma          10
 4 Diffuse astrocytic and oligodendroglial tumor Diffuse midline glioma                    82
 5 Diffuse astrocytic and oligodendroglial tumor High-grade glioma astrocytoma            103
 6 Embryonal tumor                               Atypical Teratoid Rhabdoid Tumor          32
 7 Embryonal tumor                               CNS Embryonal tumor                       13
 8 Embryonal tumor                               Medulloblastoma                          127
 9 Ependymal tumor                               Ependymoma                                97
10 Germ cell tumor                               Teratoma                                  10
11 Low-grade astrocytic tumor                    Ganglioglioma                             50
12 Low-grade astrocytic tumor                    Low-grade glioma astrocytoma             248
13 Meningioma                                    Meningioma                                32
14 Mesenchymal non-meningothelial tumor          Ewing sarcoma                             11
15 Neuronal and mixed neuronal-glial tumor       Dysembryoplastic neuroepithelial tumor    26
16 Pre-cancerous lesion                          NA                                        14
17 Tumor of cranial and paraspinal nerves        Neurofibroma Plexiform                    23
18 Tumor of cranial and paraspinal nerves        Schwannoma                                19
19 Tumors of sellar region                       Craniopharyngioma                         38

So that methodology gives 19, not 18, and it would be 17 if the rows where cancer_group is NA were dropped.

jaclyn-taroni commented 3 years ago

Okay, the individual colors here need work to be sure, but I'm going to post to convey the point. We could create a broad_histology palette for things like the UMAP plot that includes only broad_histology values where at least 1 cancer_group within the broad_histology has N >= 10. That gives an 11 color palette like so:

| broad_histology | HSV value | hex code |
|---|---|---|
| Benign tumor | hsv(336°,100%,35%) | #590024 |
| Diffuse astrocytic and oligodendroglial tumor | hsv(312°,50%,100%) | #ff80e5 |
| Embryonal tumor | hsv(272°,100%,25%) | #220040 |
| Ependymal tumor | hsv(248°,100%,100%) | #2200ff |
| Germ cell tumor | hsv(208°,100%,85%) | #0074d9 |
| Low-grade astrocytic tumor | hsv(208°,25%,75%) | #8fa9bf |
| Meningioma | hsv(168°,75%,70%) | #2db398 |
| Mesenchymal non-meningothelial tumor | hsv(80°,100%,75%) | #7fbf00 |
| Neuronal and mixed neuronal-glial tumor | hsv(48°,100%,20%) | #332900 |
| Tumor of cranial and paraspinal nerves | hsv(40°,100%,100%) | #ffaa00 |
| Tumors of sellar region | hsv(16°,75%,70%) | #b2502d |

And then from there we could adjust hue/saturation per @jashapiro's comment, to obtain a cancer_group palette of 17 colors, one for each cancer_group with N >= 10:

| broad_histology | cancer_group | HSV value | hex code |
|---|---|---|---|
| Benign tumor | Choroid plexus papilloma | hsv(337°, 49%, 35%) | #592d3e |
| Diffuse astrocytic and oligodendroglial tumor | Diffuse intrinsic pontine glioma | hsv(312°, 20%, 100%) | #ffccf5 |
| Diffuse astrocytic and oligodendroglial tumor | Diffuse midline glioma | hsv(312°, 75%, 100%) | #ff40d9 |
| Diffuse astrocytic and oligodendroglial tumor | High-grade glioma astrocytoma | hsv(312°, 100%, 75%) | #bf0099 |
| Embryonal tumor | Atypical Teratoid Rhabdoid Tumor | hsv(272°, 90%, 52%) | #4d0d85 |
| Embryonal tumor | CNS Embryonal tumor | hsv(272°, 67%, 68%) | #7739ad |
| Embryonal tumor | Medulloblastoma | hsv(271°, 25%, 45%) | #655673 |
| Ependymal tumor | Ependymoma | hsv(248°,100%,100%) | #2200ff |
| Germ cell tumor | Teratoma | hsv(208°, 75%, 85%) | #368dd9 |
| Low-grade astrocytic tumor | Ganglioglioma | hsv(208°, 50%, 75%) | #6093bf |
| Low-grade astrocytic tumor | Low-grade glioma astrocytoma | hsv(208°, 100%, 75%) | #0066bf |
| Meningioma | Meningioma | hsv(168°,75%,70%) | #2db398 |
| Mesenchymal non-meningothelial tumor | Ewing sarcoma | hsv(80°, 50%, 75%) | #9fbf60 |
| Neuronal and mixed neuronal-glial tumor | Dysembryoplastic neuroepithelial tumor | hsv(48°, 49%, 20%) | #332e1a |
| Tumor of cranial and paraspinal nerves | Neurofibroma Plexiform | hsv(40°, 75%, 90%) | #e6ac39 |
| Tumor of cranial and paraspinal nerves | Schwannoma | hsv(48°, 98%, 56%) | #8f7303 |
| Tumors of sellar region | Craniopharyngioma | hsv(16°, 100%, 70%) | #b33000 |

(Note: Some of the broad_histology to cancer_group are 1-to-1 mappings and we could consider using the same hex code between the palettes.)
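
For anyone who wants to check or tweak the HSV/hex pairs in the tables above, here is a small sketch using base R's hsv(), which takes hue, saturation, and value each scaled to [0, 1]. The helper name hsv_to_hex is made up for illustration. Because the percentages in the tables are rounded, a hex code rebuilt this way can differ from the listed one by a unit or two in some channels.

```r
# Hypothetical helper: degrees / percentages in, lowercase hex code out
hsv_to_hex <- function(h_deg, s_pct, v_pct) {
  # grDevices::hsv() expects h, s, v in [0, 1] and returns "#RRGGBB"
  tolower(hsv(h = h_deg / 360, s = s_pct / 100, v = v_pct / 100))
}

hsv_to_hex(336, 100, 35)   # Benign tumor, listed as #590024
hsv_to_hex(248, 100, 100)  # Ependymal tumor, listed as #2200ff
hsv_to_hex(40, 100, 100)   # Tumor of cranial and paraspinal nerves, listed as #ffaa00
```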

And for cancer_group or broad_histology labels that don't make the cutoff based on sample size, we should devise ways to break those plots out individually as needed.

jharenza commented 3 years ago

This seems like a good solution!

kgaonkar6 commented 3 years ago

Sounds like a great plan!

Wondering if the blue hues might be a little misleading to read, since similar blues are used for Ependymal tumor, Germ cell tumor, and the LGG groups?

| broad_histology | cancer_group | HSV value | hex code |
|---|---|---|---|
| Ependymal tumor | Ependymoma | hsv(248°,100%,100%) | #2200ff |
| Germ cell tumor | Teratoma | hsv(208°, 75%, 85%) | #368dd9 |
| Low-grade astrocytic tumor | Ganglioglioma | hsv(208°, 50%, 75%) | #6093bf |
| Low-grade astrocytic tumor | Low-grade glioma astrocytoma | hsv(208°, 100%, 75%) | #0066bf |
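
One rough way to put a number on how close these four blues are: convert each hex code to CIE Lab with base R's convertColor() and take pairwise Euclidean distances (the CIE76 delta E). This is only a sketch and a crude proxy; it does not model color-vision deficiency or how colors read at small point sizes, and nobody in this thread proposed a specific cutoff.

```r
blues <- c(
  Ependymoma = "#2200ff",
  Teratoma = "#368dd9",
  Ganglioglioma = "#6093bf",
  `Low-grade glioma astrocytoma` = "#0066bf"
)

# col2rgb() returns a 3 x n matrix on a 0-255 scale; convertColor() wants
# one color per row, scaled to [0, 1]
srgb <- t(col2rgb(blues)) / 255
lab <- convertColor(srgb, from = "sRGB", to = "Lab")
rownames(lab) <- names(blues)

# Euclidean distance in Lab space (CIE76 delta E); smaller means more similar
delta_e <- as.matrix(dist(lab))
round(delta_e, 1)
```

Pairs with the smallest distances are the ones most likely to be confused in a legend.
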
jashapiro commented 3 years ago

> (Note: Some of the broad_histology to cancer_group are 1-to-1 mappings and we could consider using the same hex code between the palettes.)

I would lean toward doing this.

A couple of color choices that I would worry about:

Embryonal tumor/Medulloblastoma: #655673 reads very grey to me, so it might be hard to tell apart from an "other" class. Maybe go more intense with something like H271 S96 L57 #9426fb.

Neuronal and mixed neuronal-glial tumor: #332900 seems very dark, almost black on my screen. How about H48 S66 L25 #685815 there, and:

| broad_histology | cancer_group | HSV value | hex code |
|---|---|---|---|
| Neuronal and mixed neuronal-glial tumor | Dysembryoplastic neuroepithelial tumor | hsv(48°, 80%, 41%) | #685815 |
| Tumor of cranial and paraspinal nerves | Neurofibroma Plexiform | hsv(40°, 75%, 90%) | #e6ac39 |
| Tumor of cranial and paraspinal nerves | Schwannoma | hsv(40°, 100%, 67%) | #ab7200 |

Agree that the range of blues seems a bit compressed...

Colors are hard.
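
Part of why colors are hard here: the "H48 S66 L25" style values above are HSL (hue, saturation, lightness), while the tables use HSV, and the same hex code gives different S and third-channel numbers in the two models. A small base R sketch for reading both off a hex code follows; the function name hex_to_hsv_hsl is made up for illustration, and the HSL part is the standard formula written out by hand since base R has no built-in HSL converter.

```r
# Hypothetical helper: hex code in, rounded HSV and HSL components out
hex_to_hsv_hsl <- function(hex) {
  rgb255 <- col2rgb(hex)             # 3 x 1 matrix, 0-255 scale
  hsv_vals <- rgb2hsv(rgb255)[, 1]   # named h, s, v, each in [0, 1]
  rgb01 <- rgb255[, 1] / 255
  mx <- max(rgb01)
  mn <- min(rgb01)
  l <- (mx + mn) / 2                 # HSL lightness
  s_hsl <- if (mx == mn) 0 else (mx - mn) / (1 - abs(2 * l - 1))
  round(c(
    hue   = unname(hsv_vals["h"]) * 360,  # hue is shared by HSV and HSL
    hsv_s = unname(hsv_vals["s"]) * 100,
    hsv_v = unname(hsv_vals["v"]) * 100,
    hsl_s = s_hsl * 100,
    hsl_l = l * 100
  ))
}

hex_to_hsv_hsl("#685815")  # hsl_s = 66, hsl_l = 25 matches "H48 S66 L25"
hex_to_hsv_hsl("#9426fb")  # hsl_s = 96, hsl_l = 57 matches "H271 S96 L57"
```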

jaclyn-taroni commented 3 years ago

Attempted to make the blues situation a little bit better...

| broad_histology | HSV value | hex code |
|---|---|---|
| Benign tumor | hsv(336°,100%,35%) | #590024 |
| Diffuse astrocytic and oligodendroglial tumor | hsv(312°,50%,100%) | #ff80e5 |
| Embryonal tumor | hsv(272°,100%,25%) | #220040 |
| Ependymal tumor | hsv(248°,100%,100%) | #2200ff |
| Germ cell tumor | hsv(208°,100%,85%) | #0074d9 |
| Low-grade astrocytic tumor | hsv(240°,25%,75%) | #8f8fbf |
| Meningioma | hsv(168°,75%,70%) | #2db398 |
| Mesenchymal non-meningothelial tumor | hsv(80°,100%,75%) | #7fbf00 |
| Neuronal and mixed neuronal-glial tumor | hsv(48°, 80%, 41%) | #685815 |
| Tumor of cranial and paraspinal nerves | hsv(40°,100%,100%) | #ffaa00 |
| Tumors of sellar region | hsv(16°,75%,70%) | #b2502d |

| broad_histology | cancer_group | HSV value | hex code |
|---|---|---|---|
| Benign tumor | Choroid plexus papilloma | hsv(337°, 49%, 35%) | #592d3e |
| Diffuse astrocytic and oligodendroglial tumor | Diffuse intrinsic pontine glioma | hsv(312°, 20%, 100%) | #ffccf5 |
| Diffuse astrocytic and oligodendroglial tumor | Diffuse midline glioma | hsv(312°, 75%, 100%) | #ff40d9 |
| Diffuse astrocytic and oligodendroglial tumor | High-grade glioma astrocytoma | hsv(312°, 100%, 75%) | #bf0099 |
| Embryonal tumor | Atypical Teratoid Rhabdoid Tumor | hsv(272°, 90%, 52%) | #4d0d85 |
| Embryonal tumor | CNS Embryonal tumor | hsv(272°, 67%, 68%) | #7739ad |
| Embryonal tumor | Medulloblastoma | hsv(271°, 85%, 98%) | #9426fb |
| Ependymal tumor | Ependymoma | hsv(248°,100%,100%) | #2200ff |
| Germ cell tumor | Teratoma | hsv(208°, 98%, 100%) | #058aff |
| Low-grade astrocytic tumor | Ganglioglioma | hsv(240°, 45%, 100%) | #8c8cff |
| Low-grade astrocytic tumor | Low-grade glioma astrocytoma | hsv(240°, 100%, 50%) | #000080 |
| Meningioma | Meningioma | hsv(168°,75%,70%) | #2db398 |
| Mesenchymal non-meningothelial tumor | Ewing sarcoma | hsv(80°, 50%, 75%) | #9fbf60 |
| Neuronal and mixed neuronal-glial tumor | Dysembryoplastic neuroepithelial tumor | hsv(48°, 99%, 38%) | #614e01 |
| Tumor of cranial and paraspinal nerves | Neurofibroma Plexiform | hsv(40°, 75%, 90%) | #e6ac39 |
| Tumor of cranial and paraspinal nerves | Schwannoma | hsv(40°, 100%, 67%) | #ab7200 |
| Tumors of sellar region | Craniopharyngioma | hsv(16°, 100%, 70%) | #b33000 |
jaclyn-taroni commented 3 years ago

Closing in favor of #1174 - thanks all!