clemente-lab / mmeds-meta

A database for storing and analyzing omics data
https://mmeds.org

Improved summaries: implement gradient color scale for continuous variables #356

Closed adamcantor22 closed 2 years ago

adamcantor22 commented 2 years ago

Is your feature request related to a problem? Please describe.

With the removal of taxa plots on continuous variables, we may have reached a point where the Jupyter notebook no longer hits its typical runtime errors. However, the legends for alpha/beta diversity on continuous variables are still pretty much unreadable.

Describe the solution you'd like

Instead of a random color for each individual numerical entry, each continuous variable should get a single color gradient scale.
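A minimal sketch of what a single gradient scale means, written in Python to match the notebook's setup cells: each numeric value is mapped to a color by linear interpolation between two endpoint colors. The function names are hypothetical, and the plain RGB interpolation is an assumption made for simplicity; ggplot2's gradient scales interpolate in Lab space, so the exact shades would differ.

```python
def hex_to_rgb(code):
    """Convert '#RRGGBB' to an (r, g, b) tuple of ints."""
    code = code.lstrip('#')
    return tuple(int(code[i:i + 2], 16) for i in (0, 2, 4))


def gradient_color(value, vmin, vmax, low='#132B43', high='#56B1F7'):
    """Map a numeric value to a hex color between `low` and `high`.

    Hypothetical helper: values outside [vmin, vmax] are clamped to the
    nearest endpoint, and a zero-width range maps everything to `low`.
    """
    if vmax == vmin:
        t = 0.0
    else:
        t = (value - vmin) / (vmax - vmin)
    t = min(1.0, max(0.0, t))
    lo, hi = hex_to_rgb(low), hex_to_rgb(high)
    # Interpolate each RGB channel separately
    rgb = [round(a + (b - a) * t) for a, b in zip(lo, hi)]
    return '#{:02X}{:02X}{:02X}'.format(*rgb)
```

With a scale like this, two samples with nearby values (for example neighbouring birth years) get nearby colors, and the legend can be a single colorbar instead of one entry per value.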

adamcantor22 commented 2 years ago

Connected with issues #339 and #322

cleme commented 2 years ago

Provide an example of alpha/beta div data and plot in which legends are not readable.

adamcantor22 commented 2 years ago

Alpha plot and legend: [Screenshot from 2021-12-07 13-50-49] [Screenshot from 2021-12-07 13-51-03]

Beta plot and legend: [Screenshot from 2021-12-07 13-51-20] [Screenshot from 2021-12-07 13-51-27]

cleme commented 2 years ago

We need the data used to generate that plot. So the same as the other issue: generate the DF, save to a file, then try the solution I suggested there (load with R, etc).

cleme commented 2 years ago

Also paste here the exact code used to generate each of the plots.

adamcantor22 commented 2 years ago

Reproduce Alpha plot:

  1. Download this file: https://drive.google.com/file/d/1i-gA4nR6Ln3I-wTWOxfVQRV49c-FfrkA/view?usp=sharing
  2. Run these cells:
    import pandas as pd
    df = pd.read_csv('faith_pd_df.csv')

    %%R -i df
    pd <- position_dodge(width = 50)

    p <- ggplot(data = df, aes(x = SamplingDepth, y = AverageValue, color = GroupID)) +
        geom_errorbar(aes(ymin=AverageValue-Error, ymax=AverageValue+Error), width=100, position = pd) +
        geom_point(stat='identity', position = pd, size = 1) +
        geom_line(stat='identity', position = pd) +
        facet_wrap(~GroupName) + colFill + colScale +
        labs(title = 'Alpha Diversity', subtitle = 'Grouped by Metadata Catagory') +
        theme_bw() +
        theme(legend.position = 'none',
              plot.title = element_text(hjust = 0.5),
              plot.subtitle = element_text(hjust = 0.5))

    # Save plots
    ggsave('faith_pd.png', height = 6, width = 6)

    Image('faith_pd.png')


______________________________________________________________________________

Reproduce beta plot:
1. Download this file: https://drive.google.com/file/d/1Bfij6mlDW9DDH90V3QyyQ1smccKbZPEs/view?usp=sharing
2. Run these cells:

    import pandas as pd
    df = pd.read_csv('bray_curtis_BirthYear_df.csv')

    %%R -i df
    # Create the plots for the first three PCs
    png('bray_curtis_pcoa_results-BirthYear.png', width = 6, height = 6, unit='in', res=200)
    p <- ggpairs(df[,c(1:3)],
                 upper = list(continuous = "points", combo = "box_no_facet"),
                 lower = list(continuous = "points", combo = "dot_no_facet"),
                 aes(color = df$GroupID, label = rownames(df), alpha=0.5)) +
         theme_bw() +
         theme(legend.position = 'none',
               plot.title = element_text(hjust = 0.5),
               plot.subtitle = element_text(hjust = 0.5)) +
         labs(title = 'PCA plot', subtitle = 'Colored by BirthYear')

    # Add the color palette to each of the plots
    for(i in 1:p$nrow) {
        for(j in 1:p$ncol){
            p[i,j] <- p[i,j] + colScale + colFill
        }
    }
    print(p)
    out <- dev.off()

    # Print the individual PCA plots with labels
    for(i in 1:p$nrow) {
        for(j in 1:p$ncol){
            # Only print the PCAs not the frequency distributions
            if (i > 2 && j < 3 || i > 1 && j < 2) {
                # Setup and save each individual PCA plot
                filename <- sprintf('bray_curtis_pcoa_results-BirthYear-%s-%s.png',
                                    p[i, j]$labels$x,
                                    p[i, j]$labels$y)
                png(filename, width = 6, height = 6, unit='in', res=200)
                sp <- p[i,j] + geom_text_repel() +
                          theme(legend.position = 'none',
                                plot.title = element_text(hjust = 0.5),
                                plot.subtitle = element_text(hjust = 0.5)) +
                          labs(title = sprintf('%s vs. %s',
                                               p[i, j]$labels$x,
                                               p[i, j]$labels$y),
                               subtitle = 'Colored by BirthYear')
                print(sp)
                out <- dev.off()
            }
        }
    }

    Image("bray_curtis_pcoa_results-BirthYear.png")

cleme commented 2 years ago

In alpha plot, colFill + colScale are undefined in the code snippet you provide. Paste the code that generates those.

adamcantor22 commented 2 years ago

yeah, those are in the setup cells. The first two will need to be run, which also means you need some more files.

  1. Download the files in this folder: https://drive.google.com/drive/folders/1804DIzR3VnNR92bdg0mb4ezWo4iZlnAv?usp=sharing
  2. Run these cells prior to the others:
    
    from pathlib import Path
    from copy import deepcopy
    import pandas as pd
    import rpy2.rinterface
    from math import floor
    from warnings import filterwarnings
    from IPython.display import Image
    from random import shuffle
    from PIL import Image as PImage
    from PIL import ImageDraw, ImageFont
    from mmeds.util import load_config

    filterwarnings('ignore', category=rpy2.rinterface.RRuntimeWarning)

    # Load the configuration
    config = load_config(Path('config_file.yaml'), Path('metadata.tsv'), True)

    # Load metadata file
    if 'qiime2' == 'qiime2':
        mdf = pd.read_csv('qiime_mapping_file.tsv', skiprows=[1], sep='\t', dtype={'#SampleID': 'str'})
    else:
        mdf = pd.read_csv('qiime_mapping_file.tsv', sep='\t')
    mdf.set_index('#SampleID', inplace=True)

    # Load the columns to use for analysis
    metadata_columns = sorted(config['metadata'])

    # Stores a list of values shared across groups but unique within (for graphing)
    max_colors = 0
    max_colors_con = 0

    # Create information for R color palettes
    for group_name in metadata_columns:
        if group_name in mdf and not mdf[group_name].isnull().all():
            grouping = mdf[group_name]
            uni = grouping.nunique()
            if config['metadata_continuous'][group_name]:
                # Get the category with the most colors to use for palette creation
                if uni > max_colors_con:
                    max_colors_con = uni
            else:
                # Get the category with the most colors to use for palette creation
                if uni > max_colors:
                    max_colors = uni

    all_colors_con = ['con{}'.format(i) for i in range(max_colors_con)]
    all_colors = ['color{}'.format(i) for i in range(max_colors)]

    # Load the extension for jupyter
    %load_ext rpy2.ipython

    %%R -i all_colors -i all_colors_con -o allRGB
    library(ggplot2)
    library(RColorBrewer)
    library(GGally)
    library(ggrepel)

    # Create custom color palette
    myColors <- brewer.pal(11, "Paired")
    colorMaker <- colorRampPalette(myColors)
    allColorsDisc <- colorMaker(length(unique(all_colors)))

    # Custom continuous(ish) color palette
    myColorsCon <- brewer.pal(11, "Spectral")
    colorMakerCon <- colorRampPalette(myColorsCon)
    allColorsCon <- colorMakerCon(length(unique(all_colors_con)))

    # Rename the colors to match with the groups
    names(allColorsDisc) <- all_colors
    names(allColorsCon) <- all_colors_con

    # Create the objects for graphing with the colors
    allColors <- append(allColorsDisc, allColorsCon)
    colScale <- scale_color_manual(name = ~GroupID, values = allColors)
    colFill <- scale_fill_manual(name = ~GroupID, values = allColors)

    # Get the RGB values for the colors
    allRGB <- data.frame(apply(data.frame(allColors), 1, col2rgb))

cleme commented 2 years ago

Too much code, we need to simplify again. Based on that code, generate a color palette that can be directly loaded from a file, so that when we get to the plotting it's only a matter of loading the data + loading the palette.
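One way this could look, as a sketch only: the palette is written once to a small CSV file with one row per group ID, and later cells just read it back. The function names, the file layout, and the column names `GroupID` and `Color` are assumptions for illustration, not something already present in the notebook.

```python
import csv


def save_palette(palette, path):
    """Write a {group_id: '#RRGGBB'} mapping to a two-column CSV file."""
    with open(path, 'w', newline='') as handle:
        writer = csv.writer(handle)
        writer.writerow(['GroupID', 'Color'])  # hypothetical column names
        for group_id, color in palette.items():
            writer.writerow([group_id, color])


def load_palette(path):
    """Read the CSV written by save_palette back into a dict."""
    with open(path, newline='') as handle:
        return {row['GroupID']: row['Color'] for row in csv.DictReader(handle)}
```

A call such as `save_palette({'color0': '#A6CEE3', 'con0': '#9E0142'}, 'palette.csv')` (example values only) would produce a file that can be read from Python with `load_palette` or from R with `read.csv`, so the plotting cells would only need the data file and the palette file.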

cleme commented 2 years ago

In the alpha diversity plot, the most likely explanation for the issue is that we are combining discrete and continuous variables. For example, Sex is discrete but Weight is continuous. It is not going to be possible to plot with a single scheme, because gradient coloring requires something like scale_colour_gradient, which would be applied to the full set of panels.

Two possible approaches:

cleme commented 2 years ago

In the beta diversity plot, it is possible to generate continuous coloring by using scale_color_gradient. A simplified version of the code:

Start by creating a continuous variable:

> df$color <- 0
> df$color[df$GroupID=="color1"] <- 1
> df$color[df$GroupID=="color2"] <- 2
> df$color[df$GroupID=="color3"] <- 3
> df$color[df$GroupID=="color4"] <- 4
> df$color[df$GroupID=="color5"] <- 5
> df$color[df$GroupID=="color6"] <- 6
> df$color[df$GroupID=="color7"] <- 7
> df$color[df$GroupID=="color8"] <- 8
> df$color[df$GroupID=="color9"] <- 9
> df$color[df$GroupID=="color10"] <- 10
> df$color[df$GroupID=="color11"] <- 11
> df$color[df$GroupID=="color12"] <- 12
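The assignments above could also be derived on the Python side before the data frame is handed to R. The sketch below pulls the numeric suffix out of a group ID; the helper name is hypothetical, and accepting the `con` prefix is an assumption based on the naming used in the setup cells. As in the R version, anything that does not match falls back to 0.

```python
import re


def group_id_to_number(group_id):
    """Return the numeric suffix of IDs like 'color12' or 'con3', else 0."""
    match = re.fullmatch(r'(?:color|con)(\d+)', str(group_id))
    return int(match.group(1)) if match else 0
```

With pandas this could be applied as `df['color'] = df['GroupID'].map(group_id_to_number)`. Note that this yields the index of each group rather than the underlying measurement, which is the same simplification the R snippet makes.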

then use scale_color_gradient to achieve effect:

> p <- ggpairs(df[,c(1:3)],
               upper = list(continuous = "points", combo = "box_no_facet"),
               lower = list(continuous = "points", combo = "dot_no_facet"),
               aes(color = df$color, label = rownames(df), alpha=0.5)) +
       theme_bw() +
       theme(legend.position = 'none',
             plot.title = element_text(hjust = 0.5),
             plot.subtitle = element_text(hjust = 0.5)) +
       labs(title = 'PCA plot', subtitle = 'Colored by BirthYear') +
       scale_color_gradient(low="#132B43", high="#56B1F7")
> p

[Image: test_beta]

The solution to maintain consistent coloring across different plots in the summary is to generate a priori the low and high limits of the color gradient (in this case #132B43 and #56B1F7), and then pass them to R as parameters.
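A rough sketch of computing those parameters once in Python, assuming the values of the continuous variable from every plot are available as plain lists. The function name and the returned keys are hypothetical. Besides the two endpoint colors, it also collects a common minimum and maximum, since the same colors only mean the same values across plots if the numeric range is shared as well.

```python
def shared_gradient_params(value_sets, low='#132B43', high='#56B1F7'):
    """Collect gradient endpoints and a common value range for all plots.

    `value_sets` is an iterable of value lists, one per plot. Missing
    values (None or NaN) are skipped when computing the range.
    """
    values = [v for vs in value_sets for v in vs
              if v is not None and v == v]  # v == v is False for NaN
    return {'low': low, 'high': high, 'vmin': min(values), 'vmax': max(values)}
```

After unpacking the result into variables, these could be handed to an R cell with `%%R -i low -i high -i vmin -i vmax` and used as `scale_color_gradient(low = low, high = high, limits = c(vmin, vmax))`.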

adamcantor22 commented 2 years ago

Reproducing plots:

  1. Download this file: https://drive.google.com/file/d/1uowEnWo2FtLmgKrjU761TrHh6N-jQCHb/view?usp=sharing
  2. You may need to run the setup cells given above.
  3. Run:
    import pandas as pd
    df = pd.read_csv('bray_curtis_continuous_BirthYear_df.csv')

    %%R -i df
    # Create the plots for the first three PCs
    png('bray_curtis_pcoa_results-BirthYear.png', width = 6, height = 6, unit='in', res=200)
    p <- ggpairs(df[,c(1:3)],
                 legend = 4,
                 upper = list(continuous = "points", combo = "box_no_facet"),
                 lower = list(continuous = "points", combo = "dot_no_facet"),
                 aes(color = df$variable, label = rownames(df), alpha=0.5)) +
         theme_bw() +
         theme(legend.position = 'bottom',
               plot.title = element_text(hjust = 0.5),
               plot.subtitle = element_text(hjust = 0.5)) +
         labs(title = 'PCA plot', subtitle = 'Colored by BirthYear') +
         scale_color_gradient(low="#0039D0", high="#FF0000", name="BirthYear", space = "Lab",
                              na.value = "#888888", guide = "colorbar", aesthetics = "color")

    print(p)
    out <- dev.off()

    # Print the individual PCA plots with labels
    for(i in 1:p$nrow) {
        for(j in 1:p$ncol){
            # Only print the PCAs not the frequency distributions
            if (i > 2 && j < 3 || i > 1 && j < 2) {
                # Setup and save each individual PCA plot
                filename <- sprintf('bray_curtis_pcoa_results-BirthYear-%s-%s.png',
                                    p[i, j]$labels$x,
                                    p[i, j]$labels$y)
                png(filename, width = 6, height = 6, unit='in', res=200)
                sp <- p[i,j] + geom_text_repel() +
                          theme(legend.position = 'none',
                                plot.title = element_text(hjust = 0.5),
                                plot.subtitle = element_text(hjust = 0.5)) +
                          labs(title = sprintf('%s vs. %s',
                                               p[i, j]$labels$x,
                                               p[i, j]$labels$y),
                               subtitle = 'Colored by BirthYear') +
                          scale_color_gradient(low="#0039D0", high="#FF0000", name = "BirthYear", space = "Lab",
                                               na.value = "#888888", guide = "colorbar", aesthetics = "color")
                print(sp)
                out <- dev.off()
            }
        }
    }

    Image("bray_curtis_pcoa_results-BirthYear.png")

cleme commented 2 years ago

Issue of alpha = 0.5 appearing in the legend: this is the expected behavior, since we are setting two aesthetics within aes: the color gradient (based on df$variable) and alpha fixed at 0.5 (so the points have some transparency). The legend reflects every aesthetic that is explicitly set within aes, so the solution is to remove alpha from the aesthetic and set it as a fixed value outside of it. In our case, we say the upper and the lower triangles of plots will show points with fixed alpha=0.5:

p <- ggpairs(df[,c(1:3)],
         legend = 4,
         upper = list(continuous = wrap("points", alpha=0.5)),
         lower = list(continuous = wrap("points", alpha=0.5)),
         aes(color = df$variable, label = rownames(df))) +
     theme_bw() +
     theme(legend.position = 'bottom',
           plot.title = element_text(hjust = 0.5),
           plot.subtitle = element_text(hjust = 0.5)) +
     labs(title = 'PCA plot',
          subtitle = 'Colored by BirthYear') +
     scale_color_gradient(low="#0039D0", high="#FF0000", name="BirthYear", space = "Lab", na.value = "#888888", guide = "colorbar", aesthetics = "color")

[Image: plot-birth-alpha]

So that fixes the issue of alpha showing up in the legend. Is there any other issue here?

adamcantor22 commented 2 years ago

Yes, one more thing. The colorbar doesn't appear under the individual PC plots that come after that one, only under the grid. If need be, we could probably live without that, though.

cleme commented 2 years ago

That is because there is again a legend.position = 'none' in the R code for the individual PC plots. If you remove that, you should also get the colorbar under those plots.