Closed adamcantor22 closed 2 years ago
Connected with issues #339 and #322
Provide an example of alpha/beta div data and plot in which legends are not readable.
Alpha plot and legend:
Beta plot and legend:
We need the data used to generate that plot. So the same as the other issue: generate the DF, save to a file, then try the solution I suggested there (load with R, etc).
Also paste here the exact code used to generate each of the plots.
Reproduce Alpha plot:
import pandas as pd
df = pd.read_csv('faith_pd_df.csv')
%%R -i df
pd <- position_dodge(width = 50)
p <- ggplot(data = df, aes(x = SamplingDepth, y = AverageValue, color = GroupID)) +
    geom_errorbar(aes(ymin = AverageValue - Error, ymax = AverageValue + Error), width = 100, position = pd) +
    geom_point(stat = 'identity', position = pd, size = 1) +
    geom_line(stat = 'identity', position = pd) +
    facet_wrap(~GroupName) +
    colFill + colScale +
    labs(title = 'Alpha Diversity', subtitle = 'Grouped by Metadata Category') +
    theme_bw() +
    theme(legend.position = 'none',
          plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5))
ggsave('faith_pd.png', height = 6, width = 6)
Image('faith_pd.png')
______________________________________________________________________________
Reproduce beta plot:
1. Download this file: https://drive.google.com/file/d/1Bfij6mlDW9DDH90V3QyyQ1smccKbZPEs/view?usp=sharing
2. Run these cells:
import pandas as pd
df = pd.read_csv('bray_curtis_BirthYear_df.csv')
%%R -i df
png('bray_curtis_pcoa_results-BirthYear.png', width = 6, height = 6, unit='in', res=200)
p <- ggpairs(df[,c(1:3)],
             upper = list(continuous = "points", combo = "box_no_facet"),
             lower = list(continuous = "points", combo = "dot_no_facet"),
             aes(color = df$GroupID, label = rownames(df), alpha=0.5)) +
    theme_bw() +
    theme(legend.position = 'none',
          plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5)) +
    labs(title = 'PCA plot', subtitle = 'Colored by BirthYear')
# Apply the shared color and fill scales to every panel of the grid
for(i in 1:p$nrow) {
    for(j in 1:p$ncol){
        p[i,j] <- p[i,j] + colScale + colFill
    }
}
print(p)
out <- dev.off()
for(i in 1:p$nrow) {
    for(j in 1:p$ncol){
        if (i > 2 && j < 3 || i > 1 && j < 2) {
            # Setup and save each individual PCA plot
            filename <- sprintf('bray_curtis_pcoa_results-BirthYear-%s-%s.png',
                                p[i, j]$labels$x,
                                p[i, j]$labels$y)
            png(filename, width = 6, height = 6, unit='in', res=200)
            sp <- p[i,j] + geom_text_repel() +
                theme(legend.position = 'none',
                      plot.title = element_text(hjust = 0.5),
                      plot.subtitle = element_text(hjust = 0.5)) +
                labs(title = sprintf('%s vs. %s',
                                     p[i, j]$labels$x,
                                     p[i, j]$labels$y),
                     subtitle = 'Colored by BirthYear')
            print(sp)
            out <- dev.off()
        }
    }
}
Image("bray_curtis_pcoa_results-BirthYear.png")
In the alpha plot, colFill and colScale are undefined in the code snippet you provided. Paste the code that generates them.
Yeah, those are defined in the setup cells. The first two cells need to be run, which also means you need some more files.
from pathlib import Path
from copy import deepcopy
import pandas as pd
import rpy2.rinterface
from math import floor
from warnings import filterwarnings
from IPython.display import Image
from random import shuffle
from PIL import Image as PImage
from PIL import ImageDraw, ImageFont
from mmeds.util import load_config
filterwarnings('ignore', category=rpy2.rinterface.RRuntimeWarning)
config = load_config(Path('config_file.yaml'), Path('metadata.tsv'), True)
if 'qiime2' == 'qiime2':
    mdf = pd.read_csv('qiime_mapping_file.tsv', skiprows=[1], sep='\t', dtype={'#SampleID': 'str'})
else:
    mdf = pd.read_csv('qiime_mapping_file.tsv', sep='\t')
mdf.set_index('#SampleID', inplace=True)
metadata_columns = sorted(config['metadata'])
max_colors = 0
max_colors_con = 0
for group_name in metadata_columns:
    if group_name in mdf and not mdf[group_name].isnull().all():
        grouping = mdf[group_name]
        uni = grouping.nunique()
        if config['metadata_continuous'][group_name]:
            if uni > max_colors_con:
                max_colors_con = uni
        else:
            # Get the category with the most colors to use for palette creation
            if uni > max_colors:
                max_colors = uni
all_colors_con = ['con{}'.format(i) for i in range(max_colors_con)]
all_colors = ['color{}'.format(i) for i in range(max_colors)]
%load_ext rpy2.ipython
%%R -i all_colors -i all_colors_con -o allRGB
library(ggplot2)
library(RColorBrewer)
library(GGally)
library(ggrepel)
# Palette for discrete variables
myColors <- brewer.pal(11, "Paired")
colorMaker <- colorRampPalette(myColors)
allColorsDisc <- colorMaker(length(unique(all_colors)))
# Palette for continuous variables
myColorsCon <- brewer.pal(11, "Spectral")
colorMakerCon <- colorRampPalette(myColorsCon)
allColorsCon <- colorMakerCon(length(unique(all_colors_con)))
names(allColorsDisc) <- all_colors
names(allColorsCon) <- all_colors_con
allColors <- append(allColorsDisc, allColorsCon)
colScale <- scale_color_manual(name = ~GroupID, values = allColors)
colFill <- scale_fill_manual(name = ~GroupID, values = allColors)
allRGB <- data.frame(apply(data.frame(allColors), 1, col2rgb))
Too much code, we need to simplify again. Based on that code, generate a color palette that can be directly loaded from a file, so that when we get to the plotting it's only a matter of loading the data + loading the palette.
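One way the "palette loaded from a file" idea could look is sketched below. This is only an illustration under assumptions: the anchor colors, the function names and the palette.json file name are made up for the example, and plain linear RGB interpolation stands in for the RColorBrewer/colorRampPalette step used in the notebook. Only the color{i} and con{i} labels come from the setup cells above.

```python
import json
from pathlib import Path

# Placeholder anchor colors (assumptions, not the notebook's Brewer palettes).
DISCRETE_ANCHORS = ['#1F78B4', '#33A02C', '#E31A1C']
CONTINUOUS_ANCHORS = ['#0039D0', '#FF0000']


def hex_to_rgb(color):
    """Convert '#RRGGBB' into an (r, g, b) tuple of integers."""
    color = color.lstrip('#')
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


def rgb_to_hex(rgb):
    """Convert an (r, g, b) tuple of integers into '#RRGGBB'."""
    return '#{:02X}{:02X}{:02X}'.format(*rgb)


def interpolate(anchors, n):
    """Return n colors spaced evenly along two or more anchor colors."""
    if n <= 0:
        return []
    rgbs = [hex_to_rgb(a) for a in anchors]
    if n == 1:
        return [rgb_to_hex(rgbs[0])]
    colors = []
    for k in range(n):
        # Position of the k-th color along the chain of anchors
        pos = k * (len(rgbs) - 1) / (n - 1)
        lo = min(int(pos), len(rgbs) - 2)
        frac = pos - lo
        colors.append(rgb_to_hex(tuple(
            round(a + (b - a) * frac) for a, b in zip(rgbs[lo], rgbs[lo + 1]))))
    return colors


def build_palette(max_colors, max_colors_con):
    """Map 'color{i}' and 'con{i}' labels (as in the setup cell) to hex colors."""
    labels = ['color{}'.format(i) for i in range(max_colors)]
    labels_con = ['con{}'.format(i) for i in range(max_colors_con)]
    palette = dict(zip(labels, interpolate(DISCRETE_ANCHORS, max_colors)))
    palette.update(zip(labels_con, interpolate(CONTINUOUS_ANCHORS, max_colors_con)))
    return palette


def save_palette(palette, path):
    """Write the palette to a JSON file (hypothetical file layout)."""
    Path(path).write_text(json.dumps(palette, indent=2))


def load_palette(path):
    """Read a palette previously written by save_palette."""
    return json.loads(Path(path).read_text())
```

With something like this, the palette would be written once (for example save_palette(build_palette(max_colors, max_colors_con), 'palette.json')), and a reproduction would only need the data file plus that palette file.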
In the alpha diversity plot, the most likely explanation for the issue is that we are combining discrete and continuous variables. For example, Sex is discrete but Weight is continuous. A single color scheme cannot cover both, because gradient coloring requires something like scale_colour_gradient, which would be applied to the full set of panels, including the discrete ones.
Two possible approaches:
In the beta diversity plot, it is possible to generate continuous coloring by using scale_color_gradient. A simplified version of the code:
Start by creating a continuous variable:
> df$color <- 0
> df$color[df$GroupID=="color1"] <- 1
> df$color[df$GroupID=="color2"] <- 2
> df$color[df$GroupID=="color3"] <- 3
> df$color[df$GroupID=="color4"] <- 4
> df$color[df$GroupID=="color5"] <- 5
> df$color[df$GroupID=="color6"] <- 6
> df$color[df$GroupID=="color7"] <- 7
> df$color[df$GroupID=="color8"] <- 8
> df$color[df$GroupID=="color9"] <- 9
> df$color[df$GroupID=="color10"] <- 10
> df$color[df$GroupID=="color11"] <- 11
> df$color[df$GroupID=="color12"] <- 12
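The twelve assignments above follow one pattern: the numeric suffix of the GroupID label becomes the value. As a rough sketch only, the same mapping could be done on the Python side before the data frame is handed to R. The function name and the default prefix are assumptions made for this example.

```python
import re


def group_to_number(group_id, prefix='color'):
    """Return the numeric suffix of labels such as 'color7'.

    Labels that do not match the prefix map to 0, mirroring the
    df$color <- 0 default in the R snippet. Note that 'color0'
    also maps to 0.
    """
    match = re.fullmatch(re.escape(prefix) + r'(\d+)', group_id)
    return int(match.group(1)) if match else 0


# Hypothetical GroupID values, for illustration only
group_ids = ['color1', 'color12', 'color3', 'unlabeled']
numeric = [group_to_number(g) for g in group_ids]
```

If the data frame is loaded with pandas, something like df['color'] = df['GroupID'].map(group_to_number) would presumably fill the column in one step.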
Then use scale_color_gradient to achieve the effect:
> p <- ggpairs(df[,c(1:3)],
               upper = list(continuous = "points", combo = "box_no_facet"),
               lower = list(continuous = "points", combo = "dot_no_facet"),
               aes(color = df$color, label = rownames(df), alpha=0.5)) +
      theme_bw() +
      theme(legend.position = 'none',
            plot.title = element_text(hjust = 0.5),
            plot.subtitle = element_text(hjust = 0.5)) +
      labs(title = 'PCA plot', subtitle = 'Colored by BirthYear') +
      scale_color_gradient(low="#132B43",high="#56B1F7")
> p
The way to keep coloring consistent across the different plots in the summary is to fix the low and high limits of the color gradient in advance (in this case #132B43 and #56B1F7) and then pass them to R as parameters.
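A minimal sketch of what "fixing the gradient in advance" might look like on the Python side follows. The function names and the dictionary layout are assumptions for illustration. The two default colors are the ones quoted above, and the sample values are invented.

```python
def hex_to_rgb(color):
    """Convert '#RRGGBB' into an (r, g, b) tuple of integers."""
    color = color.lstrip('#')
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


def gradient_spec(values, low='#132B43', high='#56B1F7'):
    """Describe one continuous variable's gradient: end colors plus value range.

    Computing this once and reusing it for every plot keeps a given value
    on the same color everywhere in the summary.
    """
    numeric = [v for v in values if v is not None]
    return {'low': low, 'high': high, 'limits': [min(numeric), max(numeric)]}


def color_at(value, spec):
    """Approximate the color a value would get under the gradient.

    This is a linear mix in RGB; ggplot2 interpolates in Lab space, so
    the result is only a rough preview, not the exact plotted color.
    """
    lo, hi = spec['limits']
    t = 0.0 if hi == lo else (value - lo) / (hi - lo)
    t = min(max(t, 0.0), 1.0)  # clamp values outside the limits
    low_rgb, high_rgb = hex_to_rgb(spec['low']), hex_to_rgb(spec['high'])
    mixed = (round(a + (b - a) * t) for a, b in zip(low_rgb, high_rgb))
    return '#{:02X}{:02X}{:02X}'.format(*mixed)


# Invented BirthYear values, with one missing entry
spec = gradient_spec([1990, 1975, None, 2001])
```

In the notebook, the low and high entries could then be handed to the R cell the same way df is (for example %%R -i low -i high) and used as scale_color_gradient(low = low, high = high). Passing the limits as well should keep the value-to-color mapping identical across plots, though that part is untested here.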
Reproducing plots:
import pandas as pd
df = pd.read_csv('bray_curtis_continuous_BirthYear_df.csv')
%%R -i df
png('bray_curtis_pcoa_results-BirthYear.png', width = 6, height = 6, unit='in', res=200)
p <- ggpairs(df[,c(1:3)],
             legend = 4,
             upper = list(continuous = "points", combo = "box_no_facet"),
             lower = list(continuous = "points", combo = "dot_no_facet"),
             aes(color = df$variable, label = rownames(df), alpha=0.5)) +
    theme_bw() +
    theme(legend.position = 'bottom',
          plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5)) +
    labs(title = 'PCA plot', subtitle = 'Colored by BirthYear') +
    scale_color_gradient(low="#0039D0", high="#FF0000", name="BirthYear", space = "Lab", na.value = "#888888", guide = "colorbar", aesthetics = "color")
print(p)
out <- dev.off()
for(i in 1:p$nrow) {
    for(j in 1:p$ncol){
        if (i > 2 && j < 3 || i > 1 && j < 2) {
            # Setup and save each individual PCA plot
            filename <- sprintf('bray_curtis_pcoa_results-BirthYear-%s-%s.png',
                                p[i, j]$labels$x,
                                p[i, j]$labels$y)
            png(filename, width = 6, height = 6, unit='in', res=200)
            sp <- p[i,j] + geom_text_repel() +
                theme(legend.position = 'none',
                      plot.title = element_text(hjust = 0.5),
                      plot.subtitle = element_text(hjust = 0.5)) +
                labs(title = sprintf('%s vs. %s',
                                     p[i, j]$labels$x,
                                     p[i, j]$labels$y),
                     subtitle = 'Colored by BirthYear') +
                scale_color_gradient(low="#0039D0", high="#FF0000", name = "BirthYear", space = "Lab", na.value = "#888888", guide = "colorbar", aesthetics = "color")
            print(sp)
            out <- dev.off()
        }
    }
}
Image("bray_curtis_pcoa_results-BirthYear.png")
Issue of alpha = 0.5 appearing in the legend: this is the expected behavior, since we are setting two aesthetics within aes: the color gradient (based on df$variable) and alpha fixed at 0.5 (so the points have some transparency). The legend reflects every aesthetic that is explicitly set within aes, so the solution is to remove alpha from the aesthetic mapping and set it as a fixed value outside of it. In our case, we say the upper and lower triangles of plots will show points with fixed alpha=0.5:
p <- ggpairs(df[,c(1:3)],
             legend = 4,
             upper = list(continuous = wrap("points", alpha=0.5)),
             lower = list(continuous = wrap("points", alpha=0.5)),
             aes(color = df$variable, label = rownames(df))) +
    theme_bw() +
    theme(legend.position = 'bottom',
          plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5)) +
    labs(title = 'PCA plot',
         subtitle = 'Colored by BirthYear') +
    scale_color_gradient(low="#0039D0", high="#FF0000", name="BirthYear", space = "Lab", na.value = "#888888", guide = "colorbar", aesthetics = "color")
So that fixes the issue of alpha showing up in the legend. Are there any other issues here?
Yes, one more thing. The colorbar doesn’t appear under the individual PC plots that come after that one, only under the grid. If need be, we could probably live without that, though.
That is because there is again a legend.position = 'none' in the R code for the individual PC plots. If you remove that, you should also get the colorbar under those plots.
Is your feature request related to a problem? Please describe.
With the removal of taxa plots on continuous variables, we may have reached a place where there are no typical runtime errors in the Jupyter notebook. However, the legends for alpha/beta diversity on continuous variables are still pretty much unreadable.
Describe the solution you'd like
Instead of a random color for each individual numerical entry, there needs to be a single color gradient scale for each of the continuous variables.