When plotting multiple traits in a single Manhattan plot, overplotting is unavoidable. CMplot dampens its effect by plotting in chunks, randomly sampling 1000 points from each trait at a time. However, the first chunk includes all traits, and a trait is removed from the sampling (and thus plotting) list once it has been 'depleted' (i.e. all points for that trait have been plotted). Thus, if one trait has many more (non-NA) points than the other(s), it can dominate the last chunk(s) of points plotted, resulting in a visible overplotting 'bias' towards that trait.
This PR addresses this issue by inverting the plotting (or rather, sampling) order: 'larger' traits, i.e. those with more non-NA points, are preferred in the initial chunk(s) of points to plot until equal numbers of points remain to be plotted for each trait. This way, the 'extra' points accumulate in the background instead of the foreground, removing the visible 'bias' caused by the overplotting.
Here is an example:
# Get latest development version of `CMplot` and `pig60K` example data.
library(CMplot)
source("https://raw.githubusercontent.com/YinLiLin/CMplot/fe3b0ed0130bac60d61cb23aaec778435c8d1bce/R/CMplot.r")
data(pig60K)
# Create multi-track Manhattan plot with default parameters (for reference).
set.seed(42)
CMplot(pig60K, plot.type="m", multracks=TRUE, file.output=FALSE)
# Randomly drop p-values for most points in two out of three traits.
set.seed(42)
pig60Kmod <- pig60K
n <- nrow(pig60Kmod)
na1 <- sample(1:n, as.integer(.8 * n))
na3 <- sample(1:n, as.integer(.95 * n))
pig60Kmod$trait1[na1] <- NA
pig60Kmod$trait3[na3] <- NA
# Observe 'larger' trait visually dominate the plot.
set.seed(42)
CMplot(pig60Kmod, plot.type="m", multracks=TRUE, file.output=FALSE)
# Repeat the same plot with reversed sampling order as suggested in
# this PR.
source("https://raw.githubusercontent.com/YinLiLin/CMplot/a62c829fea8c4d74b609fcefb3dd8a73895ade26/R/CMplot.r")
set.seed(42)
CMplot(pig60Kmod, plot.type="m", multracks=TRUE, file.output=FALSE)
# Repeat plot with unmodified example data to rule out any unwanted
# side-effects of the reversed sampling.
set.seed(42)
CMplot(pig60K, plot.type="m", multracks=TRUE, file.output=FALSE)
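To make the effect of the sampling order easier to reason about, here is a toy model of the chunk schedule described above. It is only a sketch and not the code in `CMplot.r`: the function `draw_schedule`, its arguments, and the trait sizes are hypothetical, and it assumes a fixed chunk size of 1000 points per trait per round.

```r
# Toy model of the chunked draw order; NOT CMplot's actual implementation.
# `draw_schedule`, `n_points`, `chunk` and `larger_first` are hypothetical names.
draw_schedule <- function(n_points, chunk = 1000, larger_first = FALSE) {
  rounds <- ceiling(n_points / chunk)  # number of rounds each trait needs
  total <- max(rounds)                 # number of rounds overall
  # Rows are plotting rounds (row 1 is drawn first, i.e. ends up in the
  # background); columns are traits; cells are points drawn in that round.
  sched <- matrix(0, nrow = total, ncol = length(n_points),
                  dimnames = list(NULL, names(n_points)))
  for (j in seq_along(n_points)) {
    if (rounds[j] == 0) next
    sizes <- rep(chunk, rounds[j])
    # The last round of a trait only holds whatever is left over.
    sizes[rounds[j]] <- n_points[j] - chunk * (rounds[j] - 1)
    active <- if (larger_first) {
      seq(total - rounds[j] + 1, total)  # smaller traits join late
    } else {
      seq_len(rounds[j])                 # all traits start in round 1
    }
    sched[active, j] <- sizes
  }
  sched
}

n_points <- c(trait1 = 2500, trait2 = 5000, trait3 = 500)
original <- draw_schedule(n_points)
reversed <- draw_schedule(n_points, larger_first = TRUE)
print(original[nrow(original), ])  # points drawn last (foreground), old order
print(reversed[nrow(reversed), ])  # points drawn last (foreground), new order
```

In this toy setup, the final round of the original order contains points from `trait2` only, whereas the final round of the reversed order contains points from all three traits. The surplus points of `trait2` are drawn in the first rounds instead and therefore end up in the background.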