SimonSchlumbohm / HarmonizR

11 stars 3 forks source link

LFQ missing values question and error plotting #4

Closed HeldLab closed 1 year ago

HeldLab commented 1 year ago

Hi-

I'm new to harmonizR, but giving it a test. A question and a (perhaps) related issue. Thanks in advance! -Jason

1) I have LFQ data and have already imputed 0 for missing values prior to Harmonizer. Is that how you recommend handling data, or should I leave missing LFQ values empty?

2) When I run my data (imputed as above) through HarmonizR I'm getting this plotting error. I've attached my input files. Any thoughts?

[1] "Reading the files..." [1] "Preparing..." [1] "Splitting the data using ComBat adjustment..." Found 5 genes with uniform expression within a single batch (all zeros); these will not be adjusted for batch.

Found2batches

Adjusting for0covariate(s) or covariate level(s)

Found91935Missing Data Values

Standardizing Data across genes

Fitting L/S model and finding priors

Finding parametric adjustments

Adjusting the Data

[1] "Rebuilding..." [1] "Visualizing sample means..."

Warning message in min(x): “no non-missing arguments to min; returning Inf” Warning message in max(x): “no non-missing arguments to max; returning -Inf” Warning message in min(x): “no non-missing arguments to min; returning Inf” Warning message in max(x): “no non-missing arguments to max; returning -Inf”

Error in plot.window(xlim = xlim, ylim = ylim, log = log, yaxs = pars$yaxs): need finite 'ylim' values Traceback:

  1. harmonizR("dfMerged.txt", "dfMerged_description.csv", algorithm = "ComBat", . ComBat_mode = 1, plot = "samplemeans", output_file = "result_file")
  2. boxplot(original, main = "Original", las = 2, ylim = lmts)
  3. boxplot.default(original, main = "Original", las = 2, ylim = lmts)
  4. do.call(bxp, c(list(z, notch = notch, width = width, varwidth = varwidth, . log = log, border = border, pars = pars, outline = outline, . horizontal = horizontal, add = add, ann = ann, at = at), . args[namedargs]), quote = TRUE)
  5. (function (z, notch = FALSE, width = NULL, varwidth = FALSE, . outline = TRUE, notch.frac = 0.5, log = "", border = par("fg"), . pars = NULL, frame.plot = axes, horizontal = FALSE, ann = TRUE, . add = FALSE, at = NULL, show.names = NULL, ...) . { . pars <- as.list(pars) . if (...length()) { . nmsA <- names(args <- list(...)) . if (anyDuplicated(nmsA)) { . iD <- duplicated(nmsA) . warning(sprintf(ngettext(sum(iD), "Duplicated argument %s is disregarded", . "Duplicated arguments %s are disregarded"), sub("^list\((.)\)", . "\1", deparse(args[iD]))), domain = NA) . nmsA <- names(args <- args[!iD]) . } . pars[nmsA] <- args . } . bplt <- function(x, wid, stats, out, conf, notch, xlog, i) { . ok <- TRUE . if (!anyNA(stats)) { . xP <- if (xlog) . function(x, w) x exp(w) . else function(x, w) x + w . wid <- wid/2 . if (notch) { . ok <- stats[2L] <= conf[1L] && conf[2L] <= stats[4L] . xx <- xP(x, wid c(-1, 1, 1, notch.frac, 1, . 1, -1, -1, -notch.frac, -1)) . yy <- c(stats[c(2, 2)], conf[1L], stats[3L], . conf[2L], stats[c(4, 4)], conf[2L], stats[3L], . conf[1L]) . } . else { . xx <- xP(x, wid c(-1, 1, 1, -1)) . yy <- stats[c(2, 2, 4, 4)] . } . if (!notch) . notch.frac <- 1 . wntch <- notch.frac wid . xypolygon(xx, yy, lty = "blank", col = boxfill[i]) . xysegments(xP(x, -wntch), stats[3L], xP(x, +wntch), . stats[3L], lty = medlty[i], lwd = medlwd[i], . col = medcol[i], lend = 1) . xypoints(x, stats[3L], pch = medpch[i], cex = medcex[i], . col = medcol[i], bg = medbg[i]) . xysegments(rep.int(x, 2), stats[c(1, 5)], rep.int(x, . 2), stats[c(2, 4)], lty = whisklty[i], lwd = whisklwd[i], . col = whiskcol[i]) . xysegments(rep.int(xP(x, -wid staplewex[i]), 2), . stats[c(1, 5)], rep.int(xP(x, +wid staplewex[i]), . 2), stats[c(1, 5)], lty = staplelty[i], lwd = staplelwd[i], . col = staplecol[i]) . xypolygon(xx, yy, lty = boxlty[i], lwd = boxlwd[i], . border = boxcol[i]) . if ((nout <- length(out))) { . xysegments(rep(x - wid outwex, nout), out, . rep(x + wid outwex, nout), out, lty = outlty[i], . lwd = outlwd[i], col = outcol[i]) . xypoints(rep.int(x, nout), out, pch = outpch[i], . lwd = outlwd[i], cex = outcex[i], col = outcol[i], . bg = outbg[i]) . } . if (any(inf <- !is.finite(out))) { . warning(sprintf(ngettext(length(unique(out[inf])), . "Outlier (%s) in boxplot %d is not drawn", . "Outliers (%s) in boxplot %d are not drawn"), . paste(unique(out[inf]), collapse = ", "), i), . domain = NA) . } . } . return(ok) . } . if (!is.list(z) || 0L == (n <- length(z$n))) . stop("invalid first argument") . if (is.null(at)) . at <- 1L:n . else if (length(at) != n) . stop(gettextf("'at' must have same length as 'z$n', i.e. %d", . n), domain = NA) . if (is.null(z$out)) . z$out <- numeric() . if (is.null(z$group) || !outline) . z$group <- integer() . if (is.null(pars$ylim)) . ylim <- range(z$stats[is.finite(z$stats)], if (outline) z$out[is.finite(z$out)], . if (notch) z$conf[is.finite(z$conf)]) . else { . ylim <- pars$ylim . pars$ylim <- NULL . } . if (length(border) == 0L) . border <- par("fg") . dev.hold() . on.exit(dev.flush()) . if (!add) { . if (is.null(pars$xlim)) . xlim <- range(at, finite = TRUE) + c(-0.5, 0.5) . else { . xlim <- pars$xlim . pars$xlim <- NULL . } . plot.new() . if (horizontal) . plot.window(ylim = xlim, xlim = ylim, log = log, . xaxs = pars$yaxs) . else plot.window(xlim = xlim, ylim = ylim, log = log, . yaxs = pars$yaxs) . } . xlog <- (par("ylog") && horizontal) || (par("xlog") && !horizontal) . pcycle <- function(p, def1, def2 = NULL) rep(if (length(p)) p else if (length(def1)) def1 else def2, . length.out = n) . p <- function(sym) pars[[sym, exact = TRUE]] . boxlty <- pcycle(pars$boxlty, p("lty"), par("lty")) . boxlwd <- pcycle(pars$boxlwd, p("lwd"), par("lwd")) . boxcol <- pcycle(pars$boxcol, border) . boxfill <- pcycle(pars$boxfill, par("bg")) . boxwex <- pcycle(pars$boxwex, 0.8 { . if (n <= 1) . 1 . else stats::quantile(diff(sort(if (xlog) . log(at) . else at)), 0.1) . }) . medlty <- pcycle(pars$medlty, p("lty"), par("lty")) . medlwd <- pcycle(pars$medlwd, 3 p("lwd"), 3 par("lwd")) . medpch <- pcycle(pars$medpch, NAinteger) . medcex <- pcycle(pars$medcex, p("cex"), par("cex")) . medcol <- pcycle(pars$medcol, border) . medbg <- pcycle(pars$medbg, p("bg"), par("bg")) . whisklty <- pcycle(pars$whisklty, p("lty"), "dashed") . whisklwd <- pcycle(pars$whisklwd, p("lwd"), par("lwd")) . whiskcol <- pcycle(pars$whiskcol, border) . staplelty <- pcycle(pars$staplelty, p("lty"), par("lty")) . staplelwd <- pcycle(pars$staplelwd, p("lwd"), par("lwd")) . staplecol <- pcycle(pars$staplecol, border) . staplewex <- pcycle(pars$staplewex, 0.5) . outlty <- pcycle(pars$outlty, "blank") . outlwd <- pcycle(pars$outlwd, p("lwd"), par("lwd")) . outpch <- pcycle(pars$outpch, p("pch"), par("pch")) . outcex <- pcycle(pars$outcex, p("cex"), par("cex")) . outcol <- pcycle(pars$outcol, border) . outbg <- pcycle(pars$outbg, p("bg"), par("bg")) . outwex <- pcycle(pars$outwex, 0.5) . width <- if (!is.null(width)) { . if (length(width) != n || anyNA(width) || any(width <= . 0)) . stop("invalid boxplot widths") . boxwex width/max(width) . } . else if (varwidth) . boxwex sqrt(z$n/max(z$n)) . else if (n == 1) . 0.5 * boxwex . else rep.int(boxwex, n) . if (horizontal) { . xypoints <- function(x, y, ...) points(y, x, ...) . xypolygon <- function(x, y, ...) polygon(y, x, ...) . xysegments <- function(x0, y0, x1, y1, ...) segments(y0, . x0, y1, x1, ...) . } . else { . xypoints <- points . xypolygon <- polygon . xysegments <- segments . } . ok <- TRUE . for (i in 1L:n) ok <- ok & bplt(at[i], wid = width[i], stats = z$stats[, . i], out = z$out[z$group == i], conf = z$conf[, i], notch = notch, . xlog = xlog, i = i) . if (!ok) . warning("some notches went outside hinges ('box'): maybe set notch=FALSE") . axes <- is.null(pars$axes) . if (!axes) { . axes <- pars$axes . pars$axes <- NULL . } . if (axes) { . ax.pars <- pars[names(pars) %in% c("xaxt", "yaxt", "xaxp", . "yaxp", "gap.axis", "las", "cex.axis", "col.axis", . "format")] . if (is.null(show.names)) . show.names <- n > 1 . if (show.names) . do.call("axis", c(list(side = 1 + horizontal, at = at, . labels = z$names), ax.pars), quote = TRUE) . do.call("Axis", c(list(x = z$stats, side = 2 - horizontal), . ax.pars), quote = TRUE) . } . if (ann) . do.call(title, pars[names(pars) %in% c("main", "cex.main", . "col.main", "sub", "cex.sub", "col.sub", "xlab", . "ylab", "cex.lab", "col.lab")], quote = TRUE) . if (frame.plot) . box() . invisible(at) . })(base::quote(list(stats = structure(c(Inf, Inf, Inf, Inf, -Inf, . Inf, Inf, Inf, Inf, -Inf), dim = c(5L, 2L)), n = c(90, 153), . conf = structure(c(NaN, NaN, NaN, NaN), dim = c(2L, 2L)), . out = c(Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, . Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, . Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, . Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, . Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, . Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, . Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, . Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, . Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, . Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, . Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, . Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, . Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, . Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, . Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, . Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, . Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, . Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, . Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, . Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, . Inf, Inf, Inf, Inf, Inf), group = c(1, 1, 1, 1, 1, 1, 1, . 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, . 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, . 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, . 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, . 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, . 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, . 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, . 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, . 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, . 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, . 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, . 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, . 2, 2, 2, 2, 2, 2, 2, 2), names = c("Batch_1", "Batch_2"))), . notch = base::quote(FALSE), width = base::quote(NULL), varwidth = base::quote(FALSE), . log = base::quote(""), border = base::quote("black"), pars = base::quote(list( . boxwex = 0.8, staplewex = 0.5, outwex = 0.5, boxfill = "lightgray")), . outline = base::quote(TRUE), horizontal = base::quote(FALSE), . add = base::quote(FALSE), ann = base::quote(TRUE), at = base::quote(NULL), . main = base::quote dfMerged.txt ("Original"), las = base::quote(2), ylim = base::quote(c(Inf, . Inf)))
  6. plot.window(xlim = xlim, ylim = ylim, log = log, yaxs = pars$yaxs)

dfMerged.txt dfMerged_description.csv

SimonSchlumbohm commented 1 year ago

Dear HeldLab,

your issue might arise from the input data not undergoing any prior normalization. Usually, ComBat / limma - and therefore the HarmonizR algorithm - expect some kind of normalized data (for example using the log() function in R on the data prior).

Note: Since using the natural logarithm of zero turns to -inf in R, all zeroes must be turned to NA prior to this. Therefore, you do not have to set them to zero in the first place. The HarmonizR does expect NAs. Zeroes will not be treated as missing, but as a valid, measured expression of a numerical 0. This is due to there being a difference between technical and biological zeroes for example in single-cell RNA-seq data. I would have included the normalization steps in the HarmonizR already, but I can never be sure about the specific user input given :)

So basically, to fix the issue:

This is the plot I get when I applied the steps to your data: HeldLab.pdf

The plot Error occured, since HarmonizR tries to reverse a log-normalization on the data (just for the plotting!). So it calculated 2^dataframe for a dataframe with already high values, which led to inf values, which could not be plotted.

I hope I could help with the issue. If it persists, please feel free to contact me again!

Best regards

Simon

HeldLab commented 1 year ago

Thanks! Got everything working, and I'm very impressed with clustering/normalization. But, another question. What if I have metadata (rows) in the results I want to Harmonize (see attached, rows 2-6)? Is this dealwithable (I'm getting an error, pasted below)? I tend to work in python and then hop over to R for Harmonizer, and then hop back over to python. There are solutions to this problem, but I wanted to see if there were approaches to deal with metadata rows in Harmonizer more seamlessly.

Error: Error in read.table(data_source, sep = "\t", header = TRUE, row.names = 1, : duplicate 'row.names' are not allowed Traceback:

  1. harmonizR("dfHarmonizer.txt", "dfMerged_description.csv")
  2. read_main_data(data_as_input)
  3. read.table(data_source, sep = "\t", header = TRUE, row.names = 1, . check.names = FALSE, comment.char = "", stringsAsFactors = FALSE)
  4. stop("duplicate 'row.names' are not allowed")

dfHarmonizer.txt

SimonSchlumbohm commented 1 year ago

Unfortunately, there curretly is no functionality implemented to account for metadata. HarmonizR simply checks the rows for numerical value presence and treats them all as Features (proteins/genes/etc.). Other data types get casted to numeric if possible and turn to NaN elsewise.

(The error occurs, since the metadata row.names appear to be empty strings (as far as I saw). Usually all Protein/Gene entries differ in name or identifier (their row.names). Here, the underlying adjustment gets duplicate row.names.)

Best regards

Simon