famuvie / breedR

Statistical methods for forest genetic resources analysts
http://famuvie.github.io/breedR/
GNU General Public License v3.0
31 stars 24 forks source link

Genomic prediction issues #109

Closed GregorDall closed 2 years ago

GregorDall commented 2 years ago

Dear Facundo,

we are trying to implement GBLUP / rrBLUP using breedR but seem to have stumbled upon an issue. In some cases breedR produces unreasonable results and also takes comparably longer to produce an output. Here is an example comparing the results obained by rrBLUP and breedR. Is there any way you can explain this behaviour or to avoid it? I have also tried using "em", but the results are the same

Cheers Gregor

IMAGE

` library(breedR) library(rrBLUP)

result_all <- NULL

for(i in 1:100){

cat(paste("Run",i,"at",date(),"\n"))

random population of 200 lines with 1000 markers

M <- matrix(rep(0,200*1000),200,1000) for (i in 1:200) { M[i,] <- ifelse(runif(1000)<0.5,-1,1) }

random phenotypes

u <- rnorm(1000) g <- as.vector(crossprod(t(M),u)) h2 <- 0.5 #heritability y <- g + rnorm(200,mean=0,sd=sqrt((1-h2)/h2*var(g)))

predict breeding values with rrBLUP

G <- A.mat(M) ans <- mixed.solve(y,K=G) accuracy <- cor(g,ans$u)

predict breeding values with breedR

model <- remlf90(fixed = y ~ 1, generic = list(gen = list(diag(200),G)), method = 'ai', data = as.data.frame(y)) accuracy_b <- cor(g,model$ranef$gen[[1]]$value)

result <- data.frame(accuracy,accuracy_b) result_all <- rbind(result_all,result)

} summary(result_all)#expectation r~sqrt(0.5)~0.7071068 length(which(result_all$accuracy>result_all$accuracy_b)) length(which(result_all$accuracy<0.5)) length(which(result_all$accuracy_b<0.5))

par(mfrow=c(1,3)) hist(result_all$accuracy,main="rrBLUP",xlim=c(-0.25,1),xlab="prediction accuracy") hist(result_all$accuracy_b,main="breedR",xlim=c(-0.25,1),xlab="prediction accuracy") plot(result_all$accuracy,result_all$accuracy_b,xlim=c(-0.25,1),ylim=c(-0.25,1), xlab="rrBLUP",ylab="breedR") abline(a=0,b=1,col="red") `

famuvie commented 2 years ago

Dear Gregor,

Thanks for your interesting question, and for your clearly stated reproducible example!

You compare results of accuracy of predicted genetic values from breedR and rrBLUP out of simulated data and show that for about 23 % of the cases, the results from breedR collapse down to about 0, whereas the results from rrBLUP remain around the expected value.

I think the issue boils down to numerical instability, where rrBLUP stands more robustly thanks to (I'm guessing) its built-in shrinkage capabilities that prevent variance estimates to collapse. However, you are measuring accuracy via the correlation between predicted and true genetic values, which is not necessarily a good measure of predictive performance. As I show below, breedR and rrBLUP provide very similar predictions, even in the cases where accuracies are very different.

To demonstrate this, let me examine more in detail the results of one of those badly behaving cases:

# Packages ----------------------------------------------------------------
library(breedR)
#> Loading required package: sp
library(rrBLUP)

# Functions ---------------------------------------------------------------
draw_M <- function(rows, cols) {
  values <- 1 - 2 * (runif(rows * cols) < 0.5)
  return(matrix(values, rows, cols))
}

simulate_data <- function(n_obs, n_marks, h2) {
  M <- draw_M(n_obs, n_marks)
  u <- rnorm(n_marks)
  g <- as.vector(M %*% u)
  y <- g + rnorm(n_obs, mean = 0, sd = sqrt((1 - h2) / h2 * var(g)))

  structure(
    data.frame(g, y),
    G = A.mat(M)
  )  
}

# A badly behaved case ----------------------------------------------------
set.seed(20220403)

baddat <- simulate_data(n_obs = 200, n_marks = 1e3, h2 = 0.5)

ans <- mixed.solve(baddat$y, K = attr(baddat, 'G'))
(accuracy <- cor(baddat$g, ans$u))
#> [1] 0.7073303

#predict breeding values with breedR
model <- remlf90(
  fixed = y ~ 1,
  generic = list(gen = list(diag(nrow(baddat)), attr(baddat, 'G'))), 
  method = 'ai',
  data = baddat
)
#> Using default initial variances given by default_initial_variance()
#> See ?breedR.getOption.
(accuracy_b <- cor(baddat$g, model$ranef$gen[[1]]$value))
#> [1] -0.04447799

preds <- baddat |> 
  transform(
    pred_rr = ans$u,
    pred_breedR = model$ranef$gen[[1]]$value
  )

pairs(data.matrix(preds), xlim = c(-150, 150), ylim = c(-150, 150))

## The variance of the genetic effect was estimated at essentially 0,
## collapsing all estimated genetic values at 0 (except for 1 that deviates a
## little bit, for some numerical artefact or idiosyncrasy).

summary(model)
#> Formula: y ~ 0 + Intercept 
#>    Data: baddat 
#>   AIC  BIC logLik
#>  2038 2045  -1017
#> 
#> Parameters of special components:
#> 
#> 
#> Variance components:
#>          Estimated variances      S.E.
#> gen                1.299e-04 1.306e-05
#> Residual           1.720e+03 1.720e+02
#> 
#> Fixed effects:
#>             value s.e.
#> Intercept -3.4838 2.94

Created on 2022-04-04 by the reprex package (v2.0.1)

As you see, the accuracies from breedR and rrBLUP are dramatically different (near 0 and 70 % respectively). Yet, the predicted genetic values are essentially the same (see bottom-right panels of the pairs-plot, where predictions from both programs are basically collapsed over the same spot, when compared to the variation of the true genetic values).

breedR's estimate of the variance has collapsed to zero, and all the predicted values are essentially 0, except for one slightly different value, which honestly surprised me, but must have something to do with the idiosyncrasies of the underlying algorithm. This is responsible of the violent bad accuracy (= correlation). But, as you can see, it is simply an artefact of the very bad-conditioned data, which rrBLUP handles certainly more elegantly.

famuvie commented 2 years ago

To prove that the issue has to do with numerical-instabilities, here is the same exercise, but using 1000 observations rather than only 200:

# Packages ----------------------------------------------------------------
library(breedR)
#> Loading required package: sp
library(rrBLUP)

# Functions ---------------------------------------------------------------
draw_M <- function(rows, cols) {
  values <- 1 - 2 * (runif(rows * cols) < 0.5)
  return(matrix(values, rows, cols))
}

simulate_data <- function(n_obs, n_marks, h2) {
  M <- draw_M(n_obs, n_marks)
  u <- rnorm(n_marks)
  g <- as.vector(M %*% u)
  y <- g + rnorm(n_obs, mean = 0, sd = sqrt((1 - h2) / h2 * var(g)))

  structure(
    data.frame(g, y),
    G = A.mat(M)
  )  
}

# More informative data ---------------------------------------------------

n_sim = 1e2
n_obsx = 1e3
accuraciesx <- data.frame(rrblup = numeric(n_sim), breedr = numeric(n_sim))

for(i in seq.int(n_sim)){

  cat(paste("Run", i, "at", date(), "\n"))

  dat <- simulate_data(n_obs = n_obsx, n_marks = 1e3, h2 = 0.5)

  ans <- mixed.solve(dat$y, K = attr(dat, 'G'))
  accuracy <- cor(dat$g, ans$u)

  #predict breeding values with breedR
  model <- remlf90(
    fixed = y ~ 1,
    generic = list(gen = list(diag(nrow(dat)), attr(dat, 'G'))), 
    method = 'ai',
    data = dat
  )
  accuracy_b <- cor(dat$g, model$ranef$gen[[1]]$value)

  accuraciesx[i, ] <- c(accuracy,accuracy_b)

}
#> Run 1 at Mon Apr  4 17:00:25 2022
#> [...]
#> Run 100 at Mon Apr  4 17:33:47 2022
#> Using default initial variances given by default_initial_variance()
#> See ?breedR.getOption.

plot(accuraciesx)
abline(0, 1)

Created on 2022-04-04 by the reprex package (v2.0.1)

As you can see, the accuracies are consistent from both programs when data are sufficiently informative.

famuvie commented 2 years ago

In conclusion, breedR can be sensitive to numerical instabilities one variances are very small or data are badly-conditioned. There is not much work around that, other than getting more data, simplifying the model, or using other algorithms/software that handle these situations specifically (e.g. ridge-regression, shrinkage priors in a Bayesian setting). I hope this helps.

GregorDall commented 2 years ago

Dear facundo,

thank you for your efforts and your quick response. So essentially you are saying that breedR or the underlying suite of programs are not equipped to perform "ridge regression" but "ordinary least squares regression" which is known to be unstable in the presence of many correlated predictors.

That solves the case for me, and tells me that I can not use breedR in this case (Genomic prediction with p > n). However, it remains a valuable tool in other areas (phenotypic analysis in plant variety evaluation), so thank you again for providing it to the public.

I wish you all the best. Gregor

P.S. that example is from a colleague of mine, Thanks Sebastian!

famuvie commented 2 years ago

breedR does not do "ridge regression" but REML (which is not OLS either, but anyway). This means that it has no safety belt for extreme badly-conditioned matrices.

Yet, it doesn't mean that it cannot be used for genomic prediction with p > n. Indeed, I've shown in the first example above that the predicted genetic values are essentially the same as those from rrBLUP. So, I'd say this rather confirms that it is a rather safe choice, even with p > n.

An estimate of a variance that is practically 0 is a red flag for numerical issues and you need to proceed with caution. It's just something to keep in mind when interpreting results or diagnosing issues.

Hope it helps. ƒacu.-

GregorDall commented 2 years ago

I'll proceed with caution then. Maybe checking the variance estimates is good practice in any case. Of yourse it's not OLS for a mixed model. Thanks for the reminder 😅

Am 04.04.2022 in 20:53, Facundo Muñoz @.***> schrieb:

breedR does not do "ridge regression" but REML (which is not OLS either, but anyway). This means that it has no safety belt for extreme badly-conditioned matrices.

Yet, it doesn't mean that it cannot be used for genomic prediction with p > n. Indeed, I've shown in the first example above that the predicted genetic values are essentially the same as those from rrBLUP. So, I'd say this rather confirms that it is a rather safe choice, even with p > n.

An estimate of a variance that is practically 0 is a red flag for numerical issues and you need to proceed with caution. It's just something to keep in mind when interpreting results or diagnosing issues.

Hope it helps. ƒacu.-

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>