famuvie / breedR

Statistical methods for forest genetic resources analysts
http://famuvie.github.io/breedR/
GNU General Public License v3.0
31 stars 24 forks source link

missing pedigree information and missing phenotypes #96

Closed treebreeder closed 6 years ago

treebreeder commented 6 years ago

I am just wondering how BreedR handles data with mixed pedigree information: for instance, I have data where the mother tree is known in most cases, but in some cases this information is missing. So i coded the pedigree as 0 for both "mum" and "dad". Does breedR automatically remove such data from calculations? I tested two versions for calculation, one with such data and one without, but I get pretty different results. And maybe one more question: is missing data for the phenotypes allowed? I see that the algorithm works, but the same happens as mentioned above: results differ a lot, when I remove the missing data before by myself!

famuvie commented 6 years ago

This question has been addressed in the breedR guide Missing values But I had not compared the results from an equivalent model with observations removed.

Missing data in the pedigree

It is legitimate and allowed to not know the kinship for a subset of individuals.

Encoding those values as NA will model those individuals as if they were founders. This will pull down the estimated genetic variance (and heritability) a bit.

In contrast, removing the observations will yield more consistent results. But on the other hand you won't get BV predictions for those individuals.

The choice is up to you, and might depend on the relative number of observations with missing kinship. See worked example below.


pacman::p_load(
  breedR,
  tidyverse,
  GGally
)
theme_set(theme_bw())

set.seed(1234)
dat <- breedR.sample.phenotype(
  fixed = c(mu = 10, x = 2),
  genetic = list(
    model    = 'add_animal',
    Nparents = c(10, 10),
    sigma2_a = 2,
    check.factorial = FALSE
  ),
  N = 1e3
)

## remove founders from observations
fulldat <- dat[!with(dat, is.na(sire) & is.na(dam)), ]

## randomly remove some kinship data
miss_fraction <- 1/5
missdat <- fulldat
set.seed(1234)
idx <- sample(nrow(fulldat), size = nrow(fulldat)*miss_fraction)
missdat[idx, c("sire", "dam")] <- NA

## remove observations with missing kinship data
rmdat <- fulldat[-idx, ]

## fit a model with all datasets and compare resutls
fitdat <- function(dat) {
  remlf90(
    phenotype ~ 1 + x,
    genetic = list(
      model = 'add_animal',
      pedigree = dat[, 1:3],
      id    = 'self'),
    data = dat
  )
}

res <- lapply(list(full = fulldat, miss = missdat, rm = rmdat), fitdat)
#> Warning in build_pedigree(1:3, data = ped.df): The pedigree has been
#> recoded. Check attr(ped, 'map').

## We need to recover the predicted BV in the right position for the last fit
## map from original to internal codes
map.codes <- attr(get_pedigree(res$rm), "map")
rmBV <- NA[seq.int(nrow(dat))]
rmBV <- ranef(res$rm)$genetic[map.codes]

## Summary of results
lapply(res, summary)
#> $full
#> Formula: phenotype ~ 0 + Intercept + x + pedigree 
#>    Data: dat 
#>   AIC  BIC logLik
#>  3583 3593  -1790
#> 
#> Parameters of special components:
#> 
#> 
#> Variance components:
#>          Estimated variances   S.E.
#> genetic                1.831 0.6368
#> Residual               1.059 0.3311
#> 
#>              Estimate   S.E.
#> Heritability   0.6159 0.1599
#> 
#> Fixed effects:
#>            value   s.e.
#> Intercept 9.6411 0.3157
#> x         1.7808 0.1540
#> 
#> $miss
#> Formula: phenotype ~ 0 + Intercept + x + pedigree 
#>    Data: dat 
#>   AIC  BIC logLik
#>  3647 3657  -1822
#> 
#> Parameters of special components:
#> 
#> 
#> Variance components:
#>          Estimated variances   S.E.
#> genetic                1.536 0.3844
#> Residual               1.213 0.2341
#> 
#>              Estimate   S.E.
#> Heritability   0.5522 0.1086
#> 
#> Fixed effects:
#>            value   s.e.
#> Intercept 9.6585 0.1349
#> x         1.8140 0.1588
#> 
#> $rm
#> Formula: phenotype ~ 0 + Intercept + x + pedigree 
#>    Data: dat 
#>   AIC  BIC logLik
#>  2884 2893  -1440
#> 
#> Parameters of special components:
#> 
#> 
#> Variance components:
#>          Estimated variances   S.E.
#> genetic                1.836 0.6456
#> Residual               1.079 0.3390
#> 
#>              Estimate   S.E.
#> Heritability   0.6123 0.1616
#> 
#> Fixed effects:
#>            value   s.e.
#> Intercept 9.5990 0.3196
#> x         1.8277 0.1735

data.frame(
  true_BV = dat$BV,
  full_BV = as.numeric(ranef(res$full)$genetic),
  miss_BV = as.numeric(ranef(res$miss)$genetic),
  rm_BV = as.numeric(rmBV)
) %>% 
  ggpairs()

Here with 20% of missing kinship the results are very similar. The predicted breeding values for the concerned individuals are considerably different, but no more than the prediction error.

Missing data in the phenotype

It is also legitimate and allowed to have missing phenotypes. Contrary to the previous case, here the results should be identical to a model fitted to the same dataset with the missing observations removed.

library(breedR)
#> Loading required package: sp
N <- 1e3
x <- rep(1:4, each = N/4)
fulldat <- data.frame(y = x + rnorm(N),
                      x = factor(letters[x]))

set.seed(1234)
idx <- sample(nrow(fulldat), size = nrow(fulldat)/5)

## dataset with missing phenotypes
missdat <- fulldat
missdat$y[idx] <- NA
summary(missdat)
#>        y          x      
#>  Min.   :-1.879   a:250  
#>  1st Qu.: 1.487   b:250  
#>  Median : 2.529   c:250  
#>  Mean   : 2.492   d:250  
#>  3rd Qu.: 3.482          
#>  Max.   : 7.451          
#>  NA's   :200

## dataset with observations removed
rmdat <- na.exclude(missdat)

## fit a model with all datasets and compare resutls
fitdat <- function(dat) {
  remlf90(y ~ 1 + x, data = dat)
}

res <- lapply(list(full = fulldat, miss = missdat, rm = rmdat), fitdat)

## Summary of results
lapply(res, summary)
#> $full
#> Formula: y ~ 0 + x 
#>    Data: dat 
#>   AIC  BIC logLik
#>  2864 2869  -1431
#> 
#> 
#> Variance components:
#>          Estimated variances    S.E.
#> Residual               1.014 0.04543
#> 
#> Fixed effects:
#>      value   s.e.
#> x.a 1.0038 0.0637
#> x.b 1.9509 0.0637
#> x.c 2.8769 0.0637
#> x.d 4.0096 0.0637
#> 
#> $miss
#> Formula: y ~ 0 + x 
#>    Data: dat 
#>   AIC  BIC logLik
#>  2653 2658  -1326
#> 
#> 
#> Variance components:
#>          Estimated variances    S.E.
#> Residual               1.005 0.05037
#> 
#> Fixed effects:
#>      value   s.e.
#> x.a 1.0425 0.0707
#> x.b 2.0304 0.0720
#> x.c 2.8682 0.0700
#> x.d 4.0104 0.0709
#> 
#> $rm
#> Formula: y ~ 0 + x 
#>    Data: dat 
#>   AIC  BIC logLik
#>  2285 2290  -1142
#> 
#> 
#> Variance components:
#>          Estimated variances    S.E.
#> Residual               1.005 0.05037
#> 
#> Fixed effects:
#>      value   s.e.
#> x.a 1.0425 0.0707
#> x.b 2.0304 0.0720
#> x.c 2.8682 0.0700
#> x.d 4.0104 0.0709
famuvie commented 6 years ago

Just added a comment in the "Missing values" guide to warn about the underestimation of the genetic variance when the number of missing relationships becomes important.