dcgerard / ldsep

Linkage Disequilibrium Shrinkage Estimation for Polyploids
https://dcgerard.github.io/ldsep/
GNU General Public License v3.0
8 stars 2 forks source link

mldest() with type = "comp" returning all NAs #9

Open etnite opened 1 year ago

etnite commented 1 year ago

Hello @dcgerard, I've been experimenting with using spdep to estimate LD in autotetraploid data (from alfalfa). I seem to be having issues with mldest() when using type = "comp". I have attached an .Rdata file of updog output from a test set of 10 SNPs x 190 samples in an autotetraploid (alfalfa). Generally the expectation is that LD will be quite low here. In addition, this data contains some missing data (up to about 20% per SNP). I have used spdep in the past and haven't run into this sort of issue:

test_data_updog_out.zip

library(updog)
library(ldsep)

## Test data set - 190 samples x 10 SNPs, autotetraploid
load("test_data_updog_out.Rdata")

## Filter to remove SNPs that are close to monomorphic (all SNPs retained)
mout <- filter_snp(x = mout, pmax(Pr_0, Pr_1, Pr_2, Pr_3, Pr_4) <= 0.95)

## LD estimates with mldest(type = "comp") - returns all NA
varnames <- paste0("logL_", 0:4)
lik_array <- format_multidog(x = mout, varname = varnames)
comp_ld <- mldest(geno = lik_array, K = 4, type = "comp")
comp_ld[1:5, 1:10]
  i j         snpi         snpj  D D_se r2 r2_se  r r_se
1 1 2 Chr3_4195646 Chr3_4455597 NA   NA NA    NA NA   NA
2 1 3 Chr3_4195646 Chr3_4941871 NA   NA NA    NA NA   NA
3 1 4 Chr3_4195646 Chr3_8019653 NA   NA NA    NA NA   NA
4 1 5 Chr3_4195646 Chr3_8019658 NA   NA NA    NA NA   NA
5 1 6 Chr3_4195646 Chr3_8019685 NA   NA NA    NA NA   NA

## Setting type = "hap" seems to work (not certain this pop. is in HWE though)
hap_ld <- mldest(geno = lik_array, K = 4, type = "hap")
hap_ld[1:5, 1:10]
  i j         snpi         snpj            D        D_se           r2       r2_se           r       r_se
1 1 2 Chr3_4195646 Chr3_4455597  0.012273494 0.014887155 0.0029975893 0.007267187  0.05475025 0.06636671
2 1 3 Chr3_4195646 Chr3_4941871 -0.010900657 0.005651760 0.0089609512 0.008968134 -0.09466230 0.04736909
3 1 4 Chr3_4195646 Chr3_8019653  0.009070880 0.006339540 0.0061553065 0.008476187  0.07845576 0.05401889
4 1 5 Chr3_4195646 Chr3_8019658 -0.005190879 0.008278918 0.0011571276 0.003685947 -0.03401658 0.05417869
5 1 6 Chr3_4195646 Chr3_8019685 -0.003424743 0.011617499 0.0002320152 0.001574239 -0.01523205 0.05167524

## LD estimates with ldfast() - also works
gp <- format_multidog(x = mout, varname = paste0("Pr_", 0:4))
fast_ld <- ldfast(gp = gp, type = "r2")
fast_ld$ldmat[1:5, 1:5]
            [,1]         [,2]        [,3]         [,4]        [,5]
[1,] 1.000000000 0.0082786964 0.023578043 0.0143961731 0.001651024
[2,] 0.008278696 1.0000000000 0.024813398 0.0009218834 0.003221771
[3,] 0.023578043 0.0248133985 1.000000000 0.0052631221 0.018301113
[4,] 0.014396173 0.0009218834 0.005263122 1.0000000000 0.014373023
[5,] 0.001651024 0.0032217707 0.018301113 0.0143730228 1.000000000

Output of sessionInfo():

> sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3 
LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3;  LAPACK version 3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ldsep_2.1.5 updog_2.1.3

loaded via a namespace (and not attached):
 [1] Matrix_1.5-4.1           gtable_0.3.3             future.apply_1.10.0      dplyr_1.1.2             
 [5] compiler_4.3.0           tidyselect_1.2.0         Rcpp_1.0.10              stringr_1.5.0           
 [9] parallel_4.3.0           assertthat_0.2.1         globals_0.16.2           doRNG_1.8.6             
[13] scales_1.2.1             lattice_0.21-8           plyr_1.8.8               ggplot2_3.4.2           
[17] R6_2.5.1                 generics_0.1.3           mixsqp_0.3-48            iterators_1.0.14        
[21] future_1.32.0            tibble_3.2.1             munsell_0.5.0            pillar_1.9.0            
[25] ggthemes_4.2.4           rlang_1.1.1              utf8_1.2.3               SQUAREM_2021.1          
[29] stringi_1.7.12           invgamma_1.1             truncnorm_1.0-9          cli_3.6.1               
[33] magrittr_2.0.3           digest_0.6.31            foreach_1.5.2            grid_4.3.0              
[37] irlba_2.3.5.1            rstudioapi_0.14          doFuture_1.0.0           lifecycle_1.0.3         
[41] RcppArmadillo_0.12.2.0.0 vctrs_0.6.2              glue_1.6.2               listenv_0.9.0           
[45] codetools_0.2-19         rngtools_1.5.2           abind_1.4-5              parallelly_1.35.0       
[49] fansi_1.0.4              colorspace_2.1-0         ashr_2.2-54              reshape2_1.4.4          
[53] purrr_1.0.1              matrixStats_0.63.0       tools_4.3.0              pkgconfig_2.0.3 
dcgerard commented 1 year ago

Hey @etnite! Thanks for trying out {ldsep}!

I am not sure what’s going on. When I run the same code, I seem to not be getting an NA’s. You already have the latest version of ldsep, so I don’t think that’s the issue. Maybe you could try to update your other packages?

Same Setup:

library(updog)
library(ldsep)
library(corrplot)
load("test_data_updog_out.Rdata")
mout <- filter_snp(x = mout, pmax(Pr_0, Pr_1, Pr_2, Pr_3, Pr_4) <= 0.95)

LD Estimation Seems to Work

type = "comp" seems to work

varnames <- paste0("logL_", 0:4)
lik_array <- format_multidog(x = mout, varname = varnames)
comp_ld <- mldest(geno = lik_array, K = 4, type = "comp", se = FALSE)
comp_ld[1:5, 1:10]
##   i j         snpi         snpj         D D_se        r2 r2_se        r r_se
## 1 1 2 Chr3_4195646 Chr3_4455597  0.021996   NA 0.0036639    NA  0.06053   NA
## 2 1 3 Chr3_4195646 Chr3_4941871 -0.008399   NA 0.0023956    NA -0.04895   NA
## 3 1 4 Chr3_4195646 Chr3_8019653 -0.006490   NA 0.0014088    NA -0.03753   NA
## 4 1 5 Chr3_4195646 Chr3_8019658 -0.011978   NA 0.0022952    NA -0.04791   NA
## 5 1 6 Chr3_4195646 Chr3_8019685 -0.006980   NA 0.0002618    NA -0.01618   NA

None seem to be missing:

mle_ld <- format_lddf(obj = comp_ld, element = "r2")
anyNA(mle_ld[upper.tri(mle_ld, diag = FALSE)])
## [1] FALSE
mle_ld[1:5, 1:5]
##              Chr3_4195646 Chr3_4455597 Chr3_4941871 Chr3_8019653 Chr3_8019658
## Chr3_4195646           NA     0.003664     0.002396    0.0014088     0.002295
## Chr3_4455597           NA           NA     0.013187    0.0008344     0.001087
## Chr3_4941871           NA           NA           NA    0.0028881     0.002796
## Chr3_8019653           NA           NA           NA           NA     0.003126
## Chr3_8019658           NA           NA           NA           NA           NA

LD estimates with ldfast()

gp <- format_multidog(x = mout, varname = paste0("Pr_", 0:4))
fast_ld <- ldfast(gp = gp, type = "r2", shrinkrr = FALSE, se = FALSE)
mom_ld <- fast_ld$ldmat
fast_ld$ldmat[1:5, 1:5]
##          [,1]      [,2]     [,3]      [,4]     [,5]
## [1,] 1.000000 0.0082803 0.023704 0.0144133 0.001651
## [2,] 0.008280 1.0000000 0.024941 0.0009228 0.003222
## [3,] 0.023704 0.0249415 1.000000 0.0052956 0.018396
## [4,] 0.014413 0.0009228 0.005296 1.0000000 0.014388
## [5,] 0.001651 0.0032218 0.018396 0.0143875 1.000000

Compare estimates

plot(mom_ld[upper.tri(mom_ld, diag = FALSE)],
     mle_ld[upper.tri(mle_ld, diag = FALSE)],
     xlim = c(0, 0.4),
     ylim = c(0, 0.4),
     xlab = "mom",
     ylab = "mle")
abline(0, 1, col = 2)

unnamed-chunk-5-1

corrplot(corr = mom_ld,
         method = "color",
         col.lim = c(0, 1), 
         title = "mom", 
         type = "upper", 
         diag = FALSE)

unnamed-chunk-6-1

corrplot(corr = mle_ld,
         method = "color",
         col.lim = c(0, 1), 
         title = "mle", 
         type = "upper",
         diag = FALSE)

unnamed-chunk-6-2

What about the SNPs

I made se = FALSE because I was getting warnings from a singular Hessian. This occurs in the LD calculation between SNPs 1 and 9.

ga <- lik_array[1, , ]
gb <- lik_array[9, , ]
ld19_1 <- ldest(ga = ga, gb = gb, K = 4, type = "comp")
## Warning in sqrt(diag(gradval %*% -solve(oout$hessian) %*% t(gradval))): NaNs
## produced
ld19_2 <- ldest(ga = ga, gb = gb, K = 4, type = "comp", se = FALSE)

Session Info

sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] corrplot_0.92 ldsep_2.1.5   updog_2.1.3  
## 
## loaded via a namespace (and not attached):
##  [1] utf8_1.2.3               future_1.33.0            generics_0.1.3          
##  [4] lpSolve_5.6.18           stringi_1.7.12           listenv_0.9.0           
##  [7] digest_0.6.33            magrittr_2.0.3           evaluate_0.21           
## [10] grid_4.3.1               iterators_1.0.14         fastmap_1.1.1           
## [13] foreach_1.5.2            plyr_1.8.8               ggthemes_4.2.4          
## [16] purrr_1.0.1              fansi_1.0.4              doRNG_1.8.6             
## [19] scales_1.2.1             codetools_0.2-19         cli_3.6.1               
## [22] rlang_1.1.1              parallelly_1.36.0        future.apply_1.11.0     
## [25] munsell_0.5.0            RcppArmadillo_0.12.4.1.0 yaml_2.3.7              
## [28] tools_4.3.1              parallel_4.3.1           reshape2_1.4.4          
## [31] doFuture_1.0.0           dplyr_1.1.2              colorspace_2.1-0        
## [34] ggplot2_3.4.2            rngtools_1.5.2           globals_0.16.2          
## [37] assertthat_0.2.1         vctrs_0.6.3              R6_2.5.1                
## [40] matrixStats_1.0.0        lifecycle_1.0.3          stringr_1.5.0           
## [43] pkgconfig_2.0.3          pillar_1.9.0             gtable_0.3.3            
## [46] glue_1.6.2               Rcpp_1.0.11              xfun_0.39               
## [49] tibble_3.2.1             tidyselect_1.2.0         highr_0.10              
## [52] rstudioapi_0.14          knitr_1.43               htmltools_0.5.5         
## [55] rmarkdown_2.23           compiler_4.3.1
etnite commented 1 year ago

Thank you @dcgerard, I have done some updating but the issue still remains. It's a bit of a mystery for now. I'm now running R 4.3.1, and am able to update all packages to their latest versions (except for curl, though I doubt that would be causing the issue).

The only other thing I noticed is that our versions of Ubuntu are different - 22.04 vs. 18.04. Our LAPACK versions are also different - 3.10.0 vs. 3.7.1. I may be updating this computer to Ubuntu 22.04 soon, so I can try testing again after that transition.