genetics-statistics / GEMMA

Genome-wide Efficient Mixed Model Association
https://github.com/genetics-statistics/GEMMA
GNU General Public License v3.0
333 stars 125 forks source link

Singular Ve matrix #61

Closed davidhoule closed 6 years ago

davidhoule commented 7 years ago

I have installed the program and can run your example data set without problem. Similarly, I can run the program on small subsets of the SNPs with my phenotypic data. However, when I calculate relatedness based on the whole genome, the estimates of the relatedness matrix calculated in GEMMA appear to cause problems. I have diagnosed the relatedness matrix as the problem by using the full relatedness matrices in the sample example analyses of my data (two traits, three snps) that run with a relatedness matrix calculated from just the three snps.

The error is consistent, in that the program estimates a 0 Ve matrix, then crashes because of a singular matrix error: Start Eigen-Decomposition... REMLE estimate for Vg in the null model: 1.6815 0.1562 1.6745 se(Vg): 0.3784 -nan -nan REMLE estimate for Ve in the null model: 0.0000 0.0000 0.0000 se(Ve): 0.1228 -nan -nan REMLE likelihood = -478.9822 MLE estimate for Vg in the null model: 1.6815 0.1563 1.6745 se(Vg): 0.1758 0.1246 0.1751 MLE estimate for Ve in the null model: 0.0000 0.0000 0.0000 se(Ve): -nan 0.0000 0.0000 MLE likelihood = -140737488355560.3750 gsl: lu.c:262: ERROR: matrix is singular========================100.00%

It may be that the relatedness in my data set is the problem. It is certainly not what you find in human data. I am studying a set of 184 largely inbred lines, the Drosophila Genome Reference Panel. The majority of sites are fixed, but the proportion of heterozygotes is maybe 5% on average, but that varies among lines. In addition, about 5% of the line pairs are more related than second cousins, and a few seem to be full sibs. Bottom line is that genotypes are always very far from Hardy-Weinberg. I have not filtered the genome for high LD SNP pairs for the calculation of relatedness, although I am aware that this will pose problems for the actual genome-wide association analysis with more than a few SNPs.

I would be happy to furnish example data sets that create the problem if that would be helpful.

pcarbo commented 7 years ago

@davidhoule Your problem may be related to Issue #58. What version of GEMMA are you using, and which operating system?

davidhoule commented 7 years ago

I am using Ubuntu 16.04.2. I updated to version GEMMA 0.97, with the same results. Relatedness matrix that causes problems has just one (very near 0) negative eigenvalue. Bending matrix to all positive eigenvalues has no effect on the error. Other toy relatedness matrices with many negative near 0 eigenvalues run without problem.

pcarbo commented 7 years ago

@davidhoule Can you please try downloading the pre-compiled binary for GEMMA v0.96 Linux x86 and see if that works for you? We recently uncovered a bug with linking to GSL 2.x and the bugs looks very similar to the one you reported.

davidhoule commented 7 years ago

Tried the binary you furnished with the same result.

Then did an analysis using a relatedness matrix estimated in smartpca, part of the Eigensoft package. With this matrix, all versions of the program will run on my toy data set. Estimates of Ve are still 0, but tests are produced, and there are no errors:

Reading Files ...
## number of total individuals = 184
## number of analyzed individuals = 184
## number of covariates = 1
## number of phenotypes = 2
## number of total SNPs = 3
## number of analyzed SNPs = 3
Start Eigen-Decomposition...
REMLE estimate for Vg in the null model:
0.8298
0.0682  0.7795
se(Vg):
-nan
0.1064  0.1491
REMLE estimate for Ve in the null model:
0.0000
0.0000  0.0388
se(Ve):
-nan
0.0804  0.1126
REMLE likelihood = -475.4224
MLE estimate for Vg in the null model:
0.8298
0.0685  0.8233
se(Vg):
0.0868
0.0613  0.0861
MLE estimate for Ve in the null model:
0.0000
0.0000  0.0000
se(Ve):
0.0000
0.0000  0.0000
MLE likelihood = 70368744177431.5234
Reading SNPs  ==================================================100.00%

I assume that the two nan SE are still indicating a problem?

pcarbo commented 7 years ago

@davidhoule Great progress. It sounds like your siutation is related to Issue #45. In some cases it appears that a different method for computing the relatedness matrix yields more or less numerically stable results, although it is hard to say exactly what the root cause is. I believe that the computation requires tha all the eigenvalues to be positive. It might be useful to check this before running the LMM analysis in GEMMA with the -eigen option.

It is useful to look at output file result.log.txt rather than the console output because it gives you the numbers in more precision.

Did you try the univariate LMM analysis? Do you get this problem in the univariate LMM analysis?

davidhoule commented 7 years ago

Univariate analyses do not seem to have the problem. The actual log.txt file does show very slightly positive estimates of Ve terms, and of all VE SE, except the one nan.

I suspect that there must be an 'effective 0' tolerance, as I have examples that run with many very slightly negative eigenvalues.

Thanks for your help with this issue.

pjotrp commented 7 years ago

Possibly this is fixed by https://github.com/genetics-statistics/GEMMA/pull/69. Please try the next release. If it does not resolve send me a dataset.

pjotrp commented 7 years ago

@davidhoule can you send me the toy dataset?

pjotrp commented 7 years ago

No response

pjotrp commented 6 years ago

We have a dataset now for testing that shows similar characteristics. Running -lmm 2 renders

##
## GEMMA Version    = 0.98-pre1 (2018/06/29)
## GSL Version      = 2.4
## Eigen Version    = 3.2.9
## OpenBlas         = OpenBLAS 0.2.19  - NO_LAPACK NO_LAPACKE DYNAMIC_ARCH NO_AFFINITY Haswell
##   arch           = Haswell
##   threads        = 8
##   parallel type  = threaded
##
## Command Line Input = ./bin/gemma -bfile ./frontiers -lmm 2 -k ./GWA_2017.centered.cXX.txt -maf 0.05 -outdir ./skull/ -snps selected_200_SNPs.txt -debug -legacy -no-check 
##
## Date = Fri Jun 29 13:38:52 2018
##
## Summary Statistics:
## number of total individuals = 49
## number of analyzed individuals = 49
## number of covariates = 1
## number of phenotypes = 1
## number of total SNPs/var = 73527
## number of analyzed SNPs/var = 200
## REMLE log-likelihood in the null model = 135.494
## MLE log-likelihood in the null model = 140.231
## pve estimate in the null model = 0.958692
## se(pve) in the null model = 0.037907
## vg estimate in the null model = 0.000473816
## ve estimate in the null model = 1.46539e-05
## beta estimate in the null model =   0.00169357
## se(beta) =   0.000546862
##
## Computation Time:
## total computation time = 0.0139996 min 
## computation time break down: 
##      time on eigen-decomposition = 0.00010455 min 
##      time on calculating UtX = 1.55333e-05 min 
##      time on optimization = 0.00404858 min 
##

while

~/.guix-profile/bin/gemma -bfile ./frontiers -lmm 1 -n 1 2 3 4 5 6 7 8 9 10 -k ./GWA_2017.centered.cXX.txt -maf 0.3 -outdir ./skull/ -snps selected_200_SNPs.txt -debug

renders

MLE estimate for Vg in the null model: 
0.0004
-0.0001 0.0002
-0.0000 -0.0000 0.0002
0.0000  0.0001  0.0000  0.0002
0.0000  0.0000  -0.0000 -0.0000 0.0001
-0.0000 0.0000  -0.0000 -0.0000 0.0000  0.0002
-0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 0.0001
-0.0000 0.0000  -0.0000 0.0000  -0.0000 0.0000  0.0001  0.0001
-0.0000 0.0000  0.0000  0.0000  0.0000  0.0000  -0.0000 0.0000  0.0001
-0.0000 -0.0000 0.0000  -0.0000 -0.0000 -0.0000 -0.0000 0.0000  0.0000  0.0001
se(Vg): 
0.0001
0.0001  0.0001
0.0001  0.0001  0.0001
0.0001  0.0000  0.0001  0.0001
0.0000  0.0001  -nan    0.0000  0.0001
0.0000  0.0000  0.0001  0.0000  0.0000  0.0000
0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
0.0000  0.0000  0.0001  0.0000  0.0000  0.0000  0.0000  0.0000
0.0000  0.0000  0.0001  0.0000  0.0001  0.0000  0.0000  0.0000  0.0000
0.0000  0.0000  -nan    0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  -nan
MLE estimate for Ve in the null model: 
0.0000
0.0000  0.0001
-0.0000 0.0000  0.0001
-0.0000 -0.0000 -0.0000 0.0000
0.0000  -0.0000 0.0000  0.0000  0.0001
-0.0000 0.0000  0.0000  -0.0000 -0.0000 0.0000
0.0000  0.0000  -0.0000 0.0000  0.0000  0.0000  0.0000
0.0000  0.0000  -0.0000 -0.0000 0.0000  -0.0000 -0.0000 0.0000
0.0000  -0.0000 -0.0000 0.0000  -0.0000 0.0000  -0.0000 -0.0000 0.0000
0.0000  0.0000  -0.0000 0.0000  0.0000  -0.0000 0.0000  0.0000  -0.0000 0.0000
se(Ve): 
0.0000
0.0000  0.0000
0.0000  0.0000  0.0000
0.0000  0.0000  -nan    -nan
0.0000  0.0000  -nan    -nan    0.0000
0.0000  0.0000  0.0000  -nan    -nan    0.0000
0.0000  0.0000  -nan    -nan    0.0000  -nan    -nan
-nan    -nan    -nan    -nan    0.0000  -nan    0.0000  -nan
0.0000  0.0000  -nan    -nan    -nan    -nan    0.0000  -nan    -nan
0.0000  0.0000  -nan    -nan    -nan    0.0000  0.0000  -nan    0.0000  0.0000
MLE likelihood = -9060217478156.0625
GSL ERROR: matrix is singular in lu.c at line 266 errno 1

When I run with less phenotypes we get

 ~/.guix-profile/bin/gemma -bfile ./frontiers -lmm 1 -n 1 4 6   -k ./GWA_2017.centered.cXX.txt -maf 0.1 -outdir ./skull/ -snps selected_200_SNPs.txt -debug

##
## GEMMA Version    = 0.97 (2017/12/19)
## GSL Version      = 2.4
## Eigen Version    = 3.3.4
## OpenBlas         = OpenBLAS 0.3.0.dev  - NO_AFFINITY HASWELL
##   arch           = HASWELL
##   threads        = 8
##   parallel type  = threaded
##
## Command Line Input = /home/wrk/.guix-profile/bin/gemma -bfile ./frontiers -lmm 1 -n 1 4 6 -k ./GWA_2017.centered.cXX.txt -maf 0.1 -outdir ./skull/ -snps selected_200_SNPs.txt -debug 
##
## Date = Fri Jun 29 13:42:48 2018
##
## Summary Statistics:
## number of total individuals = 49
## number of analyzed individuals = 49
## number of covariates = 1
## number of phenotypes = 3
## number of total SNPs/var = 73527
## number of analyzed SNPs/var = 182
## REMLE log-likelihood in the null model = 447.058
## MLE log-likelihood in the null model = -1.78495e+09
## REMLE estimate for Vg in the null model: 
0.000441807
-7.55708e-06    0.000265048
1.5968e-05      -1.5201e-05     0.000153374
## se(Vg): 
0.000111597
6.46807e-05     6.10626e-05
5.13028e-05     2.98818e-05     5.03876e-05
## REMLE estimate for Ve in the null model: 
2.14138e-05
-3.21985e-06    1.68684e-06
-1.21203e-05    -1.80521e-06    1.82497e-05
## se(Ve): 
1.72819e-05
5.62131e-06     2.21151e-06
1.07007e-05     -nan    1.17464e-05
## MLE estimate for Vg in the null model: 
0.000455099     -1.55256e-05    1.40252e-05
-1.55256e-05    0.000269695     -1.75401e-05
1.40252e-05     -1.75401e-05    0.00016208
## se(Vg): 
0.000104231
5.57158e-05     5.63383e-05
4.66896e-05     3.62062e-05     4.24585e-05
## MLE estimate for Ve in the null model: 
1.62625e-05     -1.99967e-06    -1.05415e-05
-1.99967e-06    1.05748e-06     -1.24257e-06
-1.05415e-05    -1.24257e-06    1.47746e-05
## se(Ve): 
2.94059e-23
5.38615e-24     5.59049e-24
4.19386e-24     4.11504e-24     -nan
## estimate for B (d by c) in the null model (columns correspond to the covariates provided in the file): 
0.00169357
0.000552855
-0.000453467
## se(B): 
0.000661073
0.000185541
0.000610281
##
## Computation Time:
## total computation time = 0.0532221 min 
## computation time break down: 
##      time on eigen-decomposition = 7.95333e-05 min 
##      time on calculating UtX = 1.60833e-05 min 
##      time on optimization = 0.0286565 min 
##
pjotrp commented 6 years ago

Commit 70f419673d5d3e49a3eada70c70c2d284b502d7b shows that in all cases eigen decomposition throws FP exceptions when enabling hardware checking. This follows #161.

pjotrp commented 6 years ago

The problem is in calcPab which generates NaN in some cases. Not completely clear what it is yet, but I think it has to do with uninitialized W. Related to #94

pjotrp commented 6 years ago

I fixed the NaN issue in se(Ve) with above commit. Also for the multiple phenotypes we get an error where it matters:

./bin/gemma -bfile ./frontiers -lmm 1 -n 1 2 3 4 5 6 7 8 9 10 -k ./GWA_2017.centered.cXX.txt -maf 0.3 -outdir ./skull/ -snps selected_200_SNPs.txt -no-check
GEMMA 0.98-beta1 (2018-09-06) by Xiang Zhou and team (C) 2012-2018
Reading Files ... 
## number of total individuals = 49
## number of analyzed individuals = 49
## number of covariates = 1
## number of phenotypes = 10
## number of total SNPs/var        =    73527
## number of considered SNPS       =      200
## number of analyzed SNPs         =       82
Start Eigen-Decomposition...
**** WARNING: Matrix G has 3 eigenvalues close to zero in src/lapack.cpp at line 280 in EigenDecomp_Zeroed
**** WARNING: Matrix G has more than one negative eigenvalues! in src/lapack.cpp at line 284 in EigenDecomp_Zeroed
**** WARNING: Matrix G has 3 eigenvalues close to zero in src/lapack.cpp at line 280 in EigenDecomp_Zeroed
**** WARNING: Matrix G has more than one negative eigenvalues! in src/lapack.cpp at line 284 in EigenDecomp_Zeroed
ERROR: Enforce failed for Trying to take the log of -0.138747 in src/mathfunc.cpp at line 108 in safe_log

It may be possible to solve this, but it is a different issue.

hans-recknagel commented 5 years ago

I am running into a similar error message when running the multivariate linear model. Running the same phenotypes individually using a LMM does not result in any errors. Only difference is that I have excluded Individuals that have any incomplete phenotype rows in the multivariate analysis. I am running the latest version of gemma on a macOS Sierra.

gemma -bfile males_dots_imputed_mlm -k centered_relationship_matrix.cXX.txt -lmm 1 -n 1 2 -o mlm_male_dots GEMMA 0.98 (2018-09-28) by Xiang Zhou and team (C) 2012-2018 Reading Files ...

number of total individuals = 171

number of analyzed individuals = 171

number of covariates = 1

number of phenotypes = 2

number of total SNPs/var = 83655

number of analyzed SNPs = 83305

Start Eigen-Decomposition... REMLE estimate for Vg in the null model: 0.0101
1.3313 568.4998
se(Vg): 0.0047
0.9901 341.0558
REMLE estimate for Ve in the null model: 0.0009
0.0298 1.0678
se(Ve): 0.0012
0.2447 86.1451 REMLE likelihood = -404.9952 MLE estimate for Vg in the null model: 0.0136
1.4578 573.3371
se(Vg): 0.0015
0.2417 62.1871 MLE estimate for Ve in the null model: 0.0000
0.0000 0.0000
se(Ve): 0.0000
0.0000 0.0000
MLE likelihood = -198158383604301984.0000 GSL ERROR: matrix is singular in lu.c at line 262 errno 1

pjotrp commented 5 years ago

@hans-recknagel do you mind sharing the dataset with me so I can replicate the problem?

hans-recknagel commented 5 years ago

Yes, I've sent them to you by mail. Thanks.

pjotrp commented 5 years ago

Looks to me like the covariates are highly correlated. mvlmm does not like that.

pjotrp commented 5 years ago

See also #175

hans-recknagel commented 5 years ago

Ok, so correlated variables are not appropriate to use then? I thought because these two different variables capture the phenotype better than just one, but describe pretty much the same phenotype, it would be best to analyse them together.

pjotrp commented 5 years ago

You should not use correlated covariates/phenotypes. There is no benefit anyway. In #175 we'll write a validator which will emit the error.

hans-recknagel commented 5 years ago

Thanks for your response. I was under the impression that the mvlmm is actually for correlated phenotypes (this is what is says in the the paper by Zhou and Stevens 2014). The benefit in my particular case would be that in both phenotypic measures there is a measurement error, but by using the two different measures, the accuracy of getting the "true" value of the overall phenotype (and detecting its genetic basis) increases. This would decrease the chances of finding spurious genetic associations resulting from measurement errors.

pjotrp commented 5 years ago

This issue is closed. Please use the mailing list for discussion. You may get help there.