genetics-statistics / GEMMA

Genome-wide Efficient Mixed Model Association
https://github.com/genetics-statistics/GEMMA
GNU General Public License v3.0
316 stars 122 forks source link

Multivariate GEMMA hangs forever #243

Open biona001 opened 3 years ago

biona001 commented 3 years ago

On a simulated dataset with 1000 samples, 10000 SNPs, and 2 traits, multivariate GEMMA ran for over an hour. Is this reasonable?

My commands are:

# generate relatedness matrix
gemma -bfile multivariate_2traits -gk 1 -o multivariate

# run gemma
gemma -bfile multivariate_2traits -k output/multivariate.cXX.txt -maf 0.0000001 -lmm 4 -n 1 2 -o gemma.polygenic.result

Output: (the 100% shows up almost immediately, but hangs forever and never displays the "done").

GEMMA 0.98.4 (2021-01-29) by Xiang Zhou and team (C) 2012-2021
Reading Files ...
## number of total individuals = 1000
## number of analyzed individuals = 1000
## number of covariates = 1
## number of phenotypes = 2
## number of total SNPs/var        =    10000
## number of analyzed SNPs         =    10000
Start Eigen-Decomposition...
REMLE estimate for Vg in the null model:
23.5963
6.2719  23.8002
se(Vg):
4.4394
2.7440  5.4686
REMLE estimate for Ve in the null model:
2.6355
0.0002  0.0002
se(Ve):
1.3386
0.8088  1.6734
REMLE likelihood = -4988.2498
MLE estimate for Vg in the null model:
25.3766
6.2899  23.8008
se(Vg):
5.8054
0.8981  1.0649
MLE estimate for Ve in the null model:
2.0757
0.0011  0.0000
se(Ve):
1.7524
0.0009  0.0000
MLE likelihood = -4981.3187
================================================== 100%

If I analyze the traits separately, the analysis finishes in ~5 seconds. I assume this is a bug but I'm not sure.

For reproducibility, my test PLINK files are zipped and attached.

multivariate.zip

biona001 commented 3 years ago

So I left the program running overnight, and it finished after 176 minutes. I am still suspicious it took so long with such small data, but I suppose this is not a bug. I will ask on the Google group instead.

FWIW, the output is

## REMLE estimate for Vg in the null model:
23.5963
6.27189 23.8002
## se(Vg):
4.43943
2.74395 5.46859
## REMLE estimate for Ve in the null model:
2.63552
0.00017173      0.000237989
## se(Ve):
1.33862
0.808768        1.67335
## MLE estimate for Vg in the null model:
25.3766 6.28991
6.28991 23.8008
## se(Vg):
5.80539
0.898146        1.06494
## MLE estimate for Ve in the null model:
2.07569 0.00105176
0.00105176      5.43669e-07
## se(Ve):
1.75241
0.000879422     4.27974e-07
## estimate for B (d by c) in the null model (columns correspond to the covariates provided in the file):
0.506032
1.00318
## se(B):
0.0513373
0.000487841
##
## Computation Time:
## total computation time = 176.405 min
## computation time break down:
##      time on eigen-decomposition = 0.0822538 min
##      time on calculating UtX = 0.0785359 min
##      time on optimization = 176.135 min
##
biona001 commented 3 years ago

I did 100 more simulations, and 25/100 simulations ran for over 1000 seconds. In contrast, the other 75 simulations usually run in 30-60 seconds.

I am reopening this issue, but feel free to close it.

pjotrp commented 2 years ago

Thank you for your data file. Not sure why timings are so divergent. Much of the time is spent in:

#4  0x0000000000480f62 in MphEM (func_name=func_name@entry=82 'R', max_iter=1000, max_prec=0.001,
    eval=eval@entry=0x5233c0, X=X@entry=0x540930, Y=Y@entry=0x5408d0, U_hat=U_hat@entry=0x53e380,
    E_hat=0x53e3e0, OmegaU=0x53e440, OmegaE=0x53e4a0, UltVehiY=0x53e500, UltVehiBX=0x53e560,
    UltVehiU=0x53e5c0, UltVehiE=0x53e620, V_g=0x540990, V_e=0x540a20, B=0x540ab0) at src/mvlmm.cpp:704

We need to look into optimizing the mvlmm code - there should be ample opportunity.

biona001 commented 2 years ago

Sorry I forgot about this issue. FWIT, I think the main problem might be how I did the simulation. I believe it was something like

b ~ N(0, 1) # for k position of b only, others are 0
yi ~ N(xi^T * b, 1)

where yi is length 2 vector of phenotype, xi is length 10000 genotype vector (entries 0, 1 or 2), and b is length 10000 effect size vector but only k position of b is drawn from standard Normal. This might be a poor simulation model for GEMMA. Later I changed the simulation model and the divergence problem went away.

pjotrp commented 2 years ago

It is good to have edge cases. I'll see what I can do with it :)

pjotrp commented 2 years ago

Tracking any work here

=> https://github.com/genenetwork/gn-gemtext-threads/blob/main/issues/gemma/multivariate_gemma_hangs-issue243.gmi