genetics-statistics / GEMMA

Genome-wide Efficient Mixed Model Association
https://github.com/genetics-statistics/GEMMA
GNU General Public License v3.0

Issues when working with large genotype files #127

Open pcarbo opened 6 years ago

pcarbo commented 6 years ago

@pjotrp I'm testing GEMMA on a large GWAS data set with 11,950 samples and 528,969 SNPs. This testing was motivated by issue #120, but I think it deserves its own issue, so I'm posting it separately. For now I will just record my observations, and we can decide whether they are worth investigating.

$ ~/git/gemma/bin/gemma -bfile celiac -gk 1 -n 1
GEMMA 0.97 (2017/12/20) by Xiang Zhou and team (C) 2012-2017
Reading Files ...
## number of total individuals = 11950
## number of analyzed individuals = 11950
## number of covariates = 1
## number of phenotypes = 1
## number of total SNPs/var        =   528969
## number of analyzed SNPs         =   527831
Calculating Relatedness Matrix ...
================================================== 100%
$ ~/git/gemma/bin/gemma -bfile celiac -n 1 -bslmm 1
GEMMA 0.97 (2017/12/20) by Xiang Zhou and team (C) 2012-2017
Reading Files ...
## number of total individuals = 11950
## number of analyzed individuals = 11950
## number of covariates = 1
## number of phenotypes = 1
## number of total SNPs/var        =   528969
## number of analyzed SNPs         =   527831
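
The BSLMM run produces no further output past this point. For reference, the -gk 1 step above computes the centered relatedness matrix K = W * W^T / p, where W is the n x p genotype matrix with each SNP column mean-centered (per the GEMMA manual). A minimal C++ sketch of that computation, with illustrative names and none of GEMMA's streaming or BLAS plumbing:

#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Centered relatedness matrix: K = W * W^T / p, where W is the
// column-centered genotype matrix (n individuals x p SNPs, dosages 0/1/2).
Matrix centered_grm(const Matrix &X) {
  const std::size_t n = X.size(), p = X[0].size();
  Matrix W = X;
  for (std::size_t j = 0; j < p; ++j) {   // center each SNP column
    double mean = 0.0;
    for (std::size_t i = 0; i < n; ++i) mean += W[i][j];
    mean /= static_cast<double>(n);
    for (std::size_t i = 0; i < n; ++i) W[i][j] -= mean;
  }
  Matrix K(n, std::vector<double>(n, 0.0));
  for (std::size_t i = 0; i < n; ++i)     // K is symmetric: fill both halves
    for (std::size_t k = i; k < n; ++k) {
      double s = 0.0;
      for (std::size_t j = 0; j < p; ++j) s += W[i][j] * W[k][j];
      K[i][k] = K[k][i] = s / static_cast<double>(p);
    }
  return K;
}

GEMMA's own implementation goes through BLAS (hence the OpenBLAS discussion below), so the loops are exposition only, but they make the n^2 * p cost of the -gk step explicit.
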
pjotrp commented 6 years ago

If the computation of K is fast, it is likely the final transpose that is slow, not the writing to disk.

As for the second problem, interrupt the process under a debugger and you can see where it is busy. My work hardly touched BSLMM, but it may be OpenBLAS that is playing up.

Both points ought to be confirmed in gdb. I'll try and give you an example today.
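
Pending that, a minimal sketch of what such a session might look like (assuming the busy gemma process is still running; the pgrep call and process name are illustrative):

$ gdb -p $(pgrep -n gemma)   # attach to the most recently started gemma process
(gdb) thread apply all bt    # the backtraces show where each thread is busy
(gdb) detach
(gdb) quit

To rule out OpenBLAS threading as the culprit, one could also rerun with OPENBLAS_NUM_THREADS=1 set in the environment and compare timings.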

Annefeng commented 6 years ago

Hi, I encountered the same problem when running BSLMM, and came across your posts in the Google group and here. In case it helps to investigate the issue: my data contain ~8,000 individuals and ~500K SNPs. Estimating the GRM took almost a whole day (yes, 24 hours), and the BSLMM run stopped at "Calculating UtX..." with a "4309 Segmentation fault (core dumped)" error. The program also stalled for a while after the "Reading Files" step.
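
For scale, a back-of-envelope on those dimensions (the 8,000 x 500,000 shape is from the comment above; the rest is illustrative, not a claim about GEMMA's internals):

#include <cstdio>

int main() {
  const double n = 8000.0;    // individuals (from the comment above)
  const double p = 500000.0;  // SNPs (from the comment above)
  // One multiply-add per (i, k, j) triple for the full n x n product W * W^T:
  std::printf("GRM multiply-adds ~ %.1e\n", n * n * p);              // ~3.2e13
  std::printf("K (n x n doubles)   ~ %.1f GB\n", n * n * 8 / 1e9);   // ~0.5
  std::printf("UtX (n x p doubles) ~ %.1f GB\n", n * p * 8 / 1e9);   // ~32.0
  return 0;
}

Even at a modest 10 GFLOP/s the GRM arithmetic amounts to a couple of hours, and a threaded BLAS brings it down to minutes, so a 24-hour run points at something other than raw arithmetic. And if the full U^T X product were ever held in RAM as doubles, that ~32 GB alone could explain a crash on a typical node.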

I've been trying different prediction methods, and only a very few run efficiently on data of this size. Given ever-growing sample sizes and marker counts, it would be great if BSLMM could be made more scalable for large GWAS prediction. Thanks!

pjotrp commented 6 years ago

Related to #173.