genetics-statistics / GEMMA

Genome-wide Efficient Mixed Model Association
https://github.com/genetics-statistics/GEMMA
GNU General Public License v3.0

Issues when working with large genotype files #127

Open pcarbo opened 6 years ago

pcarbo commented 6 years ago

@pjotrp I'm testing GEMMA on a large GWAS data set with 11,950 samples and 528,969 SNPs. This testing was motivated by issue #120, but I think it deserves its own issue, so I'm posting it separately. For now I will just record my observations, and we can decide whether they are worth investigating.

$ ~/git/gemma/bin/gemma -bfile celiac -gk 1 -n 1
GEMMA 0.97 (2017/12/20) by Xiang Zhou and team (C) 2012-2017
Reading Files ...
## number of total individuals = 11950
## number of analyzed individuals = 11950
## number of covariates = 1
## number of phenotypes = 1
## number of total SNPs/var        =   528969
## number of analyzed SNPs         =   527831
Calculating Relatedness Matrix ...
================================================== 100%
$ ~/git/gemma/bin/gemma -bfile celiac -n 1 -bslmm 1
GEMMA 0.97 (2017/12/20) by Xiang Zhou and team (C) 2012-2017
Reading Files ...
## number of total individuals = 11950
## number of analyzed individuals = 11950
## number of covariates = 1
## number of phenotypes = 1
## number of total SNPs/var        =   528969
## number of analyzed SNPs         =   527831
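
The BSLMM run produces no further output past this point. For reference, the -gk 1 step above computes the centered relatedness matrix K = W * W^T / p, where W is the n x p genotype matrix with each SNP column mean-centered (per the GEMMA manual). A minimal C++ sketch of that computation, with illustrative names and none of GEMMA's streaming or BLAS plumbing:

#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Centered relatedness matrix: K = W * W^T / p, where W is the
// column-centered genotype matrix (n individuals x p SNPs, dosages 0/1/2).
Matrix centered_grm(const Matrix &X) {
  const std::size_t n = X.size(), p = X[0].size();
  Matrix W = X;
  for (std::size_t j = 0; j < p; ++j) {   // center each SNP column
    double mean = 0.0;
    for (std::size_t i = 0; i < n; ++i) mean += W[i][j];
    mean /= static_cast<double>(n);
    for (std::size_t i = 0; i < n; ++i) W[i][j] -= mean;
  }
  Matrix K(n, std::vector<double>(n, 0.0));
  for (std::size_t i = 0; i < n; ++i)     // K is symmetric: fill both halves
    for (std::size_t k = i; k < n; ++k) {
      double s = 0.0;
      for (std::size_t j = 0; j < p; ++j) s += W[i][j] * W[k][j];
      K[i][k] = K[k][i] = s / static_cast<double>(p);
    }
  return K;
}

GEMMA's own implementation goes through BLAS (hence the OpenBLAS discussion below), so the loops are exposition only, but they make the n^2 * p cost of the -gk step explicit.
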
pjotrp commented 6 years ago

If the computation of K is fast, it is likely the final transpose that is slow, not the writing to disk.

As for the second problem, interrupt the process under a debugger and you can see where it is busy. My work hardly touched BSLMM, but it may be OpenBLAS that is playing up.

Both points ought to be confirmed in gdb. I'll try and give you an example today.
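
Pending that, a minimal sketch of what such a session might look like (assuming the busy gemma process is still running; the pgrep call and process name are illustrative):

$ gdb -p $(pgrep -n gemma)   # attach to the most recently started gemma process
(gdb) thread apply all bt    # the backtraces show where each thread is busy
(gdb) detach
(gdb) quit

To rule out OpenBLAS threading as the culprit, one could also rerun with OPENBLAS_NUM_THREADS=1 set in the environment and compare timings.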

Annefeng commented 6 years ago

Hi, I encountered the same problem when running BSLMM, and came across your posts in the Google group and here. In case it helps to investigate the issue: my data contain ~8,000 individuals and ~500K SNPs. Estimating the GRM took almost a whole day (yes, 24 hours), and the BSLMM run stopped at "Calculating UtX..." with a "4309 Segmentation fault (core dumped)" error. The program also stalled for a while after the "Reading Files" step.
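
For scale, a back-of-envelope on those dimensions (the 8,000 x 500,000 shape is from the comment above; the rest is illustrative, not a claim about GEMMA's internals):

#include <cstdio>

int main() {
  const double n = 8000.0;    // individuals (from the comment above)
  const double p = 500000.0;  // SNPs (from the comment above)
  // One multiply-add per (i, k, j) triple for the full n x n product W * W^T:
  std::printf("GRM multiply-adds ~ %.1e\n", n * n * p);              // ~3.2e13
  std::printf("K (n x n doubles)   ~ %.1f GB\n", n * n * 8 / 1e9);   // ~0.5
  std::printf("UtX (n x p doubles) ~ %.1f GB\n", n * p * 8 / 1e9);   // ~32.0
  return 0;
}

Even at a modest 10 GFLOP/s the GRM arithmetic amounts to a couple of hours, and a threaded BLAS brings it down to minutes, so a 24-hour run points at something other than raw arithmetic. And if the full U^T X product were ever held in RAM as doubles, that ~32 GB alone could explain a crash on a typical node.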

I've been trying different prediction methods, and only a very few run efficiently on data of this size. Given ever-growing sample sizes and marker counts, it would be great if BSLMM could be made more scalable for large GWAS prediction. Thanks!

pjotrp commented 6 years ago

Related to #173.