genetics-statistics / faster_lmm_d

A faster lmm for GWAS. Supports GPU backend.
GNU General Public License v3.0

BGEN support #37

Open prasunanand opened 6 years ago

prasunanand commented 6 years ago

In GEMMA, BGEN support was added in a PR.

However, that code has no tests to validate it, and I would need such tests in order to port it to faster_lmm_d.

I need to test BGEN files with a 500k sample. I believe this would be a great exercise to test GPU support.

PS: This thread tracks the implementation of BGEN file support.

prasunanand commented 6 years ago

Might be helpful for future reference.

The following table compares the features of several genotype file formats:

Formats compared (table columns): PLINK binary, GEN, BGEN v1.1, BGEN v1.2 / v1.3, VCF, BCF

Features compared (table rows; cell values not shown):
Supports unphased genotype calls (*)
Supports unphased genotype probabilities
Supports NULL/outlier probability (e.g. NULL class from CHIAMO / GenoSNP)
Supports non-diploid samples
Supports phased data?
Supports multi-allelic variants
Efficient representation?

* Hard-called genotypes are converted to probabilities in GEN and BGEN v1.1.
† By convention, males on the X chromosome are stored as homozygote females in GEN and BGEN v1.1.
‡ At the time of writing, the storage of genotype likelihoods and probabilities for non-diploid samples and/or phased data in VCF/BCF is not fully specified.
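The two conventions in the notes above (hard calls stored as probabilities, and males on the X chromosome stored as homozygote females) can be illustrated with a minimal Python sketch. The function names, the use of `None` for a missing call, and the all-zero triple for missing data are assumptions made for illustration; they are not taken from GEMMA or faster_lmm_d.

```python
from typing import Optional, Tuple

# (P(AA), P(AB), P(BB)) for one diploid sample at one biallelic variant.
Probs = Tuple[float, float, float]


def hard_call_to_probs(b_allele_count: Optional[int]) -> Probs:
    """Turn a hard call (number of B alleles: 0, 1 or 2) into a probability triple.

    Assumption: a missing call (None) is written as an all-zero triple.
    """
    if b_allele_count is None:
        return (0.0, 0.0, 0.0)
    if b_allele_count not in (0, 1, 2):
        raise ValueError(f"expected 0, 1, 2 or None, got {b_allele_count!r}")
    probs = [0.0, 0.0, 0.0]
    probs[b_allele_count] = 1.0  # all probability mass on the called genotype
    return (probs[0], probs[1], probs[2])


def male_x_to_probs(carries_b_allele: bool) -> Probs:
    """Store a hemizygous male X genotype as the matching homozygote (AA or BB)."""
    return hard_call_to_probs(2 if carries_b_allele else 0)


if __name__ == "__main__":
    for call in (0, 1, 2, None):
        print(call, hard_call_to_probs(call))
    print("male X, B allele:", male_x_to_probs(True))
```

A consequence of the second convention is that a reader of the file cannot tell a hemizygous male from a homozygous female without separate sex information.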

Found this on http://www.well.ox.ac.uk/~gav/bgen_format/

pjotrp commented 6 years ago

It also matters how quickly a file format can be streamed for parallel processing. Binary formats typically do no better than compressed textual data here, so I see a binary format as premature optimization ;).

I suspect that for GEMMA we will end up with our own R/qtl2-based format and convert from one of the above.

Computing probabilities is something we like to control. Also, it is not a great idea for GEMMA to support multiple formats, for maintenance reasons. One format is enough. Conversion will be fast, so we can pipe it in.
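As a rough illustration of the "convert and pipe it in" idea, here is a minimal Python sketch of a streaming converter. It assumes GEN-style text input with five leading columns (SNP ID, rsID, position, allele A, allele B) followed by one probability triple per sample. The tab-separated dosage output and the `NA` marker for all-zero triples are hypothetical choices, not GEMMA's or R/qtl2's actual format.

```python
import io
from typing import Iterator, TextIO


def triple_to_dosage(p_aa: float, p_ab: float, p_bb: float) -> str:
    """Expected count of allele B, or "NA" when the triple is all zero (missing)."""
    if p_aa + p_ab + p_bb == 0.0:
        return "NA"
    return f"{p_ab + 2.0 * p_bb:.3f}"


def gen_to_dosage(lines: TextIO) -> Iterator[str]:
    """Convert GEN-style lines to dosage rows, one variant at a time.

    Only the current line is held in memory, so the input can be a pipe.
    """
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) < 5:
            raise ValueError("expected at least 5 leading columns")
        _snp_id, rs_id, _pos, allele_a, allele_b = fields[:5]
        probs = [float(x) for x in fields[5:]]
        if len(probs) % 3 != 0:
            raise ValueError(f"{rs_id}: probabilities are not in triples")
        dosages = [
            triple_to_dosage(probs[i], probs[i + 1], probs[i + 2])
            for i in range(0, len(probs), 3)
        ]
        yield "\t".join([rs_id, allele_a, allele_b] + dosages)


# Small in-memory demo; the second sample of rs2 is missing (all zeros).
sample = io.StringIO(
    "snp1 rs1 1000 A G 1 0 0 0 1 0 0 0 1\n"
    "snp2 rs2 2000 C T 0.5 0.5 0 0 0 0\n"
)
rows = list(gen_to_dosage(sample))
for row in rows:
    print(row)
```

Passing `sys.stdin` instead of the `StringIO` object would let the same generator sit in a shell pipeline behind a decompressor or another converter, which keeps the probability handling in one place that the project controls.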