BGEN support - Githubissues

prasunanand commented 6 years ago

In GEMMA, BGEN support was added in a PR.

However, there are no tests that validate that code, and I need some before I can port it to faster_lmm_d.

I need to test with BGEN files containing around 500k samples. I believe this would also be a great exercise for testing GPU support.
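As a starting point for such tests, a reader could at least check that a file's header reports the expected number of samples and variants. The sketch below follows my reading of the published BGEN specification (a 4-byte offset, then a header block with its length, variant count, sample count, magic bytes, optional free data and a flags word, all little-endian). It is not taken from GEMMA or faster_lmm_d, and `read_bgen_header` and `make_header` are hypothetical helper names.

```python
import struct

def read_bgen_header(data: bytes) -> dict:
    """Parse the fixed fields at the start of a BGEN file (little-endian)."""
    if len(data) < 24:
        raise ValueError("too short to contain a BGEN header")
    # Bytes 0-3: offset of the first variant block, relative to byte 4.
    # Bytes 4-15: header block length, number of variants, number of samples.
    offset, header_len, n_variants, n_samples = struct.unpack_from("<4I", data, 0)
    magic = data[16:20]
    if magic not in (b"bgen", b"\x00\x00\x00\x00"):
        raise ValueError(f"unexpected magic bytes: {magic!r}")
    if header_len < 20 or len(data) < 4 + header_len:
        raise ValueError("truncated or inconsistent header block")
    # The flags word occupies the last 4 bytes of the header block.
    (flags,) = struct.unpack_from("<I", data, 4 + header_len - 4)
    return {
        "offset": offset,
        "n_variants": n_variants,
        "n_samples": n_samples,
        "compression": flags & 0b11,      # 0 = none, 1 = zlib, 2 = zstd
        "layout": (flags >> 2) & 0b1111,  # 1 = v1.1 layout, 2 = v1.2+ layout
        "has_sample_ids": bool(flags >> 31),
    }

def make_header(n_variants: int, n_samples: int, layout: int = 2,
                compression: int = 1) -> bytes:
    """Build a synthetic header with no free data, for use in tests only."""
    header_len = 20            # 4 + 4 + 4 + 4 (magic) + 4 (flags)
    offset = header_len        # no sample identifier block follows
    flags = compression | (layout << 2)
    return (struct.pack("<4I", offset, header_len, n_variants, n_samples)
            + b"bgen" + struct.pack("<I", flags))
```

With a real file, the same function could be applied to the first few hundred bytes, for example `read_bgen_header(open(path, "rb").read(1024))`, and the sample count compared against the accompanying sample file.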

PS: This thread tracks the implementation of BGEN file support.

prasunanand commented 6 years ago

Might be helpful for future reference.

The following table compares the features of the various genotype file formats:

| Feature | PLINK binary | GEN | BGEN v1.1 | BGEN v1.2 / v1.3 | VCF | BCF |
|---|---|---|---|---|---|---|
| Supports unphased genotype calls | ✓ | ✓* | ✓* | ✓ | ✓ | ✓ |
| Supports unphased genotype probabilities | | ✓ | ✓ | ✓ | ✓ | ✓ |
| Supports NULL/outlier probability, e.g. NULL class from CHIAMO / GenoSNP | | ✓ | ✓ | | ✓ | ✓ |
| Supports non-diploid samples | | † | † | ✓ | ✓‡ | ✓‡ |
| Supports phased data? | | | | ✓ | ✓‡ | ✓‡ |
| Supports multi-allelic variants | | | | ✓ | ✓ | ✓ |
| Efficient representation? | ✓ | | ✓ | ✓ | | ✓ |

*Hard-called genotypes are converted to probabilities in GEN and BGEN v1.1.

†By convention, males on the X chromosome are stored as homozygote females in GEN and BGEN v1.1.

‡At the time of writing, the storage of genotype likelihoods and probabilities for non-diploid samples and/or phased data in VCF/BCF is not fully specified.
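To make the first two footnotes concrete, here is a minimal sketch of what "converted to probabilities" and "stored as homozygote females" could look like. It assumes genotypes are coded as the number of copies of the second allele and that probabilities are ordered (AA, AB, BB). The function names are hypothetical and are not part of any BGEN library.

```python
def hard_call_to_probs(genotype: int) -> tuple:
    """Turn a diploid hard call (0, 1 or 2 copies of allele B)
    into a probability triple (P(AA), P(AB), P(BB))."""
    table = {
        0: (1.0, 0.0, 0.0),  # AA
        1: (0.0, 1.0, 0.0),  # AB
        2: (0.0, 0.0, 1.0),  # BB
    }
    if genotype not in table:
        raise ValueError(f"expected 0, 1 or 2, got {genotype!r}")
    return table[genotype]

def haploid_male_to_probs(allele: int) -> tuple:
    """Encode a haploid male X call (allele 0 = A, 1 = B) as the
    corresponding homozygote, following the footnoted convention."""
    if allele not in (0, 1):
        raise ValueError(f"expected 0 or 1, got {allele!r}")
    return hard_call_to_probs(2 * allele)

def dosage(probs: tuple) -> float:
    """Expected number of copies of allele B for one sample."""
    return probs[1] + 2.0 * probs[2]
```

Note that under this convention a male carrying allele B on the X chromosome gets a dosage of 2, not 1, so downstream code has to know the sample's sex to interpret it.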

Found this on http://www.well.ox.ac.uk/~gav/bgen_format/

pjotrp commented 6 years ago

It is also important how quickly a file format can be streamed for parallel processing. Binary formats typically do no better than compressed textual data here. I see that as premature optimization ;).

I suspect that for GEMMA we will end up with our own R/qtl2-based format and convert from one of the above.

Computing probabilities is something we like to control. Also, for maintenance reasons, it is not a great idea to have GEMMA support multiple formats. One type is enough. Conversion will be rapid, so we can pipe it in.
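A rough sketch of what "pipe it in" could look like on the consuming side: read converted text line by line from a stream and hand it on in batches that workers could process in parallel. The input layout assumed here is GEN-style (five leading columns, then three probabilities per sample). That layout, the batch size and both function names are assumptions for illustration only, not the format GEMMA or faster_lmm_d actually uses.

```python
from itertools import islice

def parse_gen_line(line: str) -> tuple:
    """Parse one GEN-style line into (rsid, per-sample dosages).

    Assumed columns: snp_id rsid position allele_a allele_b,
    followed by P(AA) P(AB) P(BB) for each sample.
    """
    fields = line.split()
    if len(fields) < 5:
        raise ValueError("expected at least five leading columns")
    rsid = fields[1]
    probs = [float(x) for x in fields[5:]]
    if len(probs) % 3 != 0:
        raise ValueError("probabilities must come in triples")
    # Dosage of allele B = P(AB) + 2 * P(BB) for each sample.
    dosages = [probs[i + 1] + 2.0 * probs[i + 2]
               for i in range(0, len(probs), 3)]
    return rsid, dosages

def stream_batches(lines, batch_size: int = 1000):
    """Lazily yield lists of parsed variants from any iterable of lines,
    so the whole file never has to be held in memory."""
    parsed = (parse_gen_line(line) for line in lines if line.strip())
    while True:
        batch = list(islice(parsed, batch_size))
        if not batch:
            return
        yield batch
```

Because `stream_batches` accepts any iterable of lines, it works the same way on `sys.stdin` at the end of a shell pipeline as on an open file, and decompression can stay outside the program.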

genetics-statistics / faster_lmm_d

BGEN support #37