Open prasunanand opened 6 years ago
Might be helpful for future reference.
The following table compares the features of several genotype file formats:
| | PLINK binary | GEN | BGEN v1.1 | BGEN v1.2 / v1.3 | VCF | BCF |
|---|---|---|---|---|---|---|
| Supports unphased genotype calls | ✓ | ✓* | ✓* | ✓ | ✓ | ✓ |
| Supports unphased genotype probabilities | | ✓ | ✓ | ✓ | ✓ | ✓ |
| Supports NULL/outlier probability, e.g. NULL class from CHIAMO / GenoSNP | | ✓ | ✓ | | ✓ | ✓ |
| Supports non-diploid samples | | † | † | ✓ | ✓‡ | ✓‡ |
| Supports phased data? | | | | ✓ | ✓‡ | ✓‡ |
| Supports multi-allelic variants | | | | ✓ | ✓ | ✓ |
| Efficient representation? | ✓ | | ✓ | ✓ | | ✓ |
*Hard-called genotypes are converted to probabilities in GEN and BGEN v1.1.
†By convention, males on the X chromosome are stored as homozygote females in GEN and BGEN v1.1.
‡At the time of writing, the storage of genotype likelihoods and probabilities for non-diploid samples and/or phased data in VCF/BCF is not fully specified.
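The first two footnotes can be illustrated with a small sketch. This is not code from GEMMA or from any BGEN library: the function names `hard_call_to_probs` and `male_x_to_probs` are hypothetical, the triple ordering (AA, AB, BB) follows the GEN convention, and representing a missing call as an all-zero triple is an assumption about how GEN-style files usually encode missingness.

```python
from typing import Optional, Tuple

# Probability triple in GEN order: P(AA), P(AB), P(BB).
Probs = Tuple[float, float, float]


def hard_call_to_probs(n_b_alleles: Optional[int]) -> Probs:
    """Turn a hard call (number of B alleles: 0, 1 or 2) into probabilities.

    A missing call (None) becomes an all-zero triple; this is an assumed
    convention for illustration.
    """
    if n_b_alleles is None:
        return (0.0, 0.0, 0.0)
    if n_b_alleles not in (0, 1, 2):
        raise ValueError(f"expected 0, 1 or 2 B alleles, got {n_b_alleles}")
    probs = [0.0, 0.0, 0.0]
    probs[n_b_alleles] = 1.0  # all probability mass on the called genotype
    return (probs[0], probs[1], probs[2])


def male_x_to_probs(allele: int) -> Probs:
    """Store a haploid male X call (0 = A, 1 = B) as a homozygous diploid."""
    if allele not in (0, 1):
        raise ValueError(f"expected allele 0 or 1, got {allele}")
    return hard_call_to_probs(2 * allele)
```

The second function shows why the † convention loses information: once written, a male carrying B on the X chromosome is indistinguishable from a BB female unless sex is tracked separately.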
Found this on http://www.well.ox.ac.uk/~gav/bgen_format/
How quickly a file format can be streamed for parallel processing also matters. Binary formats typically do no better than compressed text here, so I see that as premature optimization ;).
I suspect that for GEMMA we will end up with our own R/qtl2-based format and convert from one of the above.
Computing probabilities is something we like to control. It is also not a great idea for GEMMA to support multiple formats, for maintenance reasons. One type is enough. Conversion will be rapid, so we can pipe it in.
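To make the "pipe it in" idea concrete, here is a minimal sketch of a streaming reader. The line layout (a marker ID followed by tab-separated per-sample values, with `#` comment lines) is purely hypothetical and is not the R/qtl2 format or anything GEMMA defines; `iter_markers` and `open_genotypes` are illustrative names.

```python
import gzip
import sys
from typing import Iterable, Iterator, List, Tuple


def iter_markers(lines: Iterable[str]) -> Iterator[Tuple[str, List[float]]]:
    """Yield (marker_id, values) one line at a time, never loading the file."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        marker, *values = line.split("\t")
        yield marker, [float(v) for v in values]


def open_genotypes(path: str):
    """Open a text stream: '-' means stdin, '.gz' means gzip-compressed text."""
    if path == "-":
        return sys.stdin
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")
```

With this shape, a converter can write to stdout and the consumer can call `iter_markers(open_genotypes("-"))`, so only one input format has to be maintained on the GEMMA side.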
In GEMMA, BGEN support was added in a PR. However, there are no tests to validate the code, which I need before I can port it to faster_lmm_d.
I need to test BGEN files with 500k samples. I believe this would be a great exercise to test GPU support.
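As a possible first test, a header sanity check could look like the sketch below. It follows my reading of the header block layout in the BGEN specification linked above (little-endian 32-bit fields, a `bgen` magic number, and a flags word holding compression and layout), so it should be checked against the spec before being relied on. `read_bgen_header` is a hypothetical helper, not part of GEMMA or faster_lmm_d, and the bytes at the end are synthetic, not taken from a real file.

```python
import io
import struct
from typing import BinaryIO, Dict


def read_bgen_header(f: BinaryIO) -> Dict[str, int]:
    """Parse the offset word and header block at the start of a BGEN stream."""
    fixed = f.read(16)
    if len(fixed) != 16:
        raise ValueError("stream too short to hold a BGEN header")
    offset, header_len, n_variants, n_samples = struct.unpack("<4I", fixed)
    magic = f.read(4)
    if magic not in (b"bgen", b"\x00\x00\x00\x00"):
        raise ValueError(f"unexpected magic number: {magic!r}")
    f.read(header_len - 20)  # skip the free data area
    (flags,) = struct.unpack("<I", f.read(4))
    return {
        "offset": offset,
        "n_variants": n_variants,
        "n_samples": n_samples,
        "compression": flags & 0x3,       # 0 = none, 1 = zlib, 2 = zstd
        "layout": (flags >> 2) & 0xF,     # 1 = v1.1, 2 = v1.2 and later
        "has_sample_ids": (flags >> 31) & 0x1,
    }


# Synthetic header: 3 variants, 500k samples, zlib compression, layout 2.
raw = (
    struct.pack("<4I", 20, 20, 3, 500_000)
    + b"bgen"
    + struct.pack("<I", 1 | (2 << 2))
)
header = read_bgen_header(io.BytesIO(raw))
```

A check like this is cheap even on very large files, because it reads only the first few dozen bytes, which makes it a reasonable smoke test to run before the more expensive genotype-level comparisons.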
PS: This thread tracks the implementation of BGEN file support.