Closed olivierlabayle closed 1 year ago
Sure, we can do that. One thing is that it will not be much faster than you writing a function for that yourself. BGEN format has variant-by-variant compression, and the computational bottleneck is the decompression for each variant.
I need clarification here--are you working on phased data or unphased data? If unphased data, isn't it natural to store a dosage matrix rather than doing genotype calling?
My goal eventually would be to convert any probability array returned by probabilities!
to a CategoricalVector
(or similar if you don't want to depend on CategoricalArrays) of genotypes given a calling threshold. This would be for both phased and unphased data as I'd like to support both in my project.
I'm sorry for not getting back to you earlier. Are you planning to limit the use of it for human data (biallelic, diploid)? If so, it should be relatively simple than what I initially imagined.
I might use some ideas like this: https://www.cog-genomics.org/plink/2.0/input#dosage_import_settings
Sorry, I missed that. Yes at the moment my project is limited to human data!
Since the target is biallelic and diploid, I am adding a function that takes the dosage vector as an input and an integer genotype vector as an output, named hardcall!()
and hardcall()
. The documentation will be also available. Please note that the first_allele_dosage!()
function supports both phased and unphased data. This will be available in v0.1.13. Please let me know if you need further support.
Check out v0.1.14.
Hi,
Sorry for all the feature requests, I'm working a lot with BGEN files and would love to move all of my preprocessing to Julia if possible.
In SnpArrays, there is the possibility to
convert
the snp array to genotypes and I was wondering if it could be possible to have the same function here using the probabilities contained in the BGEN file, possibly with a "calling" threshold.Cheers, Olivier