OpenMendel / BGEN.jl

MIT License
10 stars 0 forks source link

convert to called genotypes #29

Closed olivierlabayle closed 1 year ago

olivierlabayle commented 1 year ago

Hi,

Sorry for all the feature requests, I'm working a lot with BGEN files and would love to move all of my preprocessing to Julia if possible.

In SnpArrays, there is the possibility to convert the snp array to genotypes and I was wondering if it could be possible to have the same function here using the probabilities contained in the BGEN file, possibly with a "calling" threshold.

Cheers, Olivier

kose-y commented 1 year ago

Sure, we can do that. One thing is that it will not be much faster than you writing a function for that yourself. BGEN format has variant-by-variant compression, and the computational bottleneck is the decompression for each variant.

kose-y commented 1 year ago

I need clarification here--are you working on phased data or unphased data? If unphased data, isn't it natural to store a dosage matrix rather than doing genotype calling?

olivierlabayle commented 1 year ago

My goal eventually would be to convert any probability array returned by probabilities! to a CategoricalVector (or similar if you don't want to depend on CategoricalArrays) of genotypes given a calling threshold. This would be for both phased and unphased data as I'd like to support both in my project.

kose-y commented 1 year ago

I'm sorry for not getting back to you earlier. Are you planning to limit the use of it for human data (biallelic, diploid)? If so, it should be relatively simple than what I initially imagined.

I might use some ideas like this: https://www.cog-genomics.org/plink/2.0/input#dosage_import_settings

olivierlabayle commented 1 year ago

Sorry, I missed that. Yes at the moment my project is limited to human data!

kose-y commented 1 year ago

Since the target is biallelic and diploid, I am adding a function that takes the dosage vector as an input and an integer genotype vector as an output, named hardcall!() and hardcall(). The documentation will be also available. Please note that the first_allele_dosage!() function supports both phased and unphased data. This will be available in v0.1.13. Please let me know if you need further support.

kose-y commented 1 year ago

Check out v0.1.14.