brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"
MIT License

Extracted genotypes format? #128

Closed: lvclark closed this issue 7 months ago

lvclark commented 7 months ago

Somalier extracts genotypes from BAMs so much faster than anything I've attempted to write. I would love to be able to use those genotypes in other analyses. (In my particular case, I need principal components to use as pop structure covariates in association analysis, where I am just running the analysis on a few genes and don't want to have to genotype the whole genome.) Could the format of the `somalier extract` output be documented a little more thoroughly, so that someone like me could read those bytes into Python or R and convert them to numeric genotypes?

brentp commented 7 months ago

Hi, you can see the Python code that reads the somalier files here: https://github.com/brentp/somalier/blob/master/scripts/ancestry-predict.py. Remember that the files hold only minimal information and not true genotypes. Note that the reading function in that script discards the Y-sites, but you can still see the format from it. Happy to answer any questions.
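
For readers after the "numeric genotypes" step, here is a minimal sketch that is not taken from the somalier repository. It assumes each site reduces to a pair of reference and alternate allele counts once the file has been parsed. That is an assumption here, and the linked script remains the authority on the actual byte layout. The function name `counts_to_dosage`, the depth cutoff, and the allele-fraction thresholds are all hypothetical, illustrative choices, not somalier's own defaults.

```python
from typing import List, Optional, Tuple


def counts_to_dosage(
    n_ref: int,
    n_alt: int,
    min_depth: int = 7,      # illustrative cutoff, not somalier's default
    het_low: float = 0.2,    # illustrative lower bound for a het call
    het_high: float = 0.8,   # illustrative upper bound for a het call
) -> Optional[int]:
    """Turn ref/alt read counts at one site into an alt-allele dosage.

    Returns 0 (hom-ref), 1 (het), 2 (hom-alt), or None when the site
    has too few reads to call.
    """
    depth = n_ref + n_alt
    if depth < min_depth:
        return None
    alt_fraction = n_alt / depth
    if alt_fraction < het_low:
        return 0
    if alt_fraction > het_high:
        return 2
    return 1


# Made-up (n_ref, n_alt) pairs standing in for parsed per-site counts.
sites: List[Tuple[int, int]] = [(30, 0), (14, 16), (1, 25), (2, 1)]
dosages = [counts_to_dosage(n_ref, n_alt) for n_ref, n_alt in sites]
print(dosages)  # [0, 1, 2, None]
```

Stacking one such dosage vector per sample gives a samples-by-sites matrix that a PCA routine can take, once the uncalled (`None`) sites have been imputed or dropped.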

lvclark commented 7 months ago

Wonderful, thank you for the quick reply!