Request: missing data handling

evotools / hapbin

Efficient program for calculating Extended Haplotype Homozygosity (EHH) and Integrated Haplotype Score (iHS)

GNU General Public License v3.0


Request: missing data handling #29

Open magicDGS opened 8 years ago

magicDGS commented 8 years ago

I need to perform an EHH/iHS scan on a dataset of around 30 haplotypes, in which almost every SNP has at least one haplotype with missing data. I know that the rehh R package handles this by removing those haplotypes until a set number of them are left. Nevertheless, rehh is quite slow for my dataset (1 million SNPs per chromosome).

Would it be possible to implement this behavior in hapbin? Thank you very much in advance!

camaclean commented 8 years ago

It isn't possible, for two reasons. First, HapBin requires a regular NxM binary grid of data to function. Using binary strings and bitwise operations is how it gets its performance and lower memory footprint. Since a single bit can store only 0 or 1, there is no way to store any other state, such as missing data. Second, the algorithm operates on bit-vectors of 64 and 256 haplotypes in single operations and doesn't store what the haplotypes actually are, so it's impossible to go back and remove haplotypes. It only knows the locations of haplotypes for the current and previous index.
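To make the constraint concrete, here is a minimal Python sketch of the general bit-vector idea described above. It is not HapBin's actual code: the names `pack_locus`, `popcount`, and `split_group` are hypothetical, and a plain Python integer stands in for the fixed-width 64- or 256-bit words mentioned in the comment.

```python
def pack_locus(alleles):
    """Pack one locus (a list of 0/1 alleles, one per haplotype) into an integer.

    Bit i of the result is the allele carried by haplotype i.
    """
    word = 0
    for i, allele in enumerate(alleles):
        if allele not in (0, 1):
            # A single bit has no room for a third "missing" state.
            raise ValueError("a single bit can only hold 0 or 1")
        word |= allele << i
    return word


def popcount(word):
    """Number of set bits, i.e. number of haplotypes in the set."""
    return bin(word).count("1")


def split_group(group, locus_word):
    """Split a set of haplotypes by the allele they carry at one locus.

    Both halves come from one bitwise operation each, regardless of how
    many haplotypes are in the group.
    """
    carriers = group & locus_word
    non_carriers = group & ~locus_word
    return carriers, non_carriers
```

For example, `split_group(0b1111, pack_locus([1, 0, 1, 1]))` separates four haplotypes into those carrying allele 1 and those carrying allele 0 without ever looking at an individual haplotype, which is why a per-haplotype "missing" flag does not fit into this representation.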

camaclean commented 8 years ago

Actually, I suppose it would be possible to create a bit-mask to mask off the missing data. However, this would require re-calculating all the EHHs with a new mask each time missing data was encountered, as it wouldn't be known beforehand how many loci away to calculate. The performance would depend upon how much data is missing.
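The masking idea can be sketched as follows. This is only an illustration under stated assumptions, not something HapBin implements: `masked_ehh` and its parameters are hypothetical, each locus is assumed to come with a second bit-vector marking which haplotypes are missing there, and EHH is taken as the fraction of haplotype pairs that remain identical from the core locus out to the target locus.

```python
def popcount(word):
    """Number of set bits in an integer bit-vector."""
    return bin(word).count("1")


def pairs(n):
    """Number of unordered pairs among n haplotypes."""
    return n * (n - 1) // 2


def masked_ehh(loci, missing, core, end, n_haps):
    """EHH of allele 1 at locus `core`, extended to locus `end` (end >= core).

    loci[j]    : bit i set if haplotype i carries allele 1 at locus j
    missing[j] : bit i set if haplotype i has no call at locus j (assumed input)
    Only haplotypes with no missing call anywhere in core..end are used.
    Returns None when fewer than two usable haplotypes carry the core allele.
    """
    valid = (1 << n_haps) - 1
    for j in range(core, end + 1):
        valid &= ~missing[j]  # drop haplotypes missing at any locus in range

    # Start with the usable core-allele carriers, then split locus by locus.
    groups = [loci[core] & valid]
    for j in range(core + 1, end + 1):
        split = []
        for g in groups:
            for part in (g & loci[j], g & ~loci[j]):
                if part:
                    split.append(part)
        groups = split

    total = popcount(loci[core] & valid)
    if total < 2:
        return None
    return sum(pairs(popcount(g)) for g in groups) / pairs(total)
```

Note that the mask depends on `end`: extending the range by one locus can remove further haplotypes, which changes the set of carriers used at every earlier locus as well. That is the reason given above for having to recalculate the EHH values whenever new missing data is encountered, and why the cost would grow with the amount of missing data.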

magicDGS commented 8 years ago

@camaclean, thank you for your quick response. If it could be implemented with the second approach I would be very grateful, because I would use it with my dataset.