Open magicDGS opened 8 years ago
It isn't possible, for two reasons. First, HapBin requires a regular NxM binary grid of data to function. Using binary strings and bitwise operations is how it gets its performance and lower memory footprint. Since a single bit can only store 0 or 1, there is no way to store any other state, such as missing data. Second, the algorithm operates on bit-vectors of 64 and 256 haplotypes in single operations and doesn't store what the haplotypes actually are, so it's impossible to go back and remove haplotypes. It only knows the locations of haplotypes for the current and previous index.
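To illustrate the bit-vector idea, here is a minimal Python sketch. It is not HapBin's actual code: `pack`, `split`, and `ehh` are hypothetical names, Python integers stand in for the 64/256-bit vectors, and for simplicity it starts with all haplotypes in one group rather than conditioning on a core allele.

```python
from math import comb

def pack(column):
    """Pack one locus (a list of 0/1 alleles, one per haplotype) into an integer."""
    bits = 0
    for i, allele in enumerate(column):
        if allele not in (0, 1):
            # This is the limitation described above: no room for "missing".
            raise ValueError("one bit per haplotype cannot encode a third state")
        bits |= allele << i
    return bits

def split(groups, locus_bits):
    """Split each group of still-identical haplotypes by the allele at one locus."""
    out = []
    for g in groups:
        # g & locus_bits: members carrying 1; g & ~locus_bits: members carrying 0.
        for part in (g & locus_bits, g & ~locus_bits):
            if part:
                out.append(part)
    return out

def ehh(groups, n):
    """Fraction of haplotype pairs (out of n haplotypes) that are still identical."""
    return sum(comb(bin(g).count("1"), 2) for g in groups) / comb(n, 2)
```

With four haplotypes, `split([0b1111], pack([1, 1, 0, 0]))` leaves two groups of two, so two of the six pairs are still identical. A value such as 2 for "missing" is rejected by `pack`, since there is no bit pattern left to represent it.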
Actually, I suppose it would be possible to create a bit-mask to mask off the missing data. However, this would require re-calculating all the EHHs with a new mask each time missing data was encountered, as it wouldn't be known beforehand how many loci away to calculate. The performance would depend upon how much data is missing.
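A rough Python sketch of one way to read the mask idea, under assumptions that are not from HapBin itself: each locus has a second bit-vector `observed` (1 = genotype present), haplotypes are dropped as soon as they have missing data at any locus scanned so far, and all names here are hypothetical. Whenever the mask shrinks, the EHH values for the earlier loci are recomputed with the smaller haplotype set, which is where the extra cost comes from.

```python
from math import comb

def popcount(x):
    return bin(x).count("1")

def split(groups, locus_bits):
    """Split each group of still-identical haplotypes by the allele at one locus."""
    out = []
    for g in groups:
        for part in (g & locus_bits, g & ~locus_bits):
            if part:
                out.append(part)
    return out

def ehh_value(groups, mask):
    """EHH over the haplotypes kept by mask; NaN if fewer than two remain."""
    pairs = comb(popcount(mask), 2)
    if pairs == 0:
        return float("nan")
    return sum(comb(popcount(g), 2) for g in groups) / pairs

def ehh_curve(alleles, mask):
    """EHH at every locus in alleles, using only the haplotypes kept by mask."""
    groups = [mask] if mask else []
    curve = []
    for a in alleles:
        groups = split(groups, a)
        curve.append(ehh_value(groups, mask))
    return curve

def ehh_masked(alleles, observed, n):
    """alleles[i] and observed[i] are bit-vectors for locus i, ordered outward.

    Returns the EHH curve and the number of earlier loci that were recalculated.
    """
    mask = (1 << n) - 1
    groups, curve, recalculated = [mask], [], 0
    for i, (a, obs) in enumerate(zip(alleles, observed)):
        if mask & obs != mask:
            # Missing data: shrink the mask and redo every earlier EHH value.
            mask &= obs
            groups = [g & mask for g in groups if g & mask]
            curve = ehh_curve(alleles[:i], mask)
            recalculated += i
        groups = split(groups, a)
        curve.append(ehh_value(groups, mask))
    return curve, recalculated
```

In this sketch the `recalculated` counter grows with every locus that introduces new missing data, so a dataset where almost every SNP has a missing haplotype would spend most of its time redoing earlier loci.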
@camaclean, thank you for your quick response. If it could be implemented with the second approach, I would be very grateful, because I would use it with my dataset.
I need to perform an EHH/iHS scan on a dataset with around 30 haplotypes, but where almost every SNP has at least one haplotype with missing data. I know that the rehh R package handles this by removing these haplotypes until only a given number of them are left. Nevertheless, rehh is quite slow for my dataset (1 million SNPs per chromosome). Would it be possible to implement this behavior in hapbin? Thank you very much in advance!