Closed by johnlees 1 year ago
Regarding the handling of ambiguous bases, I think it would be useful to have an option to either ignore them or treat them as partial matches by multiplying probability vectors. This way, users can choose the method that best suits their specific use case and data quality.
In my case, using a SNP cutoff of 2 and 96% similarity between k-mers (in v1 ska distance), it might be more appropriate to treat ambiguous bases as partial matches by multiplying probability vectors. I guess many users would compare highly similar species. So if you have to choose one, accounting for the uncertainty introduced by ambiguous bases when calculating the distances may provide a more accurate representation of the true relationship between samples.
This is almost ready in #35. You can filter ambiguous bases and set a frequency cutoff if you wish, using the same options as ska align.
See older implementation here: https://github.com/simonrharris/SKA/wiki/ska-distance
Should be easy enough to do with XX^T, though another option would be to convert to sparse ACGT vecs.
What about ambiguous bases? Could just ignore, or could add 1/2 or 1/3 for 'partial' match (i.e. multiply probability vectors)
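To make the XX^T and partial-match ideas concrete, here is a minimal sketch in plain Python. It is not the ska implementation: the function names (`prob_vector`, `encode`, `match_matrix`, `distance_matrix`) are hypothetical, and it assumes each sample has already been reduced to a string of middle bases over the same ordered set of split k-mers. Each base becomes a probability vector over ACGT, the vectors are concatenated into one row of X per sample, and the entries of XX^T are then the expected number of matching sites between each pair.

```python
# Illustrative sketch only; names and input format are assumptions, not ska's API.
BASES = "ACGT"

# IUPAC ambiguity codes mapped to the bases they may represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def prob_vector(code):
    """Uniform probability over the bases an IUPAC code allows, in ACGT order."""
    allowed = IUPAC[code.upper()]
    return [1.0 / len(allowed) if base in allowed else 0.0 for base in BASES]


def encode(seq):
    """Concatenate per-site probability vectors into one row of X (length 4 * sites)."""
    row = []
    for code in seq:
        row.extend(prob_vector(code))
    return row


def match_matrix(seqs):
    """Compute X X^T: entry (i, j) is the expected number of matching sites."""
    X = [encode(s) for s in seqs]
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]


def distance_matrix(seqs):
    """Expected mismatches: number of sites minus expected matches."""
    n_sites = len(seqs[0])
    return [[n_sites - m for m in row] for row in match_matrix(seqs)]


if __name__ == "__main__":
    samples = ["ACGT", "ACGA", "ACGR"]  # R = A or G
    for row in distance_matrix(samples):
        print(row)
```

With this encoding, A against R contributes 1/2 and A against a three-base code such as D contributes 1/3, which matches the partial scores suggested above. One consequence to be aware of: an ambiguous site compared with itself also scores below 1 (R against R gives 1/2), so a sample containing ambiguous bases has a non-zero distance to itself unless the diagonal is handled separately. Ignoring ambiguous bases instead would amount to dropping those sites before encoding.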