Add ska distance

johnlees commented 1 year ago

See older implementation here: https://github.com/simonrharris/SKA/wiki/ska-distance

Should be easy enough to do with XX^T, though another option would be to convert to sparse ACGT vecs.
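To make the XX^T idea concrete, here is a minimal pure-Python sketch, not the ska.rust implementation. It assumes the samples are already aligned to the same length, and the helper names `one_hot` and `match_matrix` are hypothetical. Each sequence is flattened into a 0/1 vector with four slots (A, C, G, T) per position, so the dot product of two rows counts the positions where both samples carry the same ACGT base.

```python
def one_hot(seq):
    """Flatten a sequence into a 0/1 vector with four slots (A, C, G, T) per position.

    Any symbol outside ACGT (for example '-' or 'N') becomes four zeros.
    """
    return [1 if base == nt else 0 for base in seq.upper() for nt in "ACGT"]


def match_matrix(seqs):
    """Compute X X^T for the one-hot matrix X (one row per sample).

    Entry (i, j) is the number of positions where samples i and j
    share the same ACGT base.
    """
    X = [one_hot(s) for s in seqs]
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]


if __name__ == "__main__":
    for row in match_matrix(["ACGT", "ACCT", "A-GT"]):
        print(row)
```

The diagonal gives the number of ACGT positions in each sample. In practice a sparse matrix library would likely replace the nested loops; that choice is left open here.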

Matches: ACGT equal
SNPs: ACGT but not equal
Mismatches: '-' and any non-'-'
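
The three categories above can be sketched as a per-position classifier for two aligned sequences. This is an illustration only: the function name `classify_positions` is hypothetical, and putting the remaining cases ('-' against '-', or an ambiguous symbol against an ACGT base) into an "ignored" bucket is an assumption, since the list above does not cover them.

```python
from collections import Counter

ACGT = set("ACGT")


def classify_positions(a, b):
    """Count matches, SNPs and mismatches between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    counts = Counter(matches=0, snps=0, mismatches=0, ignored=0)
    for x, y in zip(a.upper(), b.upper()):
        if x in ACGT and y in ACGT:
            # Both are unambiguous bases: equal is a match, unequal is a SNP.
            counts["matches" if x == y else "snps"] += 1
        elif (x == "-") != (y == "-"):
            # Exactly one side is a gap: '-' against any non-'-'.
            counts["mismatches"] += 1
        else:
            # Assumption: everything else is left out of the counts.
            counts["ignored"] += 1
    return counts


if __name__ == "__main__":
    print(classify_positions("ACGT-N-", "ACCTAA-"))
```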

What about ambiguous bases? Could just ignore, or could add 1/2 or 1/3 for 'partial' match (i.e. multiply probability vectors)
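
A minimal sketch of the "multiply probability vectors" option, assuming each IUPAC code is spread uniformly over the bases it can represent (the uniform weighting and the names `prob_vector` and `partial_match` are assumptions for illustration). The partial match score is the dot product of the two vectors, which gives the 1/2 and 1/3 values mentioned above when an unambiguous base meets a two-way or three-way code.

```python
# Standard IUPAC nucleotide codes mapped to the bases they can represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def prob_vector(base):
    """Return probabilities over (A, C, G, T), uniform across the allowed bases."""
    options = IUPAC[base.upper()]
    return [1.0 / len(options) if nt in options else 0.0 for nt in "ACGT"]


def partial_match(x, y):
    """Dot product of the two probability vectors: 1.0 for identical
    unambiguous bases, 0.0 for incompatible codes, fractional otherwise."""
    return sum(p * q for p, q in zip(prob_vector(x), prob_vector(y)))


if __name__ == "__main__":
    print(partial_match("A", "R"))  # A against A-or-G
    print(partial_match("A", "D"))  # A against A-or-G-or-T
```

One consequence of this scoring is that two identical ambiguous codes do not score 1: `partial_match("R", "R")` is 0.5, because the two samples could still differ at that site.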

cammo0p commented 1 year ago

Regarding the handling of ambiguous bases, I think it would be useful to have an option to either ignore them or treat them as partial matches by multiplying probability vectors. This way, users can choose the method that best suits their specific use case and data quality.

In my case, using a SNP cutoff of 2 and a 96% similarity between k-mers (in v1 ska distance), it might be more appropriate to treat ambiguous bases as partial matches by multiplying probability vectors. I guess many users would compare highly similar species. So if you have to choose one, accounting for the uncertainty introduced by ambiguous bases when calculating the distances may provide a more accurate representation of the true relationship between samples.

johnlees commented 1 year ago

This is almost ready in #35. You can filter ambiguous bases and apply a frequency cutoff if you wish, using the same options as ska align.

bacpop / ska.rust

Add ska distance #26