bacpop / ska.rust

Split k-mer analysis – version 2
https://docs.rs/ska/latest/ska/
Apache License 2.0

Add ska distance #26

Closed: johnlees closed this issue 1 year ago

johnlees commented 1 year ago

See older implementation here: https://github.com/simonrharris/SKA/wiki/ska-distance

Should be easy enough to do with XX^T, though another option would be to convert to sparse ACGT vecs.
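A minimal sketch of the XX^T idea, not the code that ended up in ska.rust. It assumes each sample has already been reduced to one middle base per split k-mer, in the same k-mer order for every sample. The names `one_hot`, `gram` and `snp_distances` are hypothetical. Each sample becomes a one-hot row over (k-mer, base), so XX^T counts matching middle bases for every pair; the same product on a presence matrix counts k-mers where both samples have a base, and the difference is the SNP count.

```rust
/// One-hot encode middle bases: 4 slots (A, C, G, T) per split k-mer.
/// Anything else (gaps, ambiguity codes) is left as all zeros, i.e. ignored.
fn one_hot(bases: &[u8]) -> Vec<f64> {
    let mut row = vec![0.0; bases.len() * 4];
    for (i, b) in bases.iter().enumerate() {
        let slot = match b.to_ascii_uppercase() {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => continue,
        };
        row[i * 4 + slot] = 1.0;
    }
    row
}

/// Dense Gram matrix X X^T: entry (i, j) is the dot product of rows i and j.
fn gram(x: &[Vec<f64>]) -> Vec<Vec<f64>> {
    x.iter()
        .map(|a| {
            x.iter()
                .map(|b| a.iter().zip(b).map(|(p, q)| p * q).sum::<f64>())
                .collect::<Vec<f64>>()
        })
        .collect()
}

/// Pairwise SNP counts: k-mers where both samples have an unambiguous base,
/// minus k-mers where those bases agree.
fn snp_distances(samples: &[&[u8]]) -> Vec<Vec<f64>> {
    let x: Vec<Vec<f64>> = samples.iter().map(|s| one_hot(s)).collect();
    // Presence matrix: 1.0 where a sample has an unambiguous base at a k-mer.
    let p: Vec<Vec<f64>> = x
        .iter()
        .map(|row| row.chunks(4).map(|c| c.iter().sum::<f64>()).collect::<Vec<f64>>())
        .collect();
    let matches = gram(&x);
    let shared = gram(&p);
    shared
        .iter()
        .zip(&matches)
        .map(|(s, m)| s.iter().zip(m).map(|(a, b)| a - b).collect::<Vec<f64>>())
        .collect()
}
```

For the three toy samples `ACGT`, `ACGA` and `AC-T`, the first and third are at distance 0 (the gap is skipped) and each is at distance 1 from the second. A real implementation would likely want a sparse representation, since the one-hot matrix is mostly zeros.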

What about ambiguous bases? We could just ignore them, or add 1/2 or 1/3 for a 'partial' match (i.e. multiply probability vectors).
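To make the two options concrete, here is a rough sketch under stated assumptions: each IUPAC code is mapped to a uniform probability vector over A, C, G, T, and the match score for a pair of bases is the dot product of their vectors. The names `iupac_probs`, `match_score` and `distance`, and the `partial` flag, are hypothetical and only illustrate the proposal; they are not taken from ska.rust.

```rust
/// Uniform probability vector over (A, C, G, T) for an IUPAC nucleotide code.
/// Gaps and unknown symbols give an all-zero vector.
fn iupac_probs(base: u8) -> [f64; 4] {
    let set: &[usize] = match base.to_ascii_uppercase() {
        b'A' => &[0],
        b'C' => &[1],
        b'G' => &[2],
        b'T' | b'U' => &[3],
        b'R' => &[0, 2],
        b'Y' => &[1, 3],
        b'S' => &[1, 2],
        b'W' => &[0, 3],
        b'K' => &[2, 3],
        b'M' => &[0, 1],
        b'B' => &[1, 2, 3],
        b'D' => &[0, 2, 3],
        b'H' => &[0, 1, 3],
        b'V' => &[0, 1, 2],
        b'N' => &[0, 1, 2, 3],
        _ => &[],
    };
    let mut p = [0.0; 4];
    for &i in set {
        p[i] = 1.0 / set.len() as f64;
    }
    p
}

/// Probability that two bases agree: the dot product of their vectors.
fn match_score(p: &[f64; 4], q: &[f64; 4]) -> f64 {
    p.iter().zip(q).map(|(a, b)| a * b).sum()
}

/// Sum of (1 - match score) over aligned middle bases.
/// With `partial == false`, any position with an ambiguity code is ignored.
fn distance(a: &[u8], b: &[u8], partial: bool) -> f64 {
    let mut total = 0.0;
    for (&x, &y) in a.iter().zip(b) {
        let (px, py) = (iupac_probs(x), iupac_probs(y));
        let nx = px.iter().filter(|&&v| v > 0.0).count();
        let ny = py.iter().filter(|&&v| v > 0.0).count();
        // Always skip gaps/unknown symbols; skip ambiguity codes unless `partial`.
        if nx == 0 || ny == 0 || (!partial && (nx > 1 || ny > 1)) {
            continue;
        }
        total += 1.0 - match_score(&px, &py);
    }
    total
}
```

With this scheme A against R (A or G) scores 1/2 and C against B (C, G or T) scores 1/3, matching the fractions above. One consequence worth noting is that an ambiguous base compared with itself scores less than 1 (R against R gives 1/2), so identical ambiguous calls still contribute to the distance in partial mode.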

cammo0p commented 1 year ago

Regarding the handling of ambiguous bases, I think it would be useful to have an option to either ignore them or treat them as partial matches by multiplying probability vectors. This way, users can choose the method that best suits their specific use case and data quality.

In my case, using a SNP cutoff of 2 and 96% similarity between k-mers (in v1 ska distance), it might be more appropriate to treat ambiguous bases as partial matches by multiplying probability vectors. I guess many users would compare highly similar species. So if you have to choose one, letting us account for the uncertainty introduced by ambiguous bases when calculating the distances may give a more accurate representation of the true relationship between samples.

johnlees commented 1 year ago

This is almost ready in #35. You can filter ambiguous bases and apply a frequency cutoff if you wish, using the same options as ska align.