abstractqqq / polars_ds_extension

Polars extension for general data science use cases
MIT License

KNN for string distance metrics #50

Closed coinflip112 closed 5 months ago

coinflip112 commented 5 months ago

It would be extremely useful to use knn_ptwise on a string column (with string distance metrics). This should be feasible in principle, right? Levenshtein distance, for example, defines a metric space. Does the KDTree allow for different distances?

abstractqqq commented 5 months ago

Yes, theoretically it is possible, but the kdtree package currently does not allow strings. I could maybe use [u8] to represent strings, but the Levenshtein distance in this package is based on RapidFuzz's implementation, which operates on chars instead of [u8] so that it is correct even for languages like Russian. In short, it is possible but requires way too much hacking right now...
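To illustrate the chars-versus-bytes point, here is a minimal Python sketch (not the package's Rust code and not RapidFuzz's implementation). The `levenshtein` helper below is a hypothetical textbook dynamic program written only for this example. It is applied once to the characters of two Russian words and once to their UTF-8 bytes, and the two results disagree because a single Cyrillic letter occupies two bytes.

```python
from typing import Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Edit distance between two sequences (str, bytes, list, ...)."""
    # Two-row dynamic program: prev[j] holds the distance between the
    # items of `a` consumed so far (minus the current one) and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # delete x from a
                curr[j - 1] + 1,     # insert y into a
                prev[j - 1] + cost,  # substitute x with y (or match)
            ))
        prev = curr
    return prev[-1]


s1, s2 = "дом", "дым"  # "house" vs "smoke": exactly one letter differs

# Comparing Unicode code points: one substitution.
char_dist = levenshtein(s1, s2)

# Comparing raw UTF-8 bytes: "о" is D0 BE and "ы" is D1 8B, so both
# bytes of that letter differ and the distance is inflated.
byte_dist = levenshtein(s1.encode("utf-8"), s2.encode("utf-8"))

print(char_dist, byte_dist)
```

A tree keyed on `[u8]` would therefore rank neighbors differently from a character-based distance whenever multi-byte letters are involved.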

coinflip112 commented 5 months ago

Makes sense. I haven't worked in Rust before, but it's on my "would like to but don't have time for" list. If I ever find the time, I could take a crack at it 😅

abstractqqq commented 5 months ago

I am closing this issue since I won't be implementing it for now. But it is something on my mind, so feel free to ping me if you find something that can facilitate the implementation. Thanks!

abstractqqq commented 5 months ago

An inefficient implementation has been added. Please refer to the examples. I believe it is available in versions >= 0.2.3.
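The thread does not say how the added implementation works. As a rough picture of what an "inefficient" string KNN could look like, the Python sketch below uses plain brute force: it compares every string with every other string and keeps the k closest. This is an assumption made for illustration only. `knn_strings` and `levenshtein` are hypothetical names, not functions from the package.

```python
import heapq
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (textbook dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def knn_strings(words: List[str], k: int) -> List[List[str]]:
    """For each word, return its k nearest other words by edit distance.

    Brute force: O(n^2) distance computations, with no tree or index.
    Ties are broken alphabetically because (distance, word) tuples are compared.
    """
    result = []
    for i, w in enumerate(words):
        candidates = (
            (levenshtein(w, other), other)
            for j, other in enumerate(words)
            if j != i  # skip the word itself
        )
        result.append([other for _, other in heapq.nsmallest(k, candidates)])
    return result


words = ["kitten", "sitting", "mitten", "bitten", "knitting"]
neighbors = knn_strings(words, k=2)
for w, nn in zip(words, neighbors):
    print(w, "->", nn)
```

The quadratic cost is the price of not having a tree structure that understands string distances, which is the limitation discussed earlier in the thread.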

coinflip112 commented 5 months ago

Nice! Thanks! Tested and works exactly as I'd expect :) 🙏