abstractqqq / polars_ds_extension

Polars extension for general data science use cases
MIT License
324 stars 22 forks source link

Feature Request: BallTree algorithm with haversine distance #237

Open blaylockbk opened 4 weeks ago

blaylockbk commented 4 weeks ago

This is a neat polars plug in. Thanks! Do you have plans for implementing the Ball Tree algorithm like in scikit-learn?

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.BallTree.html

The haversine distance metric to find nearest neighbors between latitude/longitude points on Earth would be very useful.

abstractqqq commented 4 weeks ago

Yea... I used to have haversine dist for kdtree.. Then I realized kdtree only works for LP distances..

It will take me quite some time before I can get to implementing the ball tree. The current focus is on regression and perfecting the kd-tree algorithms...

AtlasPilotPuppy commented 1 day ago

@blaylockbk @abstractqqq I would be interested in implementing this. The big question is what is the best format for the output. I am assuming the inputs will be a column ofPoints which can be structs containing an x and a y coordinate. Or a column for the x and another column for the y coordinate. Would the output contain the parent of each node if any and the distance to the parent ? What would be the prefered way to take the arguments for t,k,Q,B - as named arguments potentially.

abstractqqq commented 1 day ago

@AtlasPilotPuppy This is great! And btw I am working on bringing the current Rust kdtree impl to Python, like what SciPy offers. I expect to merge this into main this weekend. The benchmarks look good, with the Rust Kdtree being faster when k is large and much faster when run in parallel mode. In other cases, it is slightly slower.

In terms of format, can you follow what I have for query_radius_ptwise or query_knn_ptwise? I think @blaylockbk would be ok with the API I currently have.

When you implement the ball tree, it would be good if you can implement the SpatialQueries trait (in the arkadia subcrate.. The project structure is kind of messy now I know but please excuse it for the time being...), as that would reduce a lot of code redundancy. If that is impossible, or you propose some much better trait for these data structures, please let me know!

Some other thoughts I have: in the special case of haversine, we can likely use [f64; 2] which will be more efficient than a Vec. But I leave that decision to you.

abstractqqq commented 21 hours ago

@AtlasPilotPuppy This branch is where the current kdtree work lives..https://github.com/abstractqqq/polars_ds_extension/tree/pykd

In main, it is called SpacialQueries and lives in mod.rs. I realized it was a typo and should be "spatial" and fixed it in the pykd branch lol. Thank you for checking!

AtlasPilotPuppy commented 21 hours ago

Sorry I found it. It was just a spelling discrepancy in the comment. Thanks

abstractqqq commented 8 hours ago

Sorry I found it. It was just a spelling discrepancy in the comment. Thanks

I am merging the progress in the branch into main. I made quite a few small changes in arkadia. Hopefully it shouldn't cause too much conflict

AtlasPilotPuppy commented 2 hours ago

@abstractqqq @blaylockbk Wanted some input. I have a working implementation of BallTree although the api right now is not usable. So i can implement BallTree very similar to query_radius_ptwise or query_knn_ptwise But BallTree is somewhat expensive to compute and can be queried repeatedly. I am looking for input on What is the best way to structure it similar to the Sklearn BallTree API. This would allow us to create a model that can be queried repeatedly. The challenge is It does not fit very well into the DataFrame / Plugin paradigm. While we can provide an interface similar to query_radius_ptwise or query_knn_ptwise that will have to rebuild the tree for each query. My thought process at the moment is I can leave my implementation in there and wrap a polars plugins compatible api. We can figure out how to provide a sklearn style interface later.

abstractqqq commented 2 hours ago

@AtlasPilotPuppy you are exactly right in that this kind of stuff doesn't fit in dataframe/plugin well, however, I would argue that for small datasets, it is quite convenient to query within a dataframe, and typically for a few thousand knn searches, the speed gain can already justify the convenience to use this in dataframes. If people truely want to do vector search on huge dataset, they should go for LanceDB and other more sophisticated ANN solutions that usually involves local storage and more complicated space partitions.

I think once you have the balltree data structure ready, I can make it available in dataframes as queries. Maybe only for haversine distance because most of the spatial queries can be done with kdtree already and kdtrees are somewhat cheaper.

My recent branch pykd (merged into main already) has an implementation of Kdtree that works with numpy input. You can find the Python wrapper in src/pymodels/py_kdt.rs and in python/polars_ds/kdtree.py. Strictly speaking we don't need the additional wrapper in kdtree.py, but that is a very thin layer and provides linter and additional documentation to users. All SciPy kdtree's methods are available there too, with some name changes. I think this is the direction you want to go.

AtlasPilotPuppy commented 2 hours ago

Excellent . Thank you for the very quick response. I agree. With dataframes of a few thousand rows there isn't much difference. It only becomes relevant at larger scale.

abstractqqq commented 2 hours ago

Excellent . Thank you for the very quick response. I agree. With dataframes of a few thousand rows there isn't much difference. It only becomes relevant at larger scale.

And btw, unfortunately in order for Python Interop to work, you have to use OwnedLeaf because Python lifetime cannot be safely passed into Rust. So data has to be copied. Again, I don't think this is for large scale vector searches. So the copying is not a big concern.