Open blaylockbk opened 4 weeks ago
Yea... I used to have haversine dist for kdtree.. Then I realized kdtree only works for LP distances..
It will take me quite some time before I can get to implementing the ball tree. The current focus is on regression and perfecting the kd-tree algorithms...
@blaylockbk @abstractqqq I would be interested in implementing this. The big question is what is the best format for the output. I am assuming the inputs will be a column ofPoints which can be structs containing an x and a y coordinate. Or a column for the x and another column for the y coordinate. Would the output contain the parent of each node if any and the distance to the parent ? What would be the prefered way to take the arguments for t,k,Q,B - as named arguments potentially.
@AtlasPilotPuppy This is great! And btw I am working on bringing the current Rust kdtree impl to Python, like what SciPy offers. I expect to merge this into main this weekend. The benchmarks look good, with the Rust Kdtree being faster when k is large and much faster when run in parallel mode. In other cases, it is slightly slower.
In terms of format, can you follow what I have for query_radius_ptwise
or query_knn_ptwise
? I think @blaylockbk would be ok with the API I currently have.
When you implement the ball tree, it would be good if you can implement the SpatialQueries trait (in the arkadia subcrate.. The project structure is kind of messy now I know but please excuse it for the time being...), as that would reduce a lot of code redundancy. If that is impossible, or you propose some much better trait for these data structures, please let me know!
Some other thoughts I have: in the special case of haversine, we can likely use [f64; 2] which will be more efficient than a Vec. But I leave that decision to you.
@AtlasPilotPuppy This branch is where the current kdtree work lives..https://github.com/abstractqqq/polars_ds_extension/tree/pykd
In main, it is called SpacialQueries and lives in mod.rs. I realized it was a typo and should be "spatial" and fixed it in the pykd branch lol. Thank you for checking!
Sorry I found it. It was just a spelling discrepancy in the comment. Thanks
Sorry I found it. It was just a spelling discrepancy in the comment. Thanks
I am merging the progress in the branch into main. I made quite a few small changes in arkadia. Hopefully it shouldn't cause too much conflict
@abstractqqq @blaylockbk Wanted some input.
I have a working implementation of BallTree although the api right now is not usable.
So i can implement BallTree very similar to query_radius_ptwise
or query_knn_ptwise
But BallTree is somewhat expensive to compute and can be queried repeatedly.
I am looking for input on What is the best way to structure it similar to the Sklearn BallTree API. This would allow us to create a model that can be queried repeatedly. The challenge is It does not fit very well into the DataFrame / Plugin paradigm.
While we can provide an interface similar to query_radius_ptwise
or query_knn_ptwise
that will have to rebuild the tree for each query.
My thought process at the moment is I can leave my implementation in there and wrap a polars plugins compatible api. We can figure out how to provide a sklearn style interface later.
@AtlasPilotPuppy you are exactly right in that this kind of stuff doesn't fit in dataframe/plugin well, however, I would argue that for small datasets, it is quite convenient to query within a dataframe, and typically for a few thousand knn searches, the speed gain can already justify the convenience to use this in dataframes. If people truely want to do vector search on huge dataset, they should go for LanceDB and other more sophisticated ANN solutions that usually involves local storage and more complicated space partitions.
I think once you have the balltree data structure ready, I can make it available in dataframes as queries. Maybe only for haversine distance because most of the spatial queries can be done with kdtree already and kdtrees are somewhat cheaper.
My recent branch pykd (merged into main already) has an implementation of Kdtree that works with numpy input. You can find the Python wrapper in src/pymodels/py_kdt.rs and in python/polars_ds/kdtree.py. Strictly speaking we don't need the additional wrapper in kdtree.py, but that is a very thin layer and provides linter and additional documentation to users. All SciPy kdtree's methods are available there too, with some name changes. I think this is the direction you want to go.
Excellent . Thank you for the very quick response. I agree. With dataframes of a few thousand rows there isn't much difference. It only becomes relevant at larger scale.
Excellent . Thank you for the very quick response. I agree. With dataframes of a few thousand rows there isn't much difference. It only becomes relevant at larger scale.
And btw, unfortunately in order for Python Interop to work, you have to use OwnedLeaf because Python lifetime cannot be safely passed into Rust. So data has to be copied. Again, I don't think this is for large scale vector searches. So the copying is not a big concern.
This is a neat polars plug in. Thanks! Do you have plans for implementing the Ball Tree algorithm like in scikit-learn?
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.BallTree.html
The haversine distance metric to find nearest neighbors between latitude/longitude points on Earth would be very useful.