abstractqqq / polars_ds_extension

Polars extension for general data science use cases
MIT License
379 stars 26 forks source link

Implemented Ball Tree using Haversine distance as mentioned in #237 #263

Closed AtlasPilotPuppy closed 1 month ago

AtlasPilotPuppy commented 1 month ago

Resolves #237 This PR implements Ball Tree with Haversine and Euclidian distance.

Thie implmentation does not utilize the SpatialQueries Trait. But can be modified after some discussion on how to handle the differeing parameters This does implement all the function utilized in the query_knn.py There is demo code at the bottom of basics.ipynb

There are also some Rust tests to test the functioning of the ball tree. This ball tree was influenced by this repo.

The current approach was chosen since it creates a ball tree and remain in memory for iterative querying. This is very efficient in our case since these are applied to all rows of the dataframe. This is my first time writing a Polars extension so ideas and feedback is welcome.

abstractqqq commented 1 month ago

@AtlasPilotPuppy The package compiles and I am merging it, but I will make some changes. e.g. move the file for better organization, reuse some exsisting code, etc.. Thanks a lot for the work.

AtlasPilotPuppy commented 1 month ago

@abstractqqq much appreciated. Let me know if there are other issues you would like to look at. I wanted to reuse more but my lack of familiarity was a hindrance. Another thing is that the Leaf doesn't own the points array. Which leads to some interesting challenges. It may be something else to consider to have the leaf take the ownership of the points rather than the containing function.

abstractqqq commented 1 month ago

@abstractqqq much appreciated. Let me know if there are other issues you would like to look at. I wanted to reuse more but my lack of familiarity was a hindrance. Another thing is that the Leaf doesn't own the points array. Which leads to some interesting challenges. It may be something else to consider to have the leaf take the ownership of the points rather than the containing function.

There is a OwnedLeaf type FYI. For KNN style query in dataframe, I use the regular Leaf type because I don't need to copy data any more. Data is already copied from df to a row major slice. For the py_kdt model I expose to Python (in src/pymodels.rs), I use the OwnedLeaf type because when data is passed from Python to Rust, life time info is obscure. I think in general the lifetime system forces you to in-line some functions which you would normally factor out. This forces one to make only "really good factors" and in-line those that are only for convenience.. It has its pros and cons.

Thank you again and this is great!