abstractqqq / polars_ds_extension

Polars extension for general data science use cases
MIT License
315 stars · 21 forks

polars-learn #202

Open firmai opened 1 month ago

firmai commented 1 month ago

Always funny how these projects start: one goes from about 100 users who understand the tremendous opportunity of fast in-memory computing, and then 2-3 years later 10 million people heavily rely on your solution.

There is a lot of potential in your project, but it lacks a name, clarity (too much text in the readme), marketing, and objectives. I think Polars extensions and libraries built on top of Polars are the future of data science for the next 10 years.

Similar to rust-ml/linfa, I think you should aim for something much more grand. Why not seek active open-source sponsorship, corporate or otherwise?

There is currently no Polars sklearn. How could this be? It is currently the most obvious missing link in data science. For a DS geek it is a wheeled-suitcase moment: "We put wheels on bags after we put a man on the moon."

We need clustering, we need dimensionality reduction, feature selection, pairwise interactions; we need a Polars machine learning and data science project worth throwing yourself behind. Everybody is so busy creating LLMs to generate chicken casserole recipes; nobody is doing any actual work.

What are the hundreds of thousands of data scientists doing in their free time? Thanks for picking up the slack; you will be tomorrow's heroes.

abstractqqq commented 1 month ago

Hey thank you for the interest!

This is actually my post in our public discord:

Some random rambles:

I am currently integrating my own kd-tree into the package, instead of using a third-party Rust crate. It took me two days to fix a segfault bug that happens in very rare cases...

The kd-tree re-implementation is taking very long, but I want to do it because kd-trees are used in all kinds of exact KNN searches and entropy calculations. The reasons I want to develop my own implementation are simple: better speed, easier extension, and I get to learn how to run approximate vector search along the way, which can be hugely useful. I am now also very open to the idea of providing basic models outside Polars DataFrames, which would work on any 2D matrix. If we can make vector search in higher dimensions quick (approximate methods, in-memory use case), the implications will be huge. All that fun stuff keeps my mind bouncing...
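For readers who haven't worked with the data structure: a kd-tree recursively splits points on one axis at a time, which lets an exact nearest-neighbor search prune whole subtrees. A minimal Python sketch of the core idea (purely illustrative, not the package's Rust implementation):

```python
import numpy as np

class KDNode:
    __slots__ = ("point", "axis", "left", "right")
    def __init__(self, point, axis, left, right):
        self.point, self.axis = point, axis
        self.left, self.right = left, right

def build(points, depth=0):
    # Recursively split on the median along a cycling axis.
    if len(points) == 0:
        return None
    axis = depth % points.shape[1]
    points = points[points[:, axis].argsort()]
    mid = len(points) // 2
    return KDNode(points[mid], axis,
                  build(points[:mid], depth + 1),
                  build(points[mid + 1:], depth + 1))

def nearest(node, query, best=None):
    # Exact 1-NN search: descend the near side first, then visit the
    # far side only if the current best hypersphere crosses the split.
    if node is None:
        return best
    d = np.linalg.norm(node.point - query)
    if best is None or d < best[0]:
        best = (d, node.point)
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, query, best)
    if abs(diff) < best[0]:
        best = nearest(far, query, best)
    return best
```

The prune test `abs(diff) < best[0]` is what makes the search sub-linear on average; it asks whether a closer point could possibly live on the other side of the splitting plane.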

Since I will be making models now, this opens up tons of new items to work on. For starters, some users have asked me to implement variations of linear regression, e.g. ridge (L2-regularized), and a rolling regression engine. I am aware this is extremely useful for the econometrics and finance folks. I am still a little torn on this because polars_ols provides exactly this. The real problem is whether we should consolidate some efforts within the Polars plugin community, which is a harder topic to pursue.
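As an aside, ridge is the friendliest of those variations to implement, since it has a closed form. A NumPy sketch (no intercept or feature scaling, purely illustrative, not the plugin's solver):

```python
import numpy as np

def ridge(X, y, lam):
    # Closed-form ridge regression: solve (X'X + lam*I) b = X'y.
    # Center X and y beforehand if you need an intercept.
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)
```

With `lam = 0` this reduces to ordinary least squares; any positive `lam` strictly shrinks the coefficient norm, which is the whole point of the regularization.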

K-means has always been on my list, but I haven't had enough time to implement it. Actually, in my limited experience, I have had better results with k-medoids, a similar method that doesn't rely on properties of Euclidean geometry. There is an excellent Rust crate for that.
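For context, k-medoids restricts cluster centers to actual data points and only needs a pairwise dissimilarity matrix, which is why it doesn't depend on Euclidean geometry. A toy alternating sketch in Python (real implementations, e.g. the PAM family, are considerably more clever):

```python
import numpy as np

def k_medoids(D, k, n_iter=50, seed=0):
    # Alternating k-medoids on a precomputed dissimilarity matrix D (n x n).
    # Works with any dissimilarity, not just Euclidean distance.
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if members.size:
                # The new medoid is the member minimizing total
                # within-cluster dissimilarity.
                costs = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, np.argmin(D[:, medoids], axis=1)
```

Because only `D` is consulted, the same code handles Manhattan distance, correlation distance, or any custom dissimilarity.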

The more I keep imagining the future the more I cannot fall asleep.. The future seems bright, but there are a lot of uncertainties in life too... 

Regarding marketing, I tried some LinkedIn posts earlier in the year and got some stars. I have had terrible experiences with Reddit and X before, so I am not actively promoting the project on those platforms. I used to be more blunt and liked using strong words. When I voiced some of my unconventional opinions on programming, e.g. that OOP is not good for scientific computing, I got some terrible comments. I am still learning how to navigate the online tech world.

With the current polars-ds, you can actually do MRMR feature selection with many options for correlation (4 different correlations are readily available in the package right now). I cannot publish my version because it is used at my company, but if you look up MRMR feature selection online, the logic isn't hard to implement.

Traditional ML pipelines are also available, but they strictly apply only to data transformations before the data is consumed by a model. Personally, I think those two steps should be separate. You can find more in examples/.

PCA is also available in the package, in case you are not aware. See query_pca and query_principal_components.

Still a lot of work to do. Right now the focus is on regression. I am implementing Lasso with coordinate descent. Let's see how that goes.
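For anyone curious what Lasso coordinate descent involves: each coordinate is updated in turn by a soft-thresholded univariate fit against the current residual. A compact NumPy sketch, stopping when the updates are small (one common criterion; illustrative, not the package's implementation):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=1000, tol=1e-8):
    # Cyclic coordinate descent for (1/2n)||y - Xb||^2 + lam*||b||_1.
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n      # per-column curvature
    resid = y.copy()
    for _ in range(n_iter):
        max_delta = 0.0
        for j in range(p):
            old = beta[j]
            # Univariate fit of column j against its partial residual.
            rho = X[:, j] @ (resid + X[:, j] * old) / n
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
            resid -= X[:, j] * (beta[j] - old)
            max_delta = max(max_delta, abs(beta[j] - old))
        if max_delta < tol:                # "updates are small" stopping rule
            break
    return beta
```

The soft-threshold is what produces exact zeros: any coordinate whose partial correlation falls below `lam` is snapped to zero rather than merely shrunk.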

:)

firmai commented 1 month ago

Hopefully you have seen this solution https://github.com/azmyrajab/polars_ols

You should get your company on board. OS packages are becoming the best marketing for great talent acquisition.

abstractqqq commented 1 month ago

Yep, I am fully aware of polars_ols. The main thing is that I do not want to introduce a dependency on third-party BLAS or LAPACK distributions; it's a nightmare configuring all of that. I am betting on faer-rs, which is an alternative to those old C/Fortran linear algebra libraries. I take dependencies very seriously. You can read up on how SciPy almost couldn't compile for Python 3.12 because of an old Fortran dependency. So far, for linear regression and SVD, the speed is on par and even better in some cases, and the author seems very knowledgeable.

abstractqqq commented 1 month ago

@firmai By the way, I am an NYU alumnus :)

firmai commented 1 month ago

The functime developers don't seem that interested in the package; they still haven't updated for Polars 1. Don't you think it would be better to absorb it into your package? https://github.com/functime-org/functime/issues/250

abstractqqq commented 1 month ago

> The functime developers don't seem that interested in the package; they still haven't updated for Polars 1. Don't you think it would be better to absorb it into your package? https://github.com/functime-org/functime/issues/250

Yes and no. I have been adding tsfresh-style features slowly. Currently they are scattered around num.py and stats.py, and I haven't consolidated them. Functime did a huge project of rewriting most tsfresh features, and I was part of that project; I did a lot of performance testing and wrote more optimal queries for many of the features. So yes, I can take care of the feature extraction easily. I have more than half of what tsfresh and functime offer. And yes, again, the low recognition is because I am not actively marketing... Sigh...

Although I like functional programming, I find it hard to track state in functime's transforms, and I am increasingly feeling that classes are fine as long as they are shallow and serve one focus.

Time series transforms can be very different from traditional tabular ML transforms. I do not do time series a lot in my work and am mostly learning along the way, so it will take some time. I can go and ask the maintainers about the future of functime.

firmai commented 3 weeks ago

> Yep, I am fully aware of polars_ols. The main thing is that I do not want to introduce a dependency on third-party BLAS or LAPACK distributions; it's a nightmare configuring all of that. I am betting on faer-rs, which is an alternative to those old C/Fortran linear algebra libraries. I take dependencies very seriously. You can read up on how SciPy almost couldn't compile for Python 3.12 because of an old Fortran dependency. So far, for linear regression and SVD, the speed is on par and even better in some cases, and the author seems very knowledgeable.

How is your OLS coming along?

abstractqqq commented 3 weeks ago

Ridge, Lasso, Rolling, and Recursive were added in v0.5.1 (a buggy version of Ridge was introduced in v0.5.0). I have made some changes and improvements to all of these since the release. You can find them in the docs here:

https://polars-ds-extension.readthedocs.io/en/latest/num.html#polars_ds.num.query_rolling_lstsq

More null_policies would be good, but that can be tricky...
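To pin down what a rolling least-squares query computes: each output row is the OLS fit on the trailing window ending at that row. A naive refit-per-window NumPy sketch of the semantics (a real kernel would update X'X and X'y incrementally rather than refitting every window):

```python
import numpy as np

def rolling_lstsq(X, y, window):
    # Rolling OLS semantics: for each row i >= window - 1, fit OLS on
    # rows [i - window + 1, i]. Earlier rows are NaN (window not full).
    n = len(y)
    out = np.full((n, X.shape[1]), np.nan)
    for i in range(window, n + 1):
        Xi, yi = X[i - window:i], y[i - window:i]
        out[i - 1], *_ = np.linalg.lstsq(Xi, yi, rcond=None)
    return out
```

This naive version is O(n * window); the incremental formulation brings the per-step cost down to the cost of one rank-one update and downdate.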

We also developed some benchmarks vs. sklearn. These are not strictly apples-to-apples, because the default "solver" may not be the same. For Lasso, I am also not minding the dual gap at this moment, because I do not understand it well enough. (Also, practically, I think it is enough to stop coordinate descent when the updates are small.) Anyway, using the defaults, we have some good numbers:

[benchmark image: polars-ds vs. sklearn timings]

Standalone modules for rolling and recursive are hard, because of the lack of interop support between NumPy and faer-rs.

abstractqqq commented 2 days ago

Turns out standalone linear regression is EASY. A regular LR (linear regression) class and an online LR class have been implemented.
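An online LR class is typically built on recursive least squares, which updates the coefficients and the inverse Gram matrix one observation at a time. Whether the package uses exactly this recursion is not stated; a NumPy sketch of the standard form:

```python
import numpy as np

class OnlineLR:
    # Recursive least squares: after each (x, y) pair, beta equals the
    # (slightly regularized) batch OLS fit on all data seen so far.
    def __init__(self, n_features, ridge=1e-6):
        self.P = np.eye(n_features) / ridge   # running inverse of X'X
        self.beta = np.zeros(n_features)

    def update(self, x, y):
        Px = self.P @ x
        k = Px / (1.0 + x @ Px)               # gain vector
        self.beta += k * (y - x @ self.beta)  # correct by prediction error
        self.P -= np.outer(k, Px)             # Sherman-Morrison downdate
```

Each update is O(p^2) with no matrix inversion, which is what makes the online variant cheap compared to refitting.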

I also added weighted least squares as an option in query_lstsq, and linear regression with rcond.
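Weighted least squares itself is a small extension of ordinary lstsq: scale each row by the square root of its weight and solve as usual. An illustrative NumPy sketch (the function name below is made up for illustration; it does not claim to match the query_lstsq option's internals):

```python
import numpy as np

def weighted_lstsq(X, y, w):
    # Weighted least squares: minimize sum_i w_i * (y_i - x_i @ b)^2,
    # solved by rescaling rows by sqrt(w_i) and calling a plain solver.
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef
```

A weight of zero removes an observation entirely, which is one easy way to reason about null handling in a weighted fit.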

A new version will be released this coming weekend, and then I would like to take a break from linear regression. Next up are likely k-means, a standalone kd-tree, and the ball-tree algorithm.