abstractqqq / polars_ds_extension

Polars extension for general data science use cases
MIT License
390 stars, 27 forks

Include Additional Rust-based Features/Functionalities #11

Closed. abstractqqq closed this issue 7 months ago

abstractqqq commented 1 year ago

Should this package include features/functionalities that may not be directly related to Polars Extensions? Feel free to leave a comment.

E.g. functions that take in a Polars DataFrame and return its singular value decomposition (SVD) or PCA-related data, functions that return eigenvalues, etc.

The main issue is that I have two Polars-based packages for data science: polars-ds (alpha) and dsds (pre-alpha). DSDS provides data screening, data problem detection, feature selection, and transformers, whereas polars-ds currently provides only Polars extensions. Both packages have Rust modules, which leads to code duplication and makes Rust code harder to share between the two.

Pro:

  1. Keep more foundational algorithms in one place. The binary is also smaller than with Rust modules in two packages.
  2. To reach escape velocity in the data science world, we probably need lots of features bundled together, whether or not those features are related to each other.
  3. The other package, dsds, would no longer need a Rust backend; it could just import from polars-ds. DSDS could then focus purely on the data front-end: things achievable with simple Polars queries and non-time-critical functions. All Rust code would live in polars-ds and be exported to DSDS. Polars-ds would focus more on performance and DSDS more on UX, which is a better separation of priorities.
  4. Easier access to Rust functions, because they would all come from the same module. Let's say we have an expression that does nearest neighbor search:
    pl.col("id").num_ext.knn(pl.col("x1"), pl.col("x2"), k=3, metric="euclidean") 
    # Search for 3 nearest neighbors for each id, based on Euclidean distance using x1 and x2, and return the ids as a list. 

    Such an algorithm will involve a kd-tree. But the same kd-tree algorithm can also be used for other tasks, which may not be DataFrame-based. If we keep two Rust modules for polars-ds and dsds, we will have to repeat the kd-tree implementation for every use case.
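To make the sharing argument in point 4 concrete, here is a minimal pure-Python sketch of one kd-tree serving both a column-based query and an ad-hoc, non-DataFrame query. It is only an illustration: the proposal is to do this in Rust, and the names `KDNode`, `build_kdtree`, and `knn` are hypothetical, not part of polars-ds or dsds. The toy data and the choice to count each point as its own nearest neighbor are also assumptions.

```python
import heapq
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


class KDNode:
    """One node of a kd-tree: a point, its row index, and the split axis."""

    def __init__(self, point: Point, index: int, axis: int,
                 left: "Optional[KDNode]", right: "Optional[KDNode]") -> None:
        self.point, self.index, self.axis = point, index, axis
        self.left, self.right = left, right


def build_kdtree(points: Sequence[Point],
                 indices: Optional[List[int]] = None,
                 depth: int = 0) -> Optional[KDNode]:
    """Build a kd-tree by splitting on the median, cycling through the axes."""
    if indices is None:
        indices = list(range(len(points)))
    if not indices:
        return None
    axis = depth % len(points[0])
    indices = sorted(indices, key=lambda i: points[i][axis])
    mid = len(indices) // 2
    return KDNode(points[indices[mid]], indices[mid], axis,
                  build_kdtree(points, indices[:mid], depth + 1),
                  build_kdtree(points, indices[mid + 1:], depth + 1))


def knn(root: Optional[KDNode], query: Point, k: int) -> List[int]:
    """Return row indices of the k nearest points (Euclidean), closest first."""
    if k <= 0:
        return []
    heap: List[Tuple[float, int]] = []  # max-heap on distance via negation

    def visit(node: Optional[KDNode]) -> None:
        if node is None:
            return
        d2 = sum((a - b) ** 2 for a, b in zip(node.point, query))
        if len(heap) < k:
            heapq.heappush(heap, (-d2, node.index))
        elif d2 < -heap[0][0]:
            heapq.heapreplace(heap, (-d2, node.index))
        diff = query[node.axis] - node.point[node.axis]
        near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
        visit(near)
        # Only cross the splitting plane if a closer point could lie beyond it.
        if len(heap) < k or diff * diff < -heap[0][0]:
            visit(far)

    visit(root)
    return [index for _, index in sorted(heap, reverse=True)]


# Use case 1: column-style data, mimicking "k nearest ids for each id".
ids = ["a", "b", "c", "d", "e"]
x1 = [0.0, 1.0, 5.0, 6.0, 0.0]
x2 = [0.0, 0.0, 5.0, 5.0, 2.0]
points = list(zip(x1, x2))
tree = build_kdtree(points)
neighbors_by_id = {
    ids[i]: [ids[j] for j in knn(tree, p, k=2)] for i, p in enumerate(points)
}
print(neighbors_by_id)

# Use case 2: the same tree answers a query that has nothing to do with columns.
print(ids[knn(tree, (5.4, 5.0), k=1)[0]])  # prints "c"
```

Both calls go through the same `build_kdtree` and `knn` functions, which is the point of keeping one shared implementation instead of one per package.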

Cons:

  1. Longer building time for this package.
  2. This package then will no longer have a single focus.
  3. A lot of refactoring effort (help will definitely be appreciated).

minghao51 commented 9 months ago

I couldn't find the repo for dsds (pre-alpha). Still, I think most people will focus on the data front-end (at the level of sklearn etc.) rather than on the foundational algorithms, so ...

+1 for merging.

I could try to help a bit with the refactoring too, although I have mostly worked on the python-polars side only.

abstractqqq commented 9 months ago

DSDS is one of my older repos, and I do not plan to maintain it any longer. Its functionality will be merged into polars-ds bit by bit. My plan is to start doing that after v.4.0. I think I know how to write almost all traditional ML transformers in pure Polars, and I have the infrastructure ready to support Polars-native dependency detection, data cleaning, data drift, and other EDA tasks.