abstractqqq / polars_ds_extension

Polars extension for general data science use cases
MIT License

Feature suggestion #127

Open 1112114641 opened 2 months ago

1112114641 commented 2 months ago

Hi,

I had a quick look and quite like the library. I have a suggestion to extend it, specifically ols.rs / num.py: the change takes query_lstsq() from predicting the current value to predicting pred_dist steps ahead. It also lets you switch on the fly between linear, quadratic, ..., and general polynomial prediction using the order kwarg.
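As a rough statement of the math, assuming the code below is read as an ordinary least-squares polynomial fit (the symbols $n$, $p$, $d$, and $\beta$ are my own notation, with $p$ = order and $d$ = pred_dist):

```latex
% Fit: polynomial of order p through the points (x_i, y_i), i = 1..n
\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^{p+1}}
  \sum_{i=1}^{n} \Bigl( y_i - \sum_{k=0}^{p} \beta_k x_i^{k} \Bigr)^{2}

% Predict: evaluate the fitted polynomial d steps past the largest x
\hat{y} = \sum_{k=0}^{p} \hat{\beta}_k \, \bigl( \max_i x_i + d \bigr)^{k}
```

With order = 1 this reduces to extrapolating a fitted straight line; higher orders add the corresponding powers of x as extra columns of the design matrix.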

use polars::prelude::*;
use polars::{
  datatypes::DataType,
  error::PolarsResult,
  series::Series,
};
use pyo3_polars::derive::polars_expr;
use ndarray::{Array, Array2, Dim, ShapeError};
use serde::Deserialize;
use ndarray_linalg::LeastSquaresSvdInto;

#[derive(Deserialize, Debug)]
pub(crate) struct LstsqKwargs {
    pub(crate) order: u8,
    pub(crate) pred_dist: f64,
}

fn pred_output(_: &[Field]) -> PolarsResult<Field> {
    Ok(Field::new("pred", DataType::Float64))
}
// fn pred_coef_output(_: &[Field]) -> PolarsResult<Field> {
//   Ok(Field::new("coeffs",DataType::List(Box::new(DataType::Float64)),))
// }

/// Converts a series to a 1-D f64 array, casting first so non-float columns do not panic.
/// Assumes the series contains no nulls.
#[inline(always)]
fn series_to_array1(series: &Series) -> PolarsResult<Array<f64, Dim<[usize; 1]>>> {
    let casted = series.cast(&DataType::Float64)?;
    let y_data = casted.f64()?.into_no_null_iter().collect::<Vec<_>>();
    Ok(Array::from_vec(y_data))
}

/// Builds the Vandermonde matrix of x with `degree` columns.
/// Assumes the series contains no nulls.
#[inline(always)]
fn series_to_vandermonde_array2(series: &Series, degree: usize) -> PolarsResult<Array2<f64>> {
    let nrow = series.len();
    let casted = series.cast(&DataType::Float64)?;
    // Row i is [1, x_i, x_i^2, ..., x_i^(degree - 1)], flattened in row-major order.
    let data = casted
        .f64()?
        .into_no_null_iter()
        .flat_map(|val| (0..degree).map(move |pow| val.powi(pow as i32)))
        .collect::<Vec<_>>();

    Array2::from_shape_vec((nrow, degree), data)
        .map_err(|e: ShapeError| PolarsError::ComputeError(e.to_string().into()))
}

/// Takes `inputs` with x at index 0 and y at index 1, and returns the tuple
/// (Vandermonde matrix of x with `order` columns, y as a vector).
#[inline(always)]
fn mask_n_into_mat2(inputs: &[Series], order: usize) -> PolarsResult<(Array2<f64>, Array<f64, Dim<[usize; 1]>>)> {
    let mat = series_to_vandermonde_array2(&inputs[0], order)?;
    let y = series_to_array1(&inputs[1])?;
    Ok((mat, y))
}

// #[polars_expr(output_type=Float64)]
#[polars_expr(output_type_func=pred_output)]
// #[polars_expr(output_type_func=pred_coef_output)]
pub fn lstsq_pred(inputs: &[Series], kwargs: Option<LstsqKwargs>) -> PolarsResult<Series> {
    let kwargs = kwargs.unwrap_or_else(|| LstsqKwargs {
        order: 1, // Default to linear if not specified
        pred_dist: 1.0,
    });

    let order = kwargs.order + 1; // Add 1 to order to account for the constant term

    // Build the Vandermonde design matrix from x and the target vector from y.
    let (mat, y) = mask_n_into_mat2(inputs, order as usize)?;

    // Solve the least-squares problem via SVD; report solver failures as a Polars error instead of panicking.
    let result = mat
        .least_squares_into(y)
        .map_err(|e| PolarsError::ComputeError(e.to_string().into()))?
        .solution;

    // Evaluate the fitted polynomial at max(x) + pred_dist.
    let x_max: f64 = inputs[0]
        .cast(&DataType::Float64)?
        .max()?
        .ok_or_else(|| PolarsError::ComputeError("x has no non-null values".into()))?;
    let x_pow_vec = (0..order).map(|i| (x_max + kwargs.pred_dist).powi(i as i32)).collect::<Vec<_>>();
    let pred = Array::from_vec(x_pow_vec).dot(&result);

    // debug helper
    // let mut coef_build: ListPrimitiveChunkedBuilder<Float64Type> = ListPrimitiveChunkedBuilder::new("coefs", 1, result.len(), DataType::Float64);
    // coef_build.append_slice(&result.iter().map(|&val| val).collect::<Vec<f64>>());
    // let out = coef_build.finish();
    // Ok(out.into_series())

    Ok(Series::new("pred", vec![pred]))
}

and, on the Python side:

def lstsq_pred(
  x: IntoExpr,
  y: IntoExpr,
  order: int,
  pred_dist: float,
) -> pl.Expr:
  """
  This is an aggregation, hence returns_scalar=True and is_elementwise=False.

  order = 1 -> linear

  order = 2 -> quadratic

  order = 3 -> cubic

  Args:
      x (IntoExpr): string or pl.Expr
      y (IntoExpr): string or pl.Expr
      order (int): linear = 1, quadratic = 2, cubic = 3, ...
      pred_dist (float): timesteps to predict into the future

  Returns:
      pl.Expr: the predicted value at max(x) + pred_dist
  """
  return register_plugin_function(
    args=[str_to_expr(x), str_to_expr(y)],
    is_elementwise=False,
    returns_scalar=True,
    function_name="lstsq_pred",
    plugin_path=Path(__file__).parent,
    kwargs={"order": order, "pred_dist": pred_dist},
  )
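For comparison, the computation can be sketched in dependency-free Python. This is only a reference sketch, not part of the proposal: the helper name `lstsq_pred_reference` is hypothetical, and it solves the normal equations with Gaussian elimination rather than the SVD solver used in the Rust code, so it is only suitable for small, well-conditioned inputs.

```python
def lstsq_pred_reference(x, y, order=1, pred_dist=1.0):
    """Fit a polynomial of the given order to (x, y) by least squares and
    evaluate it at max(x) + pred_dist. Hypothetical reference helper."""
    n_coef = order + 1  # +1 for the constant term
    if len(x) != len(y) or len(x) < n_coef:
        raise ValueError("need equally long x and y with at least order + 1 points")

    # Vandermonde rows [1, x_i, x_i^2, ..., x_i^order]
    rows = [[xi ** k for k in range(n_coef)] for xi in x]

    # Normal equations (V^T V) beta = V^T y, stored as an augmented matrix.
    # Assumption: this replaces the SVD-based solve of the Rust version.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(n_coef)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(n_coef)
    ]

    # Gaussian elimination with partial pivoting
    for col in range(n_coef):
        pivot = max(range(col, n_coef), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: not enough distinct x values")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n_coef):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n_coef + 1):
                aug[r][c] -= factor * aug[col][c]

    # Back substitution for the coefficients beta_0 .. beta_order
    beta = [0.0] * n_coef
    for i in reversed(range(n_coef)):
        tail = sum(aug[i][j] * beta[j] for j in range(i + 1, n_coef))
        beta[i] = (aug[i][n_coef] - tail) / aug[i][i]

    # Evaluate the fitted polynomial pred_dist steps past the largest x
    x_new = max(x) + pred_dist
    return sum(b * x_new ** k for k, b in enumerate(beta))


# Data on the line y = 2x + 1: one step past x = 3 should give roughly 9.
linear_pred = lstsq_pred_reference([0, 1, 2, 3], [1, 3, 5, 7], order=1, pred_dist=1.0)

# Data on the parabola y = x^2: two steps past x = 3 should give roughly 25.
quad_pred = lstsq_pred_reference([0, 1, 2, 3], [0, 1, 4, 9], order=2, pred_dist=2.0)
```

Both calls use data that lies exactly on a polynomial of the requested order, so the extrapolated value should match that polynomial up to floating-point error.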

cheers, 1112114641

abstractqqq commented 2 months ago

Thanks for the request! Do you have a blog post or something similar where I can read about this? Reading dry code doesn't help me understand the topic very well.

1112114641 commented 2 months ago

Um, work in progress 😅

abstractqqq commented 2 months ago

Um, work in progress 😅

Let me know once it is more or less ready! Also I would like some references to the topics you are implementing. That would help me greatly and I can then start working on this too.