azmyrajab / polars_ols

Polars least squares extension - enables fast linear model polar expressions
MIT License

Feature Request: Output residual statistics and evaluation measures #20

Open stout-yeoman opened 2 months ago

stout-yeoman commented 2 months ago

Feature Description

I propose the integration of residual statistics and evaluation measures for regression models in the polars-ols package. This feature would include metrics such as R-squared, RMSE (Root Mean Square Error), MAE (Mean Absolute Error), as well as residual statistics such as skew and kurtosis, essential tools for diagnosing the fit of regression models.

Motivation

Currently, after fitting a regression model using polars-ols, users need to manually calculate performance metrics or use external libraries to evaluate model fit. Especially for rolling OLS, computing these metrics in a separate pass after the coefficients have already been calculated is extremely inefficient. Integrating these metrics directly into polars-ols would streamline the analysis process, making it more efficient and accessible, especially for new users or those integrating polars into their data science workflows.

Suggested Implementation

Similarly to coefficients, residual statistics and evaluation measures can be output as structs. Users could indicate which of these they want included by passing a list of "modes" or flag arguments.
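To make the idea concrete, here is a minimal pure-Python sketch of the "modes" selection. It is not polars-ols code: the function name `residual_stats`, the `modes` argument, and the mode names are all hypothetical, and a plain dict stands in for the struct output. Skew and kurtosis use the biased (population) moment definitions, with kurtosis reported as excess kurtosis.

```python
import math


def residual_stats(y_true, y_pred, modes=("r2", "rmse", "mae", "skew", "kurtosis")):
    """Return only the requested residual statistics as a dict.

    Hypothetical helper: the dict mimics a struct with one field per mode.
    Assumes y_true is not constant and residuals are not all identical
    (otherwise r2 / skew / kurtosis would divide by zero).
    """
    n = len(y_true)
    resid = [a - b for a, b in zip(y_true, y_pred)]
    mean_y = sum(y_true) / n
    mean_r = sum(resid) / n

    ss_res = sum(r * r for r in resid)               # residual sum of squares
    ss_tot = sum((a - mean_y) ** 2 for a in y_true)  # total sum of squares

    # Biased central moments of the residuals.
    m2 = sum((r - mean_r) ** 2 for r in resid) / n
    m3 = sum((r - mean_r) ** 3 for r in resid) / n
    m4 = sum((r - mean_r) ** 4 for r in resid) / n

    # Each statistic is computed lazily, only if its mode was requested.
    available = {
        "r2": lambda: 1.0 - ss_res / ss_tot,
        "rmse": lambda: math.sqrt(ss_res / n),
        "mae": lambda: sum(abs(r) for r in resid) / n,
        "skew": lambda: m3 / m2 ** 1.5,
        "kurtosis": lambda: m4 / m2 ** 2 - 3.0,  # excess kurtosis
    }

    unknown = set(modes) - set(available)
    if unknown:
        raise ValueError(f"unknown modes: {sorted(unknown)}")
    return {name: available[name]() for name in modes}


# Usage: ask for two of the five statistics.
stats = residual_stats([1, 2, 3, 4], [1.5, 1.5, 3.5, 3.5], modes=["r2", "rmse"])
```

In an expression-based API, the same selection could decide which fields appear in the returned struct, so users pay only for the statistics they request.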

Benefits

This feature would significantly enhance the utility of polars-ols by providing critical tools for model validation and diagnostics directly within the library. It would improve user experience, reduce dependency on other libraries for model evaluation, and could increase adoption of polars-ols in academic and professional settings.

Happy to discuss this proposal further and contribute to testing.

azmyrajab commented 2 months ago

Hi @stout-yeoman, thanks for the feature suggestion - I think adding some core statistical metrics like these makes sense.

We can add residual statistics like R2 / RMSE / etc. and also add core statistics on features, like t-values + standard errors / confidence bands, which might be handy too. The output would be a struct with descriptive field names and either float or list values, depending on whether it’s a residual or feature statistic (so similar to what you hinted at).
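As a rough illustration of that output shape, the NumPy sketch below computes the textbook closed-form OLS statistics and returns them in a dict, with floats for residual statistics and lists for per-feature statistics. It is only a sketch under stated assumptions: the function name `ols_with_feature_stats` and the field names are hypothetical, it is not the polars-ols implementation, and it assumes a full-rank design matrix with more rows than columns.

```python
import numpy as np


def ols_with_feature_stats(X, y):
    """Closed-form OLS with residual and per-feature statistics.

    Hypothetical helper; assumes X has full column rank and n > p.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape

    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y          # coefficient estimates
    resid = y - X @ beta
    rss = float(resid @ resid)        # residual sum of squares

    sigma2 = rss / (n - p)            # unbiased residual variance estimate
    std_err = np.sqrt(sigma2 * np.diag(xtx_inv))
    t_values = beta / std_err

    ss_tot = float(np.sum((y - y.mean()) ** 2))
    return {
        # residual statistics: one float each
        "r2": 1.0 - rss / ss_tot,
        "rmse": float(np.sqrt(rss / n)),
        # feature statistics: one value per column of X
        "coefficients": beta.tolist(),
        "std_errors": std_err.tolist(),
        "t_values": t_values.tolist(),
    }


# Usage: intercept column plus one feature.
X = [[1, 0], [1, 1], [1, 2], [1, 3]]
y = [1, 3, 2, 4]
result = ols_with_feature_stats(X, y)
```

Confidence bands would follow from `coefficients` and `std_errors` together with a t-distribution quantile, which is omitted here to keep the sketch dependency-light.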

This shouldn’t be too bad to write, with the caveat that for L1 and non-negative models the statistics produced would need to be taken as approximate, since those models violate some of the assumptions that a straightforward closed-form OLS/WLS/Ridge implementation of the statistics makes. We can perhaps log a warning for those. For rolling models, inverse(XᵀX) needs to be updated at each time step, which can be done efficiently but would require a little more effort. Perhaps a good starting point is to implement this for non-moving-window models first, then extend it to the rolling/recursive models next (or implement only residual statistics for those).
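One standard way to update the inverse per time step without re-inverting is the Sherman-Morrison rank-one formula: add the row entering the window, then remove the row leaving it. The NumPy sketch below shows that technique; it is an assumption about how this could be done, not a description of what polars-ols does internally, and the function names are hypothetical. It assumes every window's Gram matrix is invertible.

```python
import numpy as np


def sherman_morrison(a_inv, x, sign):
    """Return inverse(A + sign * x xᵀ) given a_inv = inverse(A), A symmetric.

    sign=+1.0 adds an observation row x, sign=-1.0 removes one.
    """
    ax = a_inv @ x
    denom = 1.0 + sign * (x @ ax)
    return a_inv - sign * np.outer(ax, ax) / denom


def rolling_xtx_inverse(X, window):
    """Inverse Gram matrix for each full window, updated in O(p^2) per step."""
    X = np.asarray(X, dtype=float)
    first = X[:window]
    inv = np.linalg.inv(first.T @ first)  # direct inverse only once
    out = [inv]
    for t in range(window, len(X)):
        inv = sherman_morrison(inv, X[t], +1.0)           # row entering the window
        inv = sherman_morrison(inv, X[t - window], -1.0)  # row leaving the window
        out.append(inv)
    return out


# Usage: 30 observations, 3 features, window of 10 rows.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
inverses = rolling_xtx_inverse(X, window=10)
```

Adding the new row before removing the old one keeps the intermediate matrix well conditioned. Over long series, floating-point error can accumulate, so a periodic direct re-inversion may be worth considering.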

I expect to be bandwidth constrained over the next few days, but I will try to have a go at this feature request when possible and share a draft pull request with you, which you could opine on / test locally.