StatMixedML / XGBoostLSS

An extension of XGBoost to probabilistic modelling
https://statmixedml.github.io/XGBoostLSS/
Apache License 2.0

Reducing `install_requires` to minimum (& expand `extras_require`) and looser version ranges #56

Open gmgeorg opened 1 year ago

gmgeorg commented 1 year ago

At a quick glance, the current setup.py lists every dependency as a hard requirement, each with a very specific version range. If at all possible, it would greatly improve compatibility with existing Python repos (and let more people use the package without having to resolve conflicts) if install_requires specified only the absolutely required modules (e.g., plotting or optuna is not really required to use this great package) and only the minimum version needed (>=), instead of the (approximate) ~= range.

See also first accepted answer here: https://stackoverflow.com/questions/6947988/when-to-use-pip-requirements-file-versus-install-requires-in-setup-py

I'm curious whether this package has to be so specific/restrictive about its dependencies. For the optional ones, I'd suggest something like the approach described here: https://stackoverflow.com/questions/10572603/specifying-optional-dependencies-in-pypi-python-setup-py
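As a rough sketch of what such a split could look like: the package names in the optional groups are taken from this thread, but the core list, the group names, and all version bounds below are illustrative assumptions, not the project's actual requirements.

```python
# Sketch of a setup.py layout with a minimal core and optional extras.
# The core list and every lower bound here are hypothetical placeholders.

# Core: only what training and prediction need, with open-ended lower bounds.
# For comparison, a compatible-release pin such as `~=2.0.3` means
# `>=2.0.3, ==2.0.*`, whereas `>=2.0.3` leaves the upper end open.
INSTALL_REQUIRES = [
    "xgboost>=2.0",   # hypothetical lower bound
    "numpy>=1.20",    # hypothetical lower bound
]

# Optional groups, installed via e.g. `pip install xgboostlss[plotting]`.
# The group names are assumptions.
EXTRAS_REQUIRE = {
    "tuning": ["optuna"],
    "plotting": ["matplotlib", "seaborn"],
    "explain": ["shap"],
}

# "full" is the union of all optional groups.
EXTRAS_REQUIRE["full"] = sorted(
    {pkg for group in EXTRAS_REQUIRE.values() for pkg in group}
)

# In setup.py these would be passed on as:
# setup(..., install_requires=INSTALL_REQUIRES, extras_require=EXTRAS_REQUIRE)
```

With this layout, `pip install xgboostlss` would pull in only the core list, while `pip install xgboostlss[full]` would add everything in the optional groups.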

StatMixedML commented 1 year ago

Thanks for your suggestion.

As of now, model.py, among others, also imports optuna and the plotting packages, so everything listed in setup.py is currently required. I will go through the references you provided. Thanks.

gmgeorg commented 1 year ago

Here is a better reference for offering pip install xgboostlss (minimum requirements) vs. pip install xgboostlss[full], or something along those lines:

https://stackoverflow.com/questions/6237946/optional-dependencies-in-distutils-pip

I looked through the code and yes, the implementation currently requires optuna, seaborn, matplotlib, pandas, and shap (I think those are all the ones I found that are not critical for the core functionality of XGBoostLSS.train() / .predict()). My suggestion was to decouple these in order to avoid the requirement. In particular, making plotting/tuning methods (!) of the core class ties them to the core functionality unnecessarily; if tuning and plotting were instead stand-alone functions that take an XGBoostLSS object as input, they could be decoupled.

This also has the benefit that whenever you add nicer/better/more visualizations, users can just update the package and call the new functions on an existing object (loaded from disk, for example); they don't have to instantiate the class and train/tune it again just to use new plotting functionality.

Basically, the proposal would be something along the lines of:
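A minimal sketch of this decoupling, with hypothetical class and function names (this is not the actual XGBoostLSS API): the core class only trains and predicts, and plotting lives in a stand-alone function that imports matplotlib only when called.

```python
# Hypothetical names throughout; this is not the XGBoostLSS API.

class CoreModel:
    """Stand-in for the core class: training and prediction only,
    with no tuning or plotting imports."""

    def __init__(self):
        self.mean_ = None

    def train(self, y):
        # Toy "training": remember the sample mean.
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, n):
        if self.mean_ is None:
            raise RuntimeError("call train() first")
        return [self.mean_] * n


def plot_predictions(model, n):
    """Stand-alone plotting helper that takes a fitted model as input."""
    # Optional dependency, imported only when the helper is actually called.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot(range(n), model.predict(n))
    ax.set_xlabel("observation")
    ax.set_ylabel("prediction")
    return fig


model = CoreModel().train([1.0, 2.0, 3.0])
print(model.predict(2))  # core functionality works without matplotlib
```

Because `plot_predictions` only needs an object with a `predict` method, a new helper of this kind could be added later and applied to a model that was trained and saved earlier.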

This would ensure that the core functionality in xgboostlss -- which AFAIU is training & predicting -- depends only on the absolute minimum requirements, while any add-on helper functions you provide (like tuning or plotting) are optional to install and use. To handle the case where the dependencies are not available, you can then do something like here to let users know they need to install xgboostlss[full] to access the tuning/plotting functionality.
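One possible way to handle a missing optional dependency is a small import helper that re-raises with an installation hint. The helper name `require_optional` is hypothetical, and `full` is the extra name suggested in this thread, not an extra the package is known to define.

```python
import importlib


def require_optional(module_name, extra="full", package="xgboostlss"):
    """Import an optional module, or fail with a hint on how to install it.

    Hypothetical helper; the default `extra` and `package` values are
    assumptions based on the naming proposed in this thread.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f"Install it with: pip install {package}[{extra}]"
        ) from err


# Usage with a standard-library module, so the example runs anywhere.
json_module = require_optional("json")
print(json_module.dumps({"ok": True}))

# A tuning or plotting helper would call e.g. require_optional("optuna")
# at the top of the function body rather than at module import time.
```

Calling the helper inside the tuning/plotting functions, rather than at the top of the module, keeps a plain `import xgboostlss` working when only the core requirements are installed.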

fkiraly commented 8 months ago

I would agree with you, @gmgeorg.

Wider version ranges ensure usability, whereas frequent bumps of lower bounds prevent compatibility with a wider range of systems.

Generally, I would strongly advise the same, separating out the tuning from the raw model - like in skpro where you have probabilistic tuners separate from the "atomic" regression models.

I would be happy to help with a little refactoring of this package, possibly copying some of the more general visualization and tuning logic to skpro? If @StatMixedML is ok with this, would you be up for working on this together, @gmgeorg?