elephaint / pgbm

Probabilistic Gradient Boosting Machines
Apache License 2.0

How to pull the parameters (mean and standard deviation) of the distribution fitted? #3

Closed flippercy closed 3 years ago

flippercy commented 3 years ago

Hi:

Thank you for the awesome library! I did some tests with it and have a few questions:

  1. How can I extract the parameters, such as the mean and standard deviation, of the final fitted distribution for each leaf? This information is extremely helpful when presenting and explaining results to stakeholders. Currently the model only returns values sampled from the distribution, but business users are likely to focus on the distribution itself.

  2. Is there any way to output the model's tree structure to a data frame, like get_dump() does for xgboost?

Thank you!

elephaint commented 3 years ago

Hi,

Thanks for the kind words, much appreciated!

  1. That's a good remark, and something I was planning on making available too. It's currently not possible to easily extract those, but it should be an easy fix. I'll have a look at it in the coming days and I expect to make it available either through a separate function or by adding a parameter to the predict / predict_dist functions.
  2. I'm not familiar with the get_dump() from xgboost, but the 'model.save' method saves all the required arrays for the tree. In particular, there are six arrays of importance for the tree structure:

Note 1) max_nodes = max_leaves - 1
Note 2) For prediction, the arrays 'bins' ([n_features x n_bins] float32) and the initial prediction (yhat_0, float32 scalar) are also required. These are saved too when calling 'model.save'.

Hope this helps! I'll have a look at implementing 1) in the coming days.

Kind regards,

Olivier

flippercy commented 3 years ago

@elephaint Thank you for the quick turnaround! Looking forward to the new features in the upgrade.

elephaint commented 3 years ago

Hi,

  1. This is now available as part of the predict_dist function. If you call yhat_dist = model.predict_dist(X_test, output_sample_statistics=True), the function returns a tuple (forecasts, mean, variance), where the latter two are the learned mean and variance per sample. These can subsequently be used to specify a distribution of your choice.
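As a rough illustration of "specifying a distribution of your choice" from the returned statistics, here is a minimal sketch that assumes a Normal distribution and uses only the standard library. The `mean` and `variance` lists are placeholder values standing in for the per-sample output of predict_dist, and `normal_intervals` is a hypothetical helper, not part of PGBM.

```python
from math import sqrt
from statistics import NormalDist

# Placeholder values standing in for the per-sample (mean, variance) that
# predict_dist returns with output_sample_statistics=True. They are made up
# for illustration and do not come from a fitted model.
mean = [10.0, 12.5, 8.0]
variance = [4.0, 1.0, 0.25]


def normal_intervals(mean, variance, coverage=0.9):
    """Return a central prediction interval per sample, assuming a Normal distribution."""
    lower_q = (1.0 - coverage) / 2.0
    intervals = []
    for mu, var in zip(mean, variance):
        # NormalDist takes a standard deviation, so convert the variance first.
        dist = NormalDist(mu=mu, sigma=sqrt(var))
        intervals.append((dist.inv_cdf(lower_q), dist.inv_cdf(1.0 - lower_q)))
    return intervals


intervals = normal_intervals(mean, variance)
for mu, (low, high) in zip(mean, intervals):
    print(f"mean={mu:.2f}  90% interval=({low:.2f}, {high:.2f})")
```

The Normal distribution is only one possible choice here; any distribution that can be parameterised by a mean and a variance could be substituted.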

Hope this helps, let me know,

Olivier

flippercy commented 3 years ago

Thank you Olivier. Appreciate your help!

elephaint commented 3 years ago

Great, happy to help!

flippercy commented 3 years ago

Hi @elephaint:

Have the upgrades been implemented yet? I've upgraded my library to 1.0 but have seen no change.

yhat_dist_pgbm = model.predict_dist(data_val_B_X, n_forecasts=100, output_sample_statistics=True)

TypeError: predict_dist() got an unexpected keyword argument 'output_sample_statistics'

print(inspect.getargspec(model.predict_dist))

ArgSpec(args=['self', 'X', 'n_forecasts', 'parallel'], varargs=None, keywords=None, defaults=(100, True))

print(PGBM._version)

1
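The check above uses inspect.getargspec, which was removed in Python 3.11. A minimal sketch of the same check with inspect.signature follows. `OldModel` and `NewModel` are hypothetical stand-ins that only mimic the two predict_dist signatures discussed in this thread, and `accepts_keyword` is an illustrative helper, not part of PGBM.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if func can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    if name in params and params[name].kind in keyword_kinds:
        return True
    # A **kwargs parameter accepts any keyword name.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Hypothetical stand-ins: they only reproduce the argument lists from this
# thread and do no actual prediction.
class OldModel:
    def predict_dist(self, X, n_forecasts=100, parallel=True):
        return None


class NewModel:
    def predict_dist(self, X, n_forecasts=100, parallel=True,
                     output_sample_statistics=False):
        return None


print(accepts_keyword(OldModel().predict_dist, "output_sample_statistics"))
print(accepts_keyword(NewModel().predict_dist, "output_sample_statistics"))
```

Running the helper against the real model.predict_dist would show whether the imported copy of the library is the old or the new one.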

Thank you!

elephaint commented 3 years ago

Hi,

How unfortunate, that is very strange. The predict_dist function should support an output_sample_statistics keyword argument. It is available in both the PyTorch and Numba versions (I've rechecked the source code and it should be in there...)

If you do pip list in the (virtual) Python environment where you installed PGBM, what version of PGBM is listed?

Did you make sure to force the upgrade, for example by doing pip install pgbm --force-reinstall?

flippercy commented 3 years ago

Thank you for the quick turnaround. This is what I got after running pip install pgbm --force-reinstall:

image

image

So it turns out that the new argument is still missing.

elephaint commented 3 years ago

Strange and frustrating! I've (i) (re)installed from PyPI in a new virtual environment, (ii) installed on a different PC with a new Python virtual environment, and (iii) run it in Google Colab, and I still can't reproduce the issue. I've tested on Windows 10 (desktop), macOS (MacBook Pro) and Linux (Colab), for both the Numba and Torch versions.

Can you run this example in Google Colab? In that example, you should be able to call the predict_dist function with the output_sample_statistics argument, i.e. output = model.predict_dist(X, output_sample_statistics=True)

I'm trying to think of what could go wrong... Are you certain that !pip list is executed from the same (virtual) environment as the one in which you run PGBM? What kind of (Python) setup are you running? It feels as if a cached version is left somewhere in the environment, and that the code picks it up when executing the inspect calls.
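One way to test the "different environment or stale copy" hypothesis is to ask the running interpreter itself what it sees, rather than relying on pip list. The sketch below uses only the standard library; `describe_install` is a hypothetical helper, and it assumes the distribution name and the import name are the same (as with pgbm).

```python
import importlib.util
import sys
from importlib.metadata import PackageNotFoundError, version


def describe_install(package_name):
    """Report which interpreter is running and which copy of a package it sees."""
    try:
        installed_version = version(package_name)
    except PackageNotFoundError:
        # No distribution metadata for this name in the current environment.
        installed_version = None

    # find_spec locates the module the import system would load, without importing it.
    spec = importlib.util.find_spec(package_name)
    location = spec.origin if spec is not None else None

    return {
        "executable": sys.executable,
        "version": installed_version,
        "location": location,
    }


if __name__ == "__main__":
    for key, value in describe_install("pgbm").items():
        print(f"{key}: {value}")
```

If this is run inside the notebook kernel and the reported executable or location differs from the environment where pip installed the package, the kernel is importing a different copy.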

flippercy commented 3 years ago

It is weird. I reinstalled JupyterHub on my Linux server and it works now. Not sure what happened.

Thank you very much for your help!

elephaint commented 3 years ago

Great! Python's package manager is a mystery sometimes....