Blue-Yonder-OSS / cyclic-boosting

implementation of Cyclic Boosting machine learning algorithms
Eclipse Public License 2.0

make dependency on numba-scipy optional #21

Closed kjyv closed 1 year ago

kjyv commented 1 year ago

When importing the pipeline as in the sample, without calling any further code, I see this:

from cyclic_boosting.pipelines import pipeline_CBPoissonRegressor
  File "<...>/lib/python3.10/site-packages/cyclic_boosting/__init__.py", line 40, in <module>
    from cyclic_boosting.nbinom import CBNBinomC
  File "<...>/lib/python3.10/site-packages/cyclic_boosting/nbinom.py", line 9, in <module>
    import numba_scipy.special  # noqa
  File "<...>/lib/python3.10/site-packages/numba_scipy/special/__init__.py", line 1, in <module>
    from . import overloads as _overloads
  File "<...>/lib/python3.10/site-packages/numba_scipy/special/overloads.py", line 4, in <module>
    from . import signatures
  File "<...>/lib/python3.10/site-packages/numba_scipy/special/signatures.py", line 376, in <module>
    ('pdtr', numba.types.float64, numba.types.float64): ctypes.CFUNCTYPE(ctypes.c_double, ctypes.c_double, ctypes.c_double)(get_cython_function_address('scipy.special.cython_special', '__pyx_fuse_0pdtr')),
  File "<...>/lib/python3.10/site-packages/numba/core/extending.py", line 466, in get_cython_function_address
    return _import_cython_function(module_name, function_name)
ValueError: No function '__pyx_fuse_0pdtr' found in __pyx_capi__ of 'scipy.special.cython_special'

Python 3.10.5, scipy 1.10.1, macOS 13.3.1

FelixWick commented 1 year ago

I assume this is due to a version mismatch between your installed numba-scipy and scipy. Can you try with the versions in https://github.com/Blue-Yonder-OSS/cyclic-boosting/blob/main/poetry.lock?
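
(For reference, the installed versions can be compared against the lock file with, e.g.:)

from importlib.metadata import version

# Show the installed versions of the two packages involved in the mismatch.
print(version("scipy"), version("numba-scipy"))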

kjyv commented 1 year ago

It seems you're right: numba-scipy 0.3.1 is not compatible with the scipy version. But now I'm forced to downgrade scipy to 1.7.3, which in turn is not compatible with the numpy and pandas versions we're using in the project I want to try this in. Why require such an old scipy version?

FelixWick commented 1 year ago

Because numba-scipy unfortunately requires scipy<=1.7.3

kjyv commented 1 year ago

Ah, I see. Not sure about your internals, but maybe the numba and numba-scipy dependencies could be made optional, since numba itself also often lags a bit behind e.g. Python updates. We had some issues with not being able to update mlflow because of that, so they're making some dependencies optional now.

(There is an issue and pull request about this https://github.com/numba/numba-scipy/issues/88)
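
(A common pattern for such an optional dependency is to guard the import and degrade gracefully; this is just a sketch, and HAS_NUMBA_SCIPY is an invented name:)

try:
    import numba_scipy.special  # noqa: F401  # registers scipy.special overloads for numba
    HAS_NUMBA_SCIPY = True
except (ImportError, ValueError):
    # ValueError covers the __pyx_capi__ mismatch shown in the traceback above
    HAS_NUMBA_SCIPY = False

Code paths that need gammaln under numba could then raise a clear error at use time instead of failing at package import.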

FelixWick commented 1 year ago

Yeah, in fact it is blocking Python 3.11 for us at the moment. Let me check options. Thanks!

kjyv commented 1 year ago

This is still an issue, isn't it?

StefanUlbrich commented 1 year ago

Can we get rid of numba_scipy altogether? We only need it for gammaln, and maybe we could manually create an overload for that function. This post links to a project that does this, and for one single function we could duplicate the code. I'll have a look at the code to estimate the effort it would take.
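
For illustration, a single hand-rolled overload mirroring numba-scipy's approach might look like the sketch below (assuming scipy still exports a non-fused gammaln in cython_special; the helper names are invented):

import ctypes

import scipy.special as sc
from numba.extending import get_cython_function_address, overload

# Bind scipy's C implementation of gammaln (double -> double).
_addr = get_cython_function_address("scipy.special.cython_special", "gammaln")
_gammaln_f64 = ctypes.CFUNCTYPE(ctypes.c_double, ctypes.c_double)(_addr)


@overload(sc.gammaln)
def _gammaln_overload(x):
    # Make sc.gammaln callable from @njit code for float64 scalars.
    def impl(x):
        return _gammaln_f64(x)
    return impl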

Edit:

import numba as nb
import numpy as np
import scipy.special as sc
import numba_scipy.special  # noqa: F401  # registers the gammaln overload used below


@nb.njit()
def nbinom_log_pmf(x: nb.float64, n: nb.float64, p: nb.float64) -> nb.float64:
    """
    Negative binomial log PMF.
    """
    # log of the combinatorial factor Gamma(n + x) / (Gamma(x + 1) * Gamma(n))
    coeff = sc.gammaln(n + x) - sc.gammaln(x + 1) - sc.gammaln(n)
    return coeff + n * np.log(p) + x * np.log(1 - p)

We need gammaln only for scalars. @FelixWick, can we assume $n$ and $x$ are positive integers (it says PMF)? Then we could just replace the code with some loops and observe the impact on performance.
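
(If both arguments were positive integers, a loop-based replacement could be as simple as this sketch; gammaln_int is an invented name:)

import math

import numba as nb


@nb.njit()
def gammaln_int(k):
    # For a positive integer k: gammaln(k) = log((k - 1)!) = sum of log(i), i = 2..k-1
    acc = 0.0
    for i in range(2, k):
        acc += math.log(i)
    return acc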

2nd Edit:

I just created a draft for that and will continue working on it later: https://github.com/Blue-Yonder-OSS/cyclic-boosting/pull/27

FelixWick commented 1 year ago

Yes, we use numba_scipy only for gammaln, and it is only needed in the negative binomial width mode. I'm fine with replacing it with our own implementation, if we make sure the impact on performance is minor. To your question: x is a positive integer; n is positive, but float.

StefanUlbrich commented 1 year ago

Are there any other constraints we have for $n$? I added a notebook to the PR with a quite fast implementation of the Stirling approximation, which doesn't seem to be too accurate for $n<2$. Can you have a look at the new notebook please and continue the discussion? The more elaborate Lanczos approximation is too complex, and I could not get the second approximation mentioned here to run.

Edit: With a combination of Stirling and the other method, we can get ~1e-5 precision for gammaln for arbitrary values, at slightly faster speed than scipy. Would that be sufficient?

Edit2: Continued analysis in the notebook; learned a lot (and even more than I wanted) about numerical approximations. If the precision is acceptable (@FelixWick, can you write a test case for that?), we can use this method. I added a "calculator" at the end that allows computing the values required for higher precision (this will slow down computation for values < 2.0). Also, we can add a test case that uses timeit to make sure we don't regress performance compared to scikit-learn.
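
For orientation, the core of a Stirling-series gammaln with an argument-shift recurrence might look like this (a sketch only; gammaln_stirling is an invented name, and the notebook's actual implementation may differ):

import math

import numba as nb


@nb.njit()
def gammaln_stirling(x):
    # Shift small arguments into the accurate range via gammaln(x) = gammaln(x + 1) - log(x).
    shift = 0.0
    while x < 8.0:
        shift -= math.log(x)
        x += 1.0
    # Stirling series: (x - 1/2) log x - x + log(2 pi) / 2 + 1/(12 x) - 1/(360 x^3)
    return (shift + (x - 0.5) * math.log(x) - x
            + 0.5 * math.log(2.0 * math.pi)
            + 1.0 / (12.0 * x) - 1.0 / (360.0 * x ** 3))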

FelixWick commented 1 year ago

Thanks! Will have a closer look at the notebook. No, no further constraints. The precision sounds ok. But yeah, let’s do some variations of the width model integration test and check the impact on c. And yes, please add a performance test. It sounds like we will be fine here, but this is also very useful in general.
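
(A precision test along those lines could look like this hedged sketch, where gammaln_stirling refers to the illustration above:)

import numpy as np
from scipy.special import gammaln


def test_gammaln_precision():
    # Compare the approximation against scipy's reference over a wide range.
    x = np.linspace(0.1, 1000.0, 10_000)
    approx = np.array([gammaln_stirling(v) for v in x])
    assert np.max(np.abs(approx - gammaln(x))) < 1e-5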

FelixWick commented 1 year ago

After having a closer look, I'm in favor of an optional dependency on numba-scipy for users who need nbinom width (with high accuracy).