[MNT] Add joblib backend option and set default to all parallelized estimators

aeon-toolkit / aeon

A toolkit for machine learning from time series

https://aeon-toolkit.org/

BSD 3-Clause "New" or "Revised" License

955 stars 109 forks source link

[MNT] Add joblib backend option and set default to all parallelized estimators #1797

Open baraline opened 2 months ago

baraline commented 2 months ago

Describe the issue

Some of the estimators that use joblib for parallelization use process-based backend, while other use threads-based backend. Ideally, we want this to be a parameter tunable by the users.

Suggest a potential alternative/fix

Including a joblib_backend parameter which would default to threading (from discussions with Matthew), and use this parameter to set the joblib backend parameter during Parallel calls would fix the issue.

TonyBagnall commented 2 months ago

I remember we looked at this in some detail when debugging some rocket features, was there some interaction with its use of prange? I cant remember!


def _static_transform_uni(X, parameters, indices):
    """Transform a 2D collection of univariate time series.

    Implemented separately to the multivariate version for numba efficiency reasons.
    See issue #1778.
    """
    n_cases, n_timepoints = X.shape
    (
        _,
        _,
        dilations,
        n_features_per_dilation,
        biases,
    ) = parameters
    n_kernels = len(indices)
    n_dilations = len(dilations)
    f = n_kernels * np.sum(n_features_per_dilation)
    features = np.zeros((n_cases, f), dtype=np.float32)
    for i in prange(n_cases):

CodeLionX commented 1 month ago

I think adding this parameter to so many APIs would clutter the parameter lists. What are your thoughts on using joblib.parallel_config? This means that all aeon code does not specify any parallel backend in their calls to joblib.Parallel (if it does not require a certain backend) and the user can choose its backend by wrapping calls within:

with joblib.parallel_config(backend="loky"):
   # aeon code

The same is actually possible for n_jobs etc.

baraline commented 1 month ago

True, forgot about this option ! That or we set a global variable such as AEON_JOBLIB_BACKEND and document its usage

CodeLionX commented 1 month ago

A custom env var AEON_JOBLIB_BACKEND IMO just makes sense if it is also used for other parallel stuff besides joblib. Otherwise, we can use the existing (documented and potentially already known) joblib-facilities.

MatthewMiddlehurst commented 1 month ago

Sounds like a good idea, would have to be documented though. Not sure about removing n_jobs (not that it was really suggested), but would be good to tidy up the other bits.

baraline commented 1 month ago

I'm OK with the use of the parallel_config option and let user warp estimators if they want to change from the default. But if we want to default to threads, we need to do the following, which look like parallel_config doesn't overwrite :

with parallel_config(backend='loky'):
    p = Parallel(backend='threading')
    print(p._backend)

You obtain <joblib._parallel_backends.ThreadingBackend object at 0x000001D91BB731F0>

This means that if we want the parallel_config option to work, we would need to use default Parallel(), which use the loky backend. This is exactly the one we want to move away from for multiple reasons (check https://scikit-learn.fondation-inria.fr/joblib-sprint/ for some).

Or did I miss something ?

CodeLionX commented 1 month ago

ah, damn ... yes, Parallel() uses the loky-backend by default. Then, I don't see how parallel_config would help us here.

baraline commented 1 month ago

So for options we have :

Add a parameter to specify backend, which defaults to threads for all estimators with parallel capabilities
Add an environment variable to specify backend package wide.

With an equal level of documentation, I think the first options would allow us to be more flexible if for whatever reason we need to use processes for some estimators and threads for others. Also it should be less prone to causing issues with existing methods.

But true, this adds a bit of parameter bloat to the API... We could introduce a dictionary regrouping all joblib params and use it as kwargs when calling parallel ?

Other ideas are welcome !

CodeLionX commented 1 month ago

I'm also in favor of 1. How many joblib parameters do we have? If there are just 2, I would not use kwargs for that.