beringresearch / ivis

Dimensionality reduction in very large datasets using Siamese Networks
https://beringresearch.github.io/ivis/
Apache License 2.0

Ivis seems to cause errors when composed into a sklearn.pipeline.Pipeline passed to sklearn.model_selection.GridSearchCV and executed in parallel #96

Closed imatheussm closed 3 years ago

imatheussm commented 3 years ago

The problem

I noticed that when Ivis is part of a sklearn.pipeline.Pipeline that is passed to sklearn.model_selection.GridSearchCV to fine-tune hyper-parameters across all estimators/transformers, and GridSearchCV has n_jobs=-1 (i.e., when the executions within GridSearchCV are parallel), errors are thrown. This does not happen with n_jobs=1 (i.e., when the executions within GridSearchCV are sequential).

Since GridSearchCV regulates parallelism globally through its n_jobs parameter and does not support parallelizing only specific steps, this problem forces the global use of n_jobs=1, which considerably slows down the fine-tuning process by under-using the computational power of the machine running the script (even in parts where n_jobs=-1 would work).

Environment

A virtual environment was created specifically for this repository, in which all modules listed in requirements.txt were installed. My setup runs an up-to-date version of Windows 10 (no WSL).

Runtime

python=3.8.4

Relevant modules

ivis=2.0.3
tensorflow=2.5.0

Minimal reproducible example

Code

if __name__ == "__main__":
    import tempfile

    from os import environ

    # must be set before importing ivis (which imports tensorflow) for the log level to take effect
    environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

    import ivis

    from sklearn import datasets, ensemble, model_selection, pipeline, preprocessing

    X, y = datasets.load_iris(return_X_y=True)

    pipeline_with_ivis = pipeline.Pipeline([
        ("normalize", preprocessing.MinMaxScaler()),
        ("project", ivis.Ivis()),
        ("classify", ensemble.RandomForestClassifier()),
    ], memory=tempfile.mkdtemp())

    parameter_grid = {
        "project__k": (15,),
        "project__verbose": (True,),

        "classify__random_state": (2021,)
    }

    grid_search = model_selection.GridSearchCV(pipeline_with_ivis, parameter_grid, scoring="accuracy", cv=10, n_jobs=-1,
                                               return_train_score=True, verbose=3).fit(X, y)

Error

<REPOSITORY_ROOT>\venv\lib\site-packages\sklearn\model_selection\_validation.py:615: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
Traceback (most recent call last):
  File "<REPOSITORY_ROOT>\ivis\data\neighbour_retrieval\knn.py", line 212, in extract_knn
    process.start()
  File "C:\Python38\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Python38\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\joblib\externals\loky\backend\process.py", line 39, in _Popen
    return Popen(process_obj)
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\joblib\externals\loky\backend\popen_loky_win32.py", line 70, in __init__
    child_env.update(process_obj.env)
AttributeError: 'KnnWorker' object has no attribute 'env'

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\sklearn\model_selection\_validation.py", line 598, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\sklearn\pipeline.py", line 341, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\sklearn\pipeline.py", line 303, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\joblib\memory.py", line 591, in __call__
    return self._cached_call(args, kwargs)[0]
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\joblib\memory.py", line 534, in _cached_call
    out, metadata = self.call(*args, **kwargs)
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\joblib\memory.py", line 761, in call
    output = self.func(*args, **kwargs)
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\sklearn\pipeline.py", line 754, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "<REPOSITORY_ROOT>\ivis\ivis.py", line 350, in fit_transform
    self.fit(X, Y, shuffle_mode)
  File "<REPOSITORY_ROOT>\ivis\ivis.py", line 328, in fit
    self._fit(X, Y, shuffle_mode)
  File "<REPOSITORY_ROOT>\ivis\ivis.py", line 190, in _fit
    self.neighbour_matrix = AnnoyKnnMatrix.build(X, path=self.annoy_index_path,
  File "<REPOSITORY_ROOT>\ivis\data\neighbour_retrieval\knn.py", line 63, in build
    return cls(index, X.shape, path, k, search_k, precompute, include_distances, verbose)
  File "<REPOSITORY_ROOT>\ivis\data\neighbour_retrieval\knn.py", line 48, in __init__
    self.precomputed_neighbours = self.get_neighbour_indices()
  File "<REPOSITORY_ROOT>\ivis\data\neighbour_retrieval\knn.py", line 96, in get_neighbour_indices
    return extract_knn(
  File "<REPOSITORY_ROOT>\ivis\data\neighbour_retrieval\knn.py", line 236, in extract_knn
    process.terminate()
  File "C:\Python38\lib\multiprocessing\process.py", line 133, in terminate
    self._popen.terminate()
AttributeError: 'NoneType' object has no attribute 'terminate'
  warnings.warn("Estimator fit failed. The score on this train-test"

[...]

<REPOSITORY_ROOT>\venv\lib\site-packages\sklearn\model_selection\_search.py:922: UserWarning: One or more of the test scores are non-finite: [nan]
  warnings.warn(

Discussion

By coding and playing with the example above, I came to the understanding that, since sklearn uses joblib and ivis uses multiprocessing, these modules might not be playing well with each other for some reason.

I would rule out the hypothesis that nested estimators/transformers with parallel routines are the problem: estimators like sklearn.ensemble.RandomForestClassifier can be set to n_jobs=-1 without problems within the Pipeline passed to GridSearchCV.

I am particularly affected by this issue because I want to employ ivis in projects that involve hyper-parameter fine-tuning using cross-validation via GridSearchCV with concurrent executions. I attempted to diagnose the problem, but to no avail, which is why I bring this issue to your attention.

Observation: another part of this problem is a design choice that does not adhere to the sklearn API guidelines; I propose and detail a solution in #95. That issue does not cause the aforementioned error, but might cause other errors that could affect the same use scenario (a Pipeline within GridSearchCV running in parallel).

idroz commented 3 years ago

Thanks very much for the example - I was able to reproduce the issue and am looking into it.

Meanwhile, setting precompute=False in the Ivis constructor seems to do the trick:

pipeline_with_ivis = pipeline.Pipeline([
    ("normalize", preprocessing.MinMaxScaler()),
    ("project", ivis.Ivis(precompute=False)),
    ("classify", ensemble.RandomForestClassifier()),
], memory=tempfile.mkdtemp())

One thing to keep in mind is that passing the X and y pair into ivis will force it into supervised dimensionality reduction (https://bering-ivis.readthedocs.io/en/latest/supervised.html). If you want to disable the effect of supervision on ivis embeddings, set supervision_weight=0 in the constructor. This is a side effect of scikit-learn pipelines, which propagate the (X, y) pair to the fit methods of all elements of the pipeline.

You could also configure it during your grid search, as sketched below - it would be cool to see its impact on the downstream classifier!
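
For instance, a rough sketch of that, extending the parameter grid from your example (the supervision_weight values here are only illustrative):

parameter_grid = {
    "project__k": (15,),
    "project__verbose": (True,),
    "project__supervision_weight": (0.0, 0.5),  # 0.0 ignores the labels; 0.5 is the default blend

    "classify__random_state": (2021,)
}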

Let us know if this solves the issue.

Szubie commented 3 years ago

As luck would have it, I have been looking into removing the multiprocessing dependency in ivis in favor of threading recently. I've pushed a commit to master with the changes (https://github.com/beringresearch/ivis/commit/a666e75cbbacbe9a1fa0051fcb7508d67b2069d0). It seems to fix this issue when "precompute" is True.

imatheussm commented 3 years ago

Meanwhile, setting precompute=False in the Ivis constructur seems to do the trick:

Hmm, I confess it never occurred to me to play with this constructor parameter. I will test out your suggestion later and see how it works out on my end as a palliative measure.

One thing to keep in mind is that passing the X and y pair into ivis will force it into supervised dimensionality reduction (https://bering-ivis.readthedocs.io/en/latest/supervised.html). If you want to disable the effect of supervision on ivis embeddings, you should set supervision_weight=0 in the constructor.

I am aware of that, and for my particular use case the use of supervised DR is intentional. Still, it is nice to know that there is a way to make Ivis ignore the labels without affecting the API (which would affect, consequently, Pipeline and GridSearchCV).

As luck would have it, I have been looking into removing the multiprocessing dependency in ivis in favor of threading recently. I've pushed a commit to master with the changes (a666e75). It seems to fix this issue when "precompute" is True.

Nice! I will test Ivis with your commit later as well and see how it works out on my end.

imatheussm commented 3 years ago

Just a quick update: I am still testing ivis on the minimal reproducible example, as well as on a pipeline I have been working on. I still managed to find some errors, but they seem to happen only when I run GridSearchCV with n_jobs=-1 inside a docker container. I am currently ascertaining whether this is a docker problem rather than an ivis one.

If it helps, here is the error I have been seeing under docker. It seems to happen whenever I run GridSearchCV with n_jobs != 1. It runs for some time without any problems, and then this happens:

free(): invalid pointer
exception calling callback for <Future at 0x7f907840c220 state=finished raised TerminatedWorkerError>
Traceback (most recent call last):
  File "/opt/venv/lib/python3.9/site-packages/joblib/externals/loky/_base.py", line 625, in _invoke_callbacks
    callback(self)
  File "/opt/venv/lib/python3.9/site-packages/joblib/parallel.py", line 359, in __call__
    self.parallel.dispatch_next()
  File "/opt/venv/lib/python3.9/site-packages/joblib/parallel.py", line 792, in dispatch_next
    if not self.dispatch_one_batch(self._original_iterator):
  File "/opt/venv/lib/python3.9/site-packages/joblib/parallel.py", line 859, in dispatch_one_batch
    self._dispatch(tasks)
  File "/opt/venv/lib/python3.9/site-packages/joblib/parallel.py", line 777, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/opt/venv/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 531, in apply_async
    future = self._workers.submit(SafeFunction(func))
  File "/opt/venv/lib/python3.9/site-packages/joblib/externals/loky/reusable_executor.py", line 177, in submit
    return super(_ReusablePoolExecutor, self).submit(
  File "/opt/venv/lib/python3.9/site-packages/joblib/externals/loky/process_executor.py", line 1102, in submit
    raise self._flags.broken
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
The exit codes of the workers are {SIGABRT(-6)}

I tried to find more information on this on the web, but the only thing I was able to find that resembles this problem was this unanswered question on StackOverflow. It seems to happen whenever the Pipeline includes ivis, and it does not seem to happen with other projectors (e.g., UMAP, PCA) in the few tests I made, which makes me wonder whether ivis plays a part in this. If pertinent, I will produce another minimal reproducible example involving docker for you to try to reproduce this error on your end.

As I said, I am still running tests, so take everything I said above with a pinch of salt. And before I forget, thank you for the diligence with which you assisted me in solving this issue. I really appreciate it.

imatheussm commented 3 years ago

After some testing, I managed to reproduce this problem using the same code above, but running it inside a docker image built from the following Dockerfile:

# ----------------------------------- #
# BUILD IMAGE STAGE                   #
# ----------------------------------- #
FROM python:3.9.5-slim as build-image
# ----------------------------------- #

# install required binaries
RUN apt-get update \
&& apt-get install --no-install-recommends -y build-essential git

# create python virtual environment and upgrade pip
RUN python3 -m venv /opt/venv \
&& /opt/venv/bin/python3 -m pip install --upgrade pip --no-cache-dir

# use created python virtual environment
ENV PATH="/opt/venv/bin:$PATH"

# install wheel and cmake
RUN pip install wheel --no-cache-dir \
&& pip install cmake --no-cache-dir

# copy requirements.txt to container
COPY requirements.txt .

# install required python modules
RUN pip install -r requirements.txt --no-cache-dir

# ---------------------------------------- #
# PRODUCTION IMAGE STAGE                   #
# ---------------------------------------- #
FROM python:3.9.5-slim as production-image
# ---------------------------------------- #

# copy previously created python virtual environment over
COPY --from=build-image /opt/venv /opt/venv

# use copied python virtual environment
ENV PATH="/opt/venv/bin:$PATH"

# persist files
ADD . .

I do not know if this is a problem within ivis, because it runs natively without problems. My current belief is that this is caused by some kind of OS protection (either in Windows or in the Linux distro within the container) that is killing some processes.

idroz commented 3 years ago

Can I just clarify whether you're seeing this issue whilst running Ivis inside Docker, or does it also throw this error when running natively on Windows?

imatheussm commented 3 years ago

So far, it has only happened inside docker. The script runs for some seconds and then gets terminated, with the SIGABRT error shown above.

Running natively on Windows (i.e., without docker), no matter how heavy or long the script is, it runs without issues.

Szubie commented 3 years ago

Hi, this appears to be an issue when using n_jobs=-1 with some scikit-learn objects. For some discussion of similar issues, see: https://github.com/scikit-learn-contrib/skope-rules/issues/18

I changed n_jobs from -1 to -2 as suggested in that thread and it fixed the issue for me, at least on the basic iris example you provided above. Not sure if the same fix will work on your machine as well, but it is worth a try.
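
For reference, a sketch of the change against the grid search in your example - only the n_jobs argument differs, the other arguments are kept as you had them:

# n_jobs=-2 uses all cores but one, which avoided the worker crashes I saw with n_jobs=-1
grid_search = model_selection.GridSearchCV(pipeline_with_ivis, parameter_grid, scoring="accuracy", cv=10,
                                           n_jobs=-2, return_train_score=True, verbose=3).fit(X, y)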

This may also be useful regarding issues when nesting multiple n_jobs=-1 arguments within scikit-learn pipelines (such nesting is probably best avoided): https://stackoverflow.com/questions/60782660/issues-with-multiple-jobs-when-using-randomizedsearchcv

Oh yeah, and make sure the Docker container has enough memory allocated to run the task. If on Windows, the default is quite low. https://stackoverflow.com/questions/43460770/docker-windows-container-memory-limit

idroz commented 3 years ago

Echoing the above point about docker and RAM: I've seen a similar error in docker when the container runs out of allocated RAM. GridSearchCV will by default copy the data across all processes, causing a RAM explosion (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). Reducing pre_dispatch (the default is 2*n_jobs) can help, but giving docker more RAM does the trick, at least for me.
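
As a sketch of what I mean, reusing the objects from the example above (the "1*n_jobs" value is only an illustration; scikit-learn accepts either an int or an expression string here):

# dispatch fewer batches at once so fewer copies of the data are held in memory;
# the scikit-learn default for pre_dispatch is "2*n_jobs"
grid_search = model_selection.GridSearchCV(pipeline_with_ivis, parameter_grid, scoring="accuracy", cv=10,
                                           n_jobs=-1, pre_dispatch="1*n_jobs",
                                           return_train_score=True, verbose=3).fit(X, y)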

imatheussm commented 3 years ago

Hmm, it never occurred to me that this could be a memory issue. Thank you for the clarification on this matter and for solving the issue with GridSearchCV, I really appreciate it. Feel free to close this issue if there is nothing else to be added.