beringresearch / ivis

Dimensionality reduction in very large datasets using Siamese Networks
https://beringresearch.github.io/ivis/
Apache License 2.0

`KeyError` followed by `joblib.externals.loky.process_executor.BrokenProcessPool` when using `sklearn.model_selection.GridSearchCV` with `n_jobs != 1` #109

Closed imatheussm closed 2 years ago

imatheussm commented 2 years ago

The issue

Errors occur whenever sklearn.model_selection.GridSearchCV is used with a sklearn.pipeline.Pipeline containing ivis.Ivis and GridSearchCV is set with n_jobs != 1 (e.g., n_jobs = 2). When n_jobs = 1, no errors occur.

Minimal reproducible example

Environment

A virtual environment was created specifically for this project, and all modules specified in requirements.txt were installed in it. My setup runs an up-to-date version of Windows 10 (no WSL). My local repository was based on commit a50b196735eecc2afc63423fe99b803107048572.

Runtime

python=3.9.5

Relevant modules

ivis=2.0.6
tensorflow=2.6.0

Example with sklearn.pipeline.Pipeline

Script

import tempfile
import ivis

from sklearn import datasets, ensemble, model_selection, pipeline, preprocessing, svm

X, y = datasets.load_iris(return_X_y=True)

# The "project" and "classify" steps are placeholders filled in by the grid below.
pipeline_with_ivis = pipeline.Pipeline([
    ("normalize", preprocessing.MinMaxScaler()),
    ("project", None),
    ("classify", None),
], memory=tempfile.mkdtemp())

# An unfitted Ivis instance is passed as a grid value, so it gets pickled
# when candidates are dispatched to worker processes.
parameter_grid = {
    "project": (ivis.Ivis(verbose=0),),
    "project__k": (15,),

    "classify": (ensemble.RandomForestClassifier(), svm.SVC()),
    "classify__random_state": (2021,)
}

# n_jobs=2 triggers the error; n_jobs=1 runs without problems.
grid_search = model_selection.GridSearchCV(pipeline_with_ivis, parameter_grid, scoring="accuracy", cv=10, verbose=3,
                                           return_train_score=True, error_score='raise', n_jobs=2).fit(X, y)

Log with errors

Fitting 10 folds for each of 2 candidates, totalling 20 fits
exception calling callback for <Future at 0x29b2a280ca0 state=finished raised BrokenProcessPool>
joblib.externals.loky.process_executor._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\joblib\externals\loky\process_executor.py", line 404, in _process_worker
    call_item = call_queue.get(block=True, timeout=timeout)
  File "<USER_FOLDER>\AppData\Local\Programs\Python\Python39\lib\multiprocessing\queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "<REPOSITORY_ROOT>\ivis\ivis.py", line 211, in __setstate__
    if state[key] is not None:
KeyError: 'ivis_params_'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\joblib\externals\loky\_base.py", line 625, in _invoke_callbacks
    callback(self)
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\joblib\parallel.py", line 359, in __call__
    self.parallel.dispatch_next()
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\joblib\parallel.py", line 792, in dispatch_next
    if not self.dispatch_one_batch(self._original_iterator):
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\joblib\parallel.py", line 859, in dispatch_one_batch
    self._dispatch(tasks)
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\joblib\parallel.py", line 777, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\joblib\_parallel_backends.py", line 531, in apply_async
    future = self._workers.submit(SafeFunction(func))
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\joblib\externals\loky\reusable_executor.py", line 177, in submit
    return super(_ReusablePoolExecutor, self).submit(
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\joblib\externals\loky\process_executor.py", line 1102, in submit
    raise self._flags.broken
joblib.externals.loky.process_executor.BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.
ERROR: The process with PID 29140 (child process of PID 27848) could not be terminated.
Reason: There is no running instance of the task.
joblib.externals.loky.process_executor._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\joblib\externals\loky\process_executor.py", line 404, in _process_worker
    call_item = call_queue.get(block=True, timeout=timeout)
  File "<USER_FOLDER>\AppData\Local\Programs\Python\Python39\lib\multiprocessing\queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "<REPOSITORY_ROOT>\ivis\ivis.py", line 211, in __setstate__
    if state[key] is not None:
KeyError: 'ivis_params_'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "<USER_FOLDER>\AppData\Local\JetBrains\Toolbox\apps\PyCharm-P\ch-0\212.5284.44\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line 198, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "<USER_FOLDER>\AppData\Local\JetBrains\Toolbox\apps\PyCharm-P\ch-0\212.5284.44\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "<REPOSITORY_ROOT>/playground.py", line 74, in <module>
    grid_search = model_selection.GridSearchCV(pipeline_with_ivis, parameter_grid, scoring="accuracy", cv=10, verbose=3,
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\sklearn\model_selection\_search.py", line 891, in fit
    self._run_search(evaluate_candidates)
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\sklearn\model_selection\_search.py", line 1392, in _run_search
    evaluate_candidates(ParameterGrid(self.param_grid))
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\sklearn\model_selection\_search.py", line 838, in evaluate_candidates
    out = parallel(
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\joblib\parallel.py", line 1054, in __call__
    self.retrieve()
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\joblib\parallel.py", line 933, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\joblib\_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "<USER_FOLDER>\AppData\Local\Programs\Python\Python39\lib\concurrent\futures\_base.py", line 445, in result
    return self.__get_result()
  File "<USER_FOLDER>\AppData\Local\Programs\Python\Python39\lib\concurrent\futures\_base.py", line 390, in __get_result
    raise self._exception
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\joblib\externals\loky\_base.py", line 625, in _invoke_callbacks
    callback(self)
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\joblib\parallel.py", line 359, in __call__
    self.parallel.dispatch_next()
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\joblib\parallel.py", line 792, in dispatch_next
    if not self.dispatch_one_batch(self._original_iterator):
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\joblib\parallel.py", line 859, in dispatch_one_batch
    self._dispatch(tasks)
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\joblib\parallel.py", line 777, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\joblib\_parallel_backends.py", line 531, in apply_async
    future = self._workers.submit(SafeFunction(func))
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\joblib\externals\loky\reusable_executor.py", line 177, in submit
    return super(_ReusablePoolExecutor, self).submit(
  File "<REPOSITORY_ROOT>\venv\lib\site-packages\joblib\externals\loky\process_executor.py", line 1102, in submit
    raise self._flags.broken
joblib.externals.loky.process_executor.BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.

Discussion

To the best of my knowledge and local testing, this issue seems to have appeared after #101 was resolved by #106. Since the logic for pickling and unpickling Ivis instances was reworked there, my belief is that this particular use case was covered neither by the new implementation nor by the newly added test cases. I set aside some time today to investigate this further, but so far I have had no luck. Do you have any thoughts on this? Perhaps we could also create additional tests covering scenarios in which Pipeline and GridSearchCV are used. Does this make sense?

Thank you in advance for your support.
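As a rough illustration of the kind of test suggested above, the sketch below round-trips an unfitted estimator through pickle and checks that its parameters survive. StubEstimator and check_unfitted_pickle_roundtrip are hypothetical names introduced only for this example; they are not part of ivis or its test suite.

```python
import pickle


class StubEstimator:
    """Hypothetical minimal estimator exposing a scikit-learn style get_params()."""

    def __init__(self, k=15):
        self.k = k

    def get_params(self, deep=True):
        return {"k": self.k}


def check_unfitted_pickle_roundtrip(estimator):
    """Pickle an estimator before fit() is called and verify its parameters survive."""
    restored = pickle.loads(pickle.dumps(estimator))
    assert type(restored) is type(estimator)
    assert restored.get_params() == estimator.get_params()
    return restored


restored = check_unfitted_pickle_roundtrip(StubEstimator(k=15))
```

In a real test the stub would presumably be replaced by an unfitted ivis.Ivis instance, assuming its parameters can be compared by value.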

Szubie commented 2 years ago

I think this happens because the ivis models are being pickled just after creation, before they are fitted. The current logic for save_model and load_model doesn't allow saving models that aren't fitted. Probably the cleanest way to fix this is to add that logic to those methods. Will have a closer look.
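The failure mode described here can be reproduced with a toy class. This is only a sketch of the general pattern, assuming a `__setstate__` that indexes a fitted attribute directly; FragileModel, TolerantModel, params_ and roundtrip are hypothetical names, and neither class reflects the actual ivis implementation or the eventual fix.

```python
import pickle


class FragileModel:
    """Hypothetical stand-in for an estimator; not the actual ivis code."""

    def __init__(self, k=15):
        self.k = k  # hyperparameter, available before fit()

    def fit(self):
        # Fitted attributes (trailing underscore) only exist after fit().
        self.params_ = {"k": self.k}
        return self

    def __getstate__(self):
        return dict(self.__dict__)

    def __setstate__(self, state):
        # Indexing assumes the fitted attribute is always present, so
        # unpickling an unfitted instance raises KeyError.
        if state["params_"] is not None:
            state["params_"] = dict(state["params_"])
        self.__dict__.update(state)


class TolerantModel(FragileModel):
    """Same toy model, but unfitted instances survive a pickle round trip."""

    def __setstate__(self, state):
        # dict.get returns None when the key is missing instead of raising.
        if state.get("params_") is not None:
            state["params_"] = dict(state["params_"])
        self.__dict__.update(state)


def roundtrip(obj):
    """Serialize and restore an object, as happens when work is sent to another process."""
    return pickle.loads(pickle.dumps(obj))
```

Calling roundtrip(FragileModel()) raises KeyError for the missing key, while roundtrip(TolerantModel()) returns an unfitted copy; both classes round-trip once fit() has been called.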

Szubie commented 2 years ago

Pushed a new branch https://github.com/beringresearch/ivis/tree/untrained-model-persistance that should solve this issue. I still need to add some tests to validate it, then I will merge and release the fix.

Thanks for raising this issue, this is definitely a nice feature to have.

Szubie commented 2 years ago

Merged into master with https://github.com/beringresearch/ivis/commit/690d610d62be786b7a9a98debe57129bae678095

Thanks again for bringing this to our attention - slipped through the cracks in the refactor.