Thanks very much for the example - I was able to reproduce the issue and am looking into it.
Meanwhile, setting precompute=False in the Ivis constructor seems to do the trick:
pipeline_with_ivis = pipeline.Pipeline([
    ("normalize", preprocessing.MinMaxScaler()),
    ("project", ivis.Ivis(precompute=False)),
    ("classify", ensemble.RandomForestClassifier()),
], memory=tempfile.mkdtemp())
One thing to keep in mind is that passing the X and y pair into ivis will force it into supervised dimensionality reduction (https://bering-ivis.readthedocs.io/en/latest/supervised.html). If you want to disable the effect of supervision on ivis embeddings, you should set supervision_weight=0 in the constructor. This is a side-effect of scikit-learn pipelines, which propagate the (X, y) pair to the fit methods of all elements of the pipeline.
You could also configure it during your grid search - it would be cool to see its impact on the downstream classifier!
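For reference, a minimal sketch of what that could look like with the pipeline above, assuming the iris dataset; the k=15 value, the step names, and the grid values are only illustrative:

from sklearn import datasets, ensemble, model_selection, pipeline, preprocessing
import ivis

X, y = datasets.load_iris(return_X_y=True)

pipeline_with_ivis = pipeline.Pipeline([
    ("normalize", preprocessing.MinMaxScaler()),
    # k=15 just keeps the neighbour retrieval sensible on a small dataset
    ("project", ivis.Ivis(k=15, precompute=False)),
    ("classify", ensemble.RandomForestClassifier()),
])

# supervision_weight=0.0 makes ivis ignore the labels entirely; larger
# values let the labels influence the embedding more strongly
param_grid = {"project__supervision_weight": [0.0, 0.5, 1.0]}

search = model_selection.GridSearchCV(pipeline_with_ivis, param_grid)
search.fit(X, y)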
Let us know if this solves the issue.
As luck would have it, I have been looking into removing the multiprocessing dependency in ivis in favor of threading recently. I've pushed a commit to master with the changes (https://github.com/beringresearch/ivis/commit/a666e75cbbacbe9a1fa0051fcb7508d67b2069d0). It seems to fix this issue when "precompute" is True.
Meanwhile, setting precompute=False in the Ivis constructor seems to do the trick:
Hmm, I confess it never occurred to me to play with this constructor parameter. I will test out your suggestion later and see how it works out on my end as a palliative measure.
One thing to keep in mind is that passing the X and y pair into ivis will force it into supervised dimensionality reduction (https://bering-ivis.readthedocs.io/en/latest/supervised.html). If you want to disable the effect of supervision on ivis embeddings, you should set supervision_weight=0 in the constructor.
I am aware of that, and for my particular use case the use of supervised DR is intentional. Still, it is nice to know that there is a way to make Ivis ignore the labels without affecting the API (which would, consequently, affect Pipeline and GridSearchCV).
As luck would have it, I have been looking into removing the multiprocessing dependency in ivis in favor of threading recently. I've pushed a commit to master with the changes (a666e75). It seems to fix this issue when "precompute" is True.
Nice! I will test Ivis with your commit later as well and see how it works out on my end.
Just a quick update: I am still testing ivis on the minimal reproducible example, as well as on a pipeline I have been working on. I still managed to find some errors, but they seem to happen only when I am running GridSearchCV with n_jobs=-1 inside a docker container. I am still ascertaining whether this is a docker problem, and not an ivis one.
If it is of any use, here is the error I have been seeing under docker. It seems to happen whenever I run GridSearchCV with n_jobs != 1. It runs for some time without any problems, and then this happens:
free(): invalid pointer
exception calling callback for <Future at 0x7f907840c220 state=finished raised TerminatedWorkerError>
Traceback (most recent call last):
File "/opt/venv/lib/python3.9/site-packages/joblib/externals/loky/_base.py", line 625, in _invoke_callbacks
callback(self)
File "/opt/venv/lib/python3.9/site-packages/joblib/parallel.py", line 359, in __call__
self.parallel.dispatch_next()
File "/opt/venv/lib/python3.9/site-packages/joblib/parallel.py", line 792, in dispatch_next
if not self.dispatch_one_batch(self._original_iterator):
File "/opt/venv/lib/python3.9/site-packages/joblib/parallel.py", line 859, in dispatch_one_batch
self._dispatch(tasks)
File "/opt/venv/lib/python3.9/site-packages/joblib/parallel.py", line 777, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/opt/venv/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 531, in apply_async
future = self._workers.submit(SafeFunction(func))
File "/opt/venv/lib/python3.9/site-packages/joblib/externals/loky/reusable_executor.py", line 177, in submit
return super(_ReusablePoolExecutor, self).submit(
File "/opt/venv/lib/python3.9/site-packages/joblib/externals/loky/process_executor.py", line 1102, in submit
raise self._flags.broken
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
The exit codes of the workers are {SIGABRT(-6)}
I tried to find more information on this on the web, but the only thing I was able to find that resembles this problem was this unanswered question on StackOverflow. It seems to happen whenever the Pipeline includes ivis, and it does not seem to happen with other projectors (e.g., UMAP, PCA) in the few tests I made, which makes me wonder whether ivis plays a part in this. If pertinent, I will produce another minimal reproducible example involving docker for you to try to reproduce this error on your end.
As I said, I am still running tests, so take everything I said above with a pinch of salt. And before I forget, thank you for the diligence with which you assisted me in solving this issue. I really appreciate it.
After some testing, I have established that:
1. ivis.Ivis works fine natively (i.e., without docker), but running the same script within docker tends to raise the aforementioned error; and
2. I managed to reproduce this problem using the same code above, but using a docker image built from the following Dockerfile:
# ----------------------------------- #
# BUILD IMAGE STAGE #
# ----------------------------------- #
FROM python:3.9.5-slim as build-image
# ----------------------------------- #
# install required binaries
RUN apt-get update \
    && apt-get install --no-install-recommends -y build-essential git
# create python virtual environment and upgrade pip
RUN python3 -m venv /opt/venv \
    && /opt/venv/bin/python3 -m pip install --upgrade pip --no-cache-dir
# use created python virtual environment
ENV PATH="/opt/venv/bin:$PATH"
# install wheel and cmake
RUN pip install wheel --no-cache-dir \
    && pip install cmake --no-cache-dir
# copy requirements.txt to container
COPY requirements.txt .
# install required python modules
RUN pip install -r requirements.txt --no-cache-dir
# ---------------------------------------- #
# PRODUCTION IMAGE STAGE #
# ---------------------------------------- #
FROM python:3.9.5-slim as production-image
# ---------------------------------------- #
# copy previously created python virtual environment over
COPY --from=build-image /opt/venv /opt/venv
# use copied python virtual environment
ENV PATH="/opt/venv/bin:$PATH"
# copy project files into the image
ADD . .
I do not know if this is a problem within ivis, because it runs natively without problems. My current belief is that this is caused by some kind of OS protection (either from Windows or from the Linux distro within the container) that is killing some processes.
Can I just clarify whether you're seeing this issue whilst running Ivis inside Docker, or does it also throw this error when running natively on Windows?
So far, it has only happened inside docker. It runs for some seconds and then gets terminated, with this SIGABRT error being shown.
Running natively on Windows (i.e., without docker), no matter how heavy or long the script is, it runs without issues.
Hi, this appears to be an issue when using n_jobs=-1 for some scikit-learn objects. For some discussion on similar issues see this issue: https://github.com/scikit-learn-contrib/skope-rules/issues/18
I changed n_jobs from -1 to -2 as suggested in that thread and it fixed the issue for me at least on the basic iris example you provided above. Not sure if the same fix will work on your machine as well, but worth a try.
This may also be useful, regarding someone running into issues when nesting multiple n_jobs=-1 arguments within scikit-learn pipelines: https://stackoverflow.com/questions/60782660/issues-with-multiple-jobs-when-using-randomizedsearchcv (might be best to avoid).
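To make the two suggestions above concrete, here is a sketch of the adjusted search; the data, step names, and grid values are just placeholders:

from sklearn import datasets, ensemble, model_selection, pipeline, preprocessing
import ivis

X, y = datasets.load_iris(return_X_y=True)

estimator = pipeline.Pipeline([
    ("normalize", preprocessing.MinMaxScaler()),
    ("project", ivis.Ivis(k=15, precompute=False)),
    # keep the inner estimator sequential so that parallelism is not
    # nested inside the already-parallel grid search
    ("classify", ensemble.RandomForestClassifier(n_jobs=1)),
])

search = model_selection.GridSearchCV(
    estimator,
    param_grid={"classify__n_estimators": [100, 200]},
    n_jobs=-2,  # all cores but one, instead of n_jobs=-1 (all cores)
)
search.fit(X, y)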
Oh yeah, and make sure the Docker container has enough memory provided to it to run the task. If on Windows, the default is quite low. https://stackoverflow.com/questions/43460770/docker-windows-container-memory-limit
Echoing the above point about docker and RAM: I've seen a similar error in docker when the container runs out of allocated RAM. GridSearchCV by default will copy the data across all processes, causing a RAM explosion (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). Reducing pre_dispatch (the default value is 2*n_jobs) can help, but giving docker more RAM does the trick, at least for me.
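As a sketch of the pre_dispatch suggestion, reusing the estimator and param_grid from the sketch a few comments above; the string "n_jobs" is just one option, as scikit-learn also accepts expressions like "2*n_jobs" or a plain integer:

search = model_selection.GridSearchCV(
    estimator,
    param_grid={"classify__n_estimators": [100, 200]},
    n_jobs=-2,
    # dispatch at most one batch per worker instead of the default
    # "2*n_jobs", so fewer copies of the data are queued at once
    pre_dispatch="n_jobs",
)
search.fit(X, y)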
Hmm, it never occurred to me that this could be a memory issue. Thank you for the clarification on this matter and for solving the issue with GridSearchCV; I really appreciate it. Feel free to close this issue if there is nothing else to be added.
The problem
I noticed that when Ivis composes a sklearn.pipeline.Pipeline which is passed to sklearn.model_selection.GridSearchCV to fine-tune hyper-parameters across all estimators/transformers, and GridSearchCV has n_jobs=-1 (i.e., when the executions within GridSearchCV are parallel), errors are thrown. This does not happen when n_jobs=1 (i.e., when the executions within GridSearchCV are sequential).
Since Pipeline regulates the n_jobs parameter globally, thus not supporting the parallelization of only specific steps, this problem forces the global use of n_jobs=1, which noticeably slows down the fine-tuning process by underusing the computational power of the setup on which the script is being executed (even in parts where n_jobs=-1 would work).
Environment
A virtual environment was created specifically for this repository, wherein all modules described in requirements.txt were installed. My setup runs an up-to-date version of Windows 10 (no WSL).
Runtime
Relevant modules
Minimal reproducible example
Code
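A sketch of the kind of setup that triggers this, consistent with the rest of this thread (iris data, the pipeline quoted in the first reply, and a parallel GridSearchCV); the k=15 value and the grid values are illustrative:

import tempfile

from sklearn import datasets, ensemble, model_selection, pipeline, preprocessing
import ivis

X, y = datasets.load_iris(return_X_y=True)

pipeline_with_ivis = pipeline.Pipeline([
    ("normalize", preprocessing.MinMaxScaler()),
    ("project", ivis.Ivis(k=15)),
    ("classify", ensemble.RandomForestClassifier()),
], memory=tempfile.mkdtemp())

search = model_selection.GridSearchCV(
    pipeline_with_ivis,
    param_grid={"classify__n_estimators": [100, 200]},
    n_jobs=-1,  # the parallel execution that triggers the error
)
search.fit(X, y)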
Error
Discussion
By coding and playing with the example above, I acquired the understanding that, since sklearn uses joblib and ivis uses multiprocessing, these modules might not be playing well with each other for some reason.
I would discard the idea that nested estimators/transformers with parallel routines are the problem: estimators like sklearn.ensemble.RandomForestClassifier can be set to have n_jobs=-1 without problems within the Pipeline passed to GridSearchCV.
I am particularly affected by this issue because I want to employ ivis in projects that involve hyper-parameter fine-tuning using cross-validation via GridSearchCV with concurrent executions. I attempted to diagnose the problem, but to no avail, which is why I bring this issue to your attention.
Observation: another part of this problem is a design choice that is not adherent to the sklearn API guidelines, whose solution I propose and detail in #95. That issue does not cause the aforementioned error, but might cause other errors that could affect the same use scenario (Pipeline in GridSearchCV running in parallel).