Closed (kostrykin closed this 3 years ago)
Well, I can reproduce this (and I can reproduce it with pathos multiprocessing too). Off the top of my head I have no idea what's causing it. Give me a couple of days to work on it.
So sklearn, when first imported, sets an environment variable as a workaround for something. I don't know why or what it's doing.
# Workaround issue discovered in intel-openmp 2019.5:
# https://github.com/ContinuumIO/anaconda-issues/issues/11294
os.environ.setdefault("KMP_INIT_AT_FORK", "FALSE")
This environment variable is hanging MKL because it changes the behavior of Intel's OpenMP threading (the dynamic link library for MKL reads all the environment variables when first imported).
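For anyone who wants to check which environment variables an import touches, here is a minimal sketch. The helper name env_changes is my own and is not part of sklearn or sparse_dot_mkl; it only compares os.environ before and after running a callable.

```python
import os


def env_changes(action):
    """Run `action` and return {name: (before, after)} for every
    environment variable whose value differs afterwards.

    `env_changes` is a hypothetical helper for illustration only.
    A value of None means the variable was not set at that point.
    """
    before = dict(os.environ)
    action()
    after = dict(os.environ)
    return {
        name: (before.get(name), after.get(name))
        for name in before.keys() | after.keys()
        if before.get(name) != after.get(name)
    }


# Possible usage (assumes sklearn is installed and not yet imported
# in this process, since a repeated import does not re-run module code):
#
#     import importlib
#     print(env_changes(lambda: importlib.import_module("sklearn")))
```

Note that this only sees changes made through os.environ in the current process, so it has to run before the module of interest has been imported for the first time.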
I can resolve this issue by unsetting the environment variable between importing sklearn and importing mkl.
import os
import sklearn
# Remove the variable that sklearn set on import, before MKL is loaded
del os.environ['KMP_INIT_AT_FORK']
import sparse_dot_mkl as mkl
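A slightly more defensive variant of the snippet above, as a sketch only: instead of deleting the variable unconditionally, snapshot it before the import and put it back afterwards. The context manager restore_env_var is a hypothetical helper of mine, not something sklearn or sparse_dot_mkl provides, and I have not verified it against the hang itself.

```python
import os
from contextlib import contextmanager

_MISSING = object()  # sentinel meaning "variable was not set"


@contextmanager
def restore_env_var(name):
    """Restore os.environ[name] to its prior state when the block exits.

    Hypothetical helper for illustration. If the variable was unset
    before the block, it is removed again; otherwise the earlier value
    is written back.
    """
    previous = os.environ.get(name, _MISSING)
    try:
        yield
    finally:
        if previous is _MISSING:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous


# Intended usage (assumes both packages are installed):
#
#     with restore_env_var("KMP_INIT_AT_FORK"):
#         import sklearn
#     import sparse_dot_mkl as mkl
```

The difference from a plain del is that a value the user exported deliberately before starting Python survives, while the value that sklearn adds through setdefault is removed again.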
I will also write a workaround for this but I might not get to it for a couple days.
I fixed this in cdcf7c1300002e7b27b435f7f5da7cdda4eeb905 and released it as v0.5.3. I'm going to leave this open for now because the fix is ugly.
@asistradition: Despite the ugliness of the kludge in cdcf7c1, it might be useful to finally close this. After all, that commit does technically resolve the issue – right?
I suggest this because our team initially discarded sparse_dot from consideration after seeing this critical open issue pertaining to non-deterministic infinite loops. Of course, I then read the issue in its entirety and realized this isn't actually an issue anymore.
Thanks for all the continued hard work on sparse_dot. We're slowly migrating away from single-threaded scipy.sparse.linalg solvers ourselves, so everything here looks immensely relevant and awesome!
We'll probably use sparse_dot now. But we weren't going to, because of this open issue which isn't actually an issue. :man_shrugging:
Thanks, @kostrykin! You're all the awesome. :+1:
I have tested this on three different computers:
I used the following Conda environment:
The execution of the function sparse_dot_mkl.dot_product_mkl hangs forever if the multiprocessing module is used and a specific scikit-learn module (sklearn.neighbors) is imported. I could not reproduce the behavior if only multiprocessing was used or only sklearn.neighbors was imported. It happens only if both of these conditions are met.
I have written a minimal example which reproduces this behavior:
Save it as test.py and try running python3 test.py. You will see that the execution terminates in less than a second. Then try running python3 test.py hang. You will see that the execution never ends. Moreover, you will see from the output that the execution of the dot_product_mkl routine started on both sub-processes (mkl_dot--> appears twice) but never ended (<--mkl_dot is missing).
I have tested this using the most recent version of sparse_dot. For convenience, I also packaged this version along with the above-described test.py script for download: sparse_dot-release-with-test.zip
Further observations:
python3 test.py hang runs fine if sparse_dot_mkl is imported before loading sklearn.neighbors.
python3 test.py hang runs fine if instead of testenv the following Conda environment is used:
To my understanding, in the environment testenv-mkl MKL is preferred over OpenBLAS.