flatironinstitute / sparse_dot

Python wrapper for Intel Math Kernel Library (MKL) matrix multiplication
MIT License

`dot_product_mkl` hangs if `multiprocessing` is used and an `sklearn` module is loaded #7

Closed: kostrykin closed this 3 years ago

kostrykin commented 4 years ago

I have tested this on three different computers:

I used the following Conda environment:

name: testenv
channels:
 - anaconda
 - conda-forge
dependencies:
 - python>=3.6
 - numpy>=1.18
 - scipy
 - scikit-learn
 - mkl=2020.0

The execution of the function sparse_dot_mkl.dot_product_mkl hangs forever if the multiprocessing module is used and a specific scikit-learn module (sklearn.neighbors) is imported. I could not reproduce the behavior when only multiprocessing was used or when only sklearn.neighbors was imported; it happens only when both conditions are met.

I have written a minimal example which reproduces this behavior:

import sys
import numpy as np
import multiprocessing
import scipy.sparse

# sklearn.neighbors is imported only when 'hang' is passed on the command line.
if len(sys.argv) > 1 and sys.argv[1] == 'hang':
    import sklearn.neighbors

import sparse_dot_mkl as mkl

# Build a small random CSR matrix.
np.random.seed(0)
X = np.random.randn(10, 3)
X[X < 0.8] = 0
X = scipy.sparse.csr_matrix(X)
print(X.toarray())

def testfunc(*args):
    print('mkl_dot-->')
    mkl.dot_product_mkl(X, np.zeros(X.shape[1]))
    print('<--mkl_dot')

# Run testfunc in two worker processes.
pool = multiprocessing.Pool(processes=2)
for result in pool.imap_unordered(testfunc, range(5)):
    print(result)
pool.close()

Save it as test.py and try running python3 test.py. You will see that the execution terminates in less than a second. Then try running python3 test.py hang. You will see that the execution never ends. Moreover, the output shows that the dot_product_mkl routine starts in both sub-processes ('mkl_dot-->' appears twice) but never finishes ('<--mkl_dot' is missing).

I have tested this using the most recent version of sparse_dot. For convenience, I have also packaged this version together with the test.py script described above for download: sparse_dot-release-with-test.zip

Further observations:

asistradition commented 4 years ago

Well I can reproduce this (and I can reproduce it with pathos multiprocessing too). Off the top of my head I have no idea. Give me a couple days to work on it.

asistradition commented 4 years ago

So sklearn, when first imported, sets an environment variable as a workaround for something. I don't know why or what it's doing.

# Workaround issue discovered in intel-openmp 2019.5:
# https://github.com/ContinuumIO/anaconda-issues/issues/11294
os.environ.setdefault("KMP_INIT_AT_FORK", "FALSE")

This environment variable hangs MKL because it changes the behavior of Intel's OpenMP threading (the MKL dynamic link library reads all environment variables when it is first imported).
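
A quick way to see it (a minimal sketch for a fresh interpreter; it only assumes an affected scikit-learn release is installed):

import os

# In a fresh interpreter the variable is normally unset.
print('KMP_INIT_AT_FORK' in os.environ)

# Merely importing sklearn sets it; MKL's runtime will read it later,
# when the MKL shared library is loaded for the first time.
import sklearn
print(os.environ.get('KMP_INIT_AT_FORK'))  # "FALSE" on affected sklearn versions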

I can resolve this issue by unsetting the environment variable between importing sklearn and importing sparse_dot_mkl.

import os
import sklearn

del os.environ['KMP_INIT_AT_FORK']
import sparse_dot_mkl as mkl
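
If sklearn is pulled in indirectly by another dependency rather than imported directly, a slightly more defensive variant of the same idea might look like this (a hypothetical sketch; os.environ.pop simply avoids a KeyError when the variable was never set):

import os
import sklearn.neighbors  # or any dependency that imports sklearn internally

# Hypothetical defensive variant of the workaround above: drop the variable
# if it is present, no matter which import set it, before the MKL shared
# library is loaded for the first time.
os.environ.pop('KMP_INIT_AT_FORK', None)

import sparse_dot_mkl as mkl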

I will also write a workaround for this but I might not get to it for a couple days.

asistradition commented 4 years ago

I fixed this in cdcf7c1300002e7b27b435f7f5da7cdda4eeb905 and released it as v0.5.3. I'm going to leave this open for now because the fix is ugly.
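
For anyone checking from downstream, a minimal sketch to confirm the installed release includes the fix (assumes the package was installed from PyPI as sparse_dot_mkl and Python 3.8+ for importlib.metadata):

from importlib.metadata import version

# Expect 0.5.3 or later for the fix described above.
print(version('sparse_dot_mkl'))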

leycec commented 3 years ago

@asistradition: Despite the ugliness of the kludge in cdcf7c1, it might be useful to finally close this. After all, that commit does technically resolve the issue – right?

I suggest this because our team initially dropped sparse_dot from consideration after seeing this critical open issue about non-deterministic infinite loops. Of course, I then read the issue in its entirety and realized it isn't actually an issue anymore.

Thanks for all the continued hard work on sparse_dot. We're slowly migrating away from single-threaded scipy.sparse.linalg solvers ourselves, so everything here looks immensely relevant and awesome!

tl;dr

We'll probably use sparse_dot now. But we weren't going to, because of this open issue which isn't actually an issue. :man_shrugging:

leycec commented 3 years ago

Thanks, @kostrykin! You're all the awesome. :+1: