HazyResearch / deepdive

DeepDive
deepdive.stanford.edu

numpy/pandas concurrency issue when being imported in a udf #595

Closed xiaoling closed 7 years ago

xiaoling commented 7 years ago

I have experienced an issue with DeepDive when my UDF imports numpy (or pandas).

I was able to reproduce the issue in a simple app as follows. app.ddlog:

t1(
a text
).

t2(
b text
).

function a over rows like t1
  returns rows like t2
  implementation "udf/test.py" handles tsv lines.

t2 += a(s) :- t1(s).

udf/test.py:

#!/usr/bin/env python3
import sys
import pandas  # the same happens with `import numpy`

for line in sys.stdin:
    print(line.strip())

I got the following error message when I ran deepdive run:

2016-09-24 02:33:34.943052 OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
2016-09-24 02:33:34.943067 OpenBLAS blas_thread_init: RLIMIT_NPROC 1289417 current, 1289417 max
2016-09-24 02:33:34.943396 Failed to import the site module
2016-09-24 02:33:34.946417 bash: line 1: 38615 Segmentation fault      (core dumped) udf/test.py
2016-09-24 02:33:34.948102 Traceback (most recent call last):
2016-09-24 02:33:34.948135   File "udf/test.py", line 7, in <module>
2016-09-24 02:33:34.948153     import numpy
2016-09-24 02:33:34.948168   File "/home/xling/miniconda3/envs/dev/lib/python3.5/site-packages/numpy/__init__.py", line 184, in <module>
2016-09-24 02:33:34.948182     from . import add_newdocs
2016-09-24 02:33:34.948197   File "/home/xling/miniconda3/envs/dev/lib/python3.5/site-packages/numpy/add_newdocs.py", line 13, in <module>
2016-09-24 02:33:34.948213     from numpy.lib import add_newdoc
2016-09-24 02:33:34.948227   File "/home/xling/miniconda3/envs/dev/lib/python3.5/site-packages/numpy/lib/__init__.py", line 8, in <module>
2016-09-24 02:33:34.948241     from .type_check import *
2016-09-24 02:33:34.948256   File "/home/xling/miniconda3/envs/dev/lib/python3.5/site-packages/numpy/lib/type_check.py", line 11, in <module>
2016-09-24 02:33:34.948272     import numpy.core.numeric as _nx
2016-09-24 02:33:34.948292   File "/home/xling/miniconda3/envs/dev/lib/python3.5/site-packages/numpy/core/__init__.py", line 25, in <module>
2016-09-24 02:33:34.948306     from . import numeric
2016-09-24 02:33:34.948321   File "<frozen importlib._bootstrap>", line 969, in _find_and_load
2016-09-24 02:33:34.948338   File "<frozen importlib._bootstrap>", line 958, in _find_and_load_unlocked
2016-09-24 02:33:34.948354   File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
2016-09-24 02:33:34.948368   File "<frozen importlib._bootstrap_external>", line 661, in exec_module
2016-09-24 02:33:34.948380   File "<frozen importlib._bootstrap_external>", line 750, in get_code
2016-09-24 02:33:34.948396   File "<frozen importlib._bootstrap_external>", line 819, in get_data
2016-09-24 02:33:34.948416 MemoryError
2016-09-24 02:33:34.956605 bash: line 1: 38653 Segmentation fault      (core dumped) udf/test.py

Interestingly, at first I could only reproduce this issue on one machine (40 cores) but not on another machine (32 cores) with an almost identical environment.

@alldefector identified that it's a concurrency issue, and we narrowed it down to the compute-execute step. Basically, if the loop over input rows runs fast enough and two processes load numpy at the same time, the processes crash. We probably didn't observe it on the 32-core machine because that machine didn't run the loop fast enough. We stress-tested it with

# spawn many background python processes that all import numpy concurrently
export num_processes=1000
for i in $(seq $num_processes); do python -c 'import numpy' & done

and were able to reproduce the same issue on both machines.

We looked at the core dump using gdb

gdb `which python` /var/crash/core-python-11-1004-1004-9431-1474685177

and here is the stack trace:

(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00007ffb028ce259 in blas_memory_alloc ()
   from /home/xiao/miniconda3/envs/py3/lib/python3.5/site-packages/numpy/core/../.libs/libopenblasp-r0-39a31c03.2.18.so
#2  0x00007ffb028ce90b in blas_thread_server ()
   from /home/xiao/miniconda3/envs/py3/lib/python3.5/site-packages/numpy/core/../.libs/libopenblasp-r0-39a31c03.2.18.so
#3  0x00007ffb06840184 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#4  0x00007ffb05c5837d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Environment: Ubuntu 14.04, DeepDive master branch, numpy 1.11.1 with OpenBLAS.

A workaround is to add a random time delay before each numpy import, as sketched below. Any suggestions? @netj
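For illustration, a minimal sketch of that delay in udf/test.py; the 0-5 second range here is an arbitrary choice, not something we tuned:

#!/usr/bin/env python3
import random
import sys
import time

# Stagger the import so that concurrently started UDF processes are less
# likely to initialize OpenBLAS's thread pool at the same moment.
time.sleep(random.uniform(0, 5))
import numpy  # or pandas

for line in sys.stdin:
    print(line.strip())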

netj commented 7 years ago

@xiaoling I can't think of a better solution than the random delays. If numpy already handles parallelization internally, you could limit DeepDive's own parallelism with DEEPDIVE_NUM_PROCESSES=1.

xiaoling commented 7 years ago

Yeah, OPENBLAS_NUM_THREADS=1 seems to work without changing anything in DeepDive.
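For reference, the same thing can be done from inside the UDF; a sketch, assuming the variable is set before numpy is first imported, since OpenBLAS reads it when the library loads:

#!/usr/bin/env python3
import os
import sys

# Must be set before importing numpy: OpenBLAS reads OPENBLAS_NUM_THREADS at
# load time and otherwise spawns one thread per core in every UDF process,
# which is what exhausts RLIMIT_NPROC when many processes start at once.
os.environ["OPENBLAS_NUM_THREADS"] = "1"
import numpy

for line in sys.stdin:
    print(line.strip())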

netj commented 7 years ago

This is mainly an issue with numpy/OpenBLAS that DeepDive cannot do much about. Closing since a workaround is recorded here. Let's add this to the FAQ if it draws lots of reactions.