Open rmg55 opened 4 years ago
Thanks for the copy-pastable example. I'm not able to reproduce locally. Can you reproduce it in a fresh environment?
Hi Tom,
Thanks for the quick response! I am working within a singularity container (without write permissions) on an HPC. Here is a link to the image I am using - Dockerfile
Any suggestions on how I might be able to further debug (I struggle when trying to debug segmentation faults)? I will try to reproduce in a fresh environment, but thought I would pass along the image in case you would like to try it.
Thanks. I'm not sure why there would be a segfault, but it likely isn't from Dask. We're just coordinating calls to scikit-learn here.
You might watch the dashboard and see if anything strange happens before the worker dies (perhaps suddenly high memory usage and the job scheduler kills the worker? Some HPC systems don't let you easily spill to disk).
Hi @TomAugspurger,
I can confirm that I am able to run the example successfully in a local conda environment. However, I am still having issues running the example in the singularity image (DockerHub Image).
I get the same errors when I try:
```
singularity exec docker://rowangaffney/data_science_im_rs:latest /opt/conda/envs/py_geo/bin/python min_example.py
```
This is probably out of scope for dask-ml, but thought I should post my update on the issue. If you have any further ideas/directions on how to debug, that would be great - otherwise, feel free to close and I can try with the singularity project.
Thanks for the update.
I'm not especially sure where to go next for debugging... You might try with different schedulers:

```python
import dask

# 1. multiple threads in a single process
dask.config.set(scheduler="threads")

# 2. everything in a single thread
dask.config.set(scheduler="single-threaded")
```

If 1 passes, that tells us there's (maybe) some issue with communication / coordination between processes.
If 1 fails but 2 passes, that tells us there's an issue with using this scikit-learn code from multiple threads.
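The scheduler comparison above can be exercised end-to-end with a toy computation standing in for the failing pipeline; this is a minimal sketch using a small dask array rather than the original scikit-learn code:

```python
import dask
import dask.array as da

# A toy computation standing in for the failing pipeline; the random seed is
# baked into the task graph, so both runs compute identical values.
x = da.random.random((1000, 10), chunks=(100, 10))
total = x.sum()

# Scheduler 1: multiple threads in one process.
with dask.config.set(scheduler="threads"):
    r1 = total.compute()

# Scheduler 2: everything in a single thread, no coordination at all.
with dask.config.set(scheduler="single-threaded"):
    r2 = total.compute()

print(r1, r2)
```

If one scheduler crashes and the other doesn't, that narrows down where the fault lives.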
Thanks @TomAugspurger
1 works, but 2 fails (see below). I guess that suggests that there is an issue with the communication / coordination between processes. Seems odd that the SelectKBest works, but the PCA does not...
I am running this in an HPC environment (via SLURM and JupyterHub) within a singularity container. When launching the container, I am bind mounting the following.
```
--bind /etc/munge --bind /var/log/munge --bind /var/run/munge --bind /usr/bin/gdb \
--bind /usr/bin/squeue --bind /usr/bin/scancel --bind /usr/bin/sbatch \
--bind /usr/bin/scontrol --bind /usr/bin/sinfo --bind /system/slurm:/etc/slurm \
--bind /run/munge --bind /usr/lib64 --bind /scinet01 --bind $HOME \
--bind /software/7/apps/envi -H $HOME:/home/jovyan
```
However, when I run it with the single-threaded scheduler, it fails.
Shot in the dark: can you try disabling spill to disk? https://jobqueue.dask.org/en/latest/configuration-setup.html#no-local-storage
hmm still seeing the same issue when I avoid spilling to disk by:
```python
dask.config.set({'distributed.worker.memory.target': False,
                 'distributed.worker.memory.spill': False})
cluster = LocalCluster(n_workers=5, threads_per_worker=2)
client = Client(cluster)
dask.config.config
```
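Since the failure involves workers dying, it may also be worth confirming that the client, scheduler, and workers all agree on package versions. A minimal sketch (it uses an in-process cluster via `processes=False` so it is safe to paste into any script; the config keys match the ones used above):

```python
import dask
from dask.distributed import Client, LocalCluster

# Disable spill-to-disk via config, as above.
dask.config.set({"distributed.worker.memory.target": False,
                 "distributed.worker.memory.spill": False})

# An in-process cluster; get_versions(check=True) raises if client,
# scheduler, and worker package versions disagree -- a common cause of
# mysterious worker deaths.
cluster = LocalCluster(n_workers=2, threads_per_worker=1, processes=False)
client = Client(cluster)
try:
    versions = client.get_versions(check=True)
finally:
    client.close()
    cluster.close()
```

With a real multi-node deployment, the same call compares the environment each worker actually imported, not just what was requested.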
I did see the following in one of the worker logs:

```python
cluster.logs()
```
Hmm I'm not sure what to try next :/
On Thu, Apr 2, 2020 at 6:24 PM Rowan Gaffney notifications@github.com wrote:
Reopened #629 https://github.com/dask/dask-ml/issues/629.
Ok, thanks for your help @TomAugspurger. I'll close, as I think this may be an issue with the singularity container or related to the HPC system.
Noting this over here from https://github.com/sylabs/singularity/issues/5259, so there's a pointer in case others come across it here.
It looks like you are binding the entire /usr/lib64 host library directory into the container:

```
--bind /usr/lib64
```
This will almost certainly cause issues, including segfaults, unless the container OS exactly matches the host: the executables in the container expect to use the libraries from the container, not the ones from the host, which will be a different version / built differently.
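One way to check whether host libraries are actually being picked up inside the container is to list the shared objects the running interpreter has mapped. A Linux-only sketch (the `mapped_libraries` helper is just for illustration, not part of any library):

```python
import os

def mapped_libraries(maps_text):
    """Extract unique shared-object paths from /proc/<pid>/maps content."""
    libs = set()
    for line in maps_text.splitlines():
        parts = line.split()
        # Keep only mappings whose final field is a path to a .so file.
        if parts and "/" in parts[-1] and ".so" in parts[-1]:
            libs.add(parts[-1])
    return sorted(libs)

# On Linux, inspect the current process: inside the container, any library
# resolving to a bind-mounted host path like /usr/lib64 is a red flag.
if os.path.exists("/proc/self/maps"):
    with open("/proc/self/maps") as f:
        for lib in mapped_libraries(f.read()):
            print(lib)
```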
Also - when you run Singularity containers with python apps, python packages installed in your $HOME with `pip install --user` can interfere. Try `--contain` to avoid that.
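A quick way to see whether `pip install --user` packages from the host `$HOME` are visible to the interpreter (standard-library calls only):

```python
import site
import sys

# Packages installed on the host with `pip install --user` live in the user
# site directory under $HOME; if that directory is enabled and on sys.path
# inside the container, host-installed packages can shadow the container's own.
user_site = site.getusersitepackages()
print("user site directory:", user_site)
print("user site enabled:  ", site.ENABLE_USER_SITE)
print("on sys.path:        ", user_site in sys.path)
```

Running with `--contain` (or `python -s`) should make the user site directory disappear from `sys.path`.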
I was able to do some more debugging that might help with diagnosing the issue. Using gdb to examine the segfaults (via the core dump files), I get the following backtraces. @TomAugspurger, @dctrud, and @ynanyam, any idea whether this is an issue within Singularity (i.e. shared libraries mismatched between host and container) or a Python library issue? Thanks!
Running:

```
gdb /opt/conda/envs/py_geo/bin/python core.XXXX
```

and issuing `bt`, I get:

```
#13 _PyEval_EvalCodeWithName (_co=, globals=, locals=, args=, argcount=, kwnames=0x0,
kwargs=0x2ac27b984e20, kwcount=, kwstep=1, defs=0x2ac27b09a6e8, defcount=1, kwdefs=0x0, closure=0x0, name='safe_sparse_dot', qualname='safe_sparse_dot')
at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Python/ceval.c:3930
#14 0x000055dcccf0e985 in _PyFunction_FastCallKeywords (func=, stack=0x2ac27b984e10, nargs=2, kwnames=)
at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Objects/call.c:433
#15 0x000055dcccf77216 in call_function (kwnames=0x0, oparg=, pp_stack=) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Python/ceval.c:4616
#16 _PyEval_EvalFrameDefault (f=, throwflag=) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Python/ceval.c:3124
#17 0x000055dcccebf8f9 in _PyEval_EvalCodeWithName (_co=, globals=, locals=, args=, argcount=, kwnames=0x0,
kwargs=0x2ac290353960, kwcount=, kwstep=1, defs=0x2ac27b09e478, defcount=2, kwdefs=0x0, closure=0x0, name='randomized_range_finder', qualname='randomized_range_finder')
at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Python/ceval.c:3930
#18 0x000055dcccf0e985 in _PyFunction_FastCallKeywords (func=, stack=0x2ac290353938, nargs=5, kwnames=)
at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Objects/call.c:433
#19 0x000055dcccf77216 in call_function (kwnames=0x0, oparg=, pp_stack=) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Python/ceval.c:4616
#20 _PyEval_EvalFrameDefault (f=, throwflag=) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Python/ceval.c:3124
#21 0x000055dcccebf8f9 in _PyEval_EvalCodeWithName (_co=, globals=, locals=, args=, argcount=, kwnames=0x2ac27b6c5cc8,
kwargs=0x55dcd02dc748, kwcount=, kwstep=1, defs=0x2ac27b09d458, defcount=6, kwdefs=0x0, closure=0x0, name='randomized_svd', qualname='randomized_svd')
at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Python/ceval.c:3930
#22 0x000055dcccf0e9e7 in _PyFunction_FastCallKeywords (func=, stack=0x55dcd02dc740, nargs=1, kwnames=)
at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Objects/call.c:433
#23 0x000055dcccf782e7 in call_function (kwnames=('n_components', 'n_iter', 'flip_sign', 'random_state'), oparg=, pp_stack=)
at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Python/ceval.c:4616
#24 _PyEval_EvalFrameDefault (f=, throwflag=) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Python/ceval.c:3139
#25 0x000055dcccf0e75b in function_code_fastcall (globals=, nargs=4, args=, co=) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Objects/call.c:283
#26 _PyFunction_FastCallKeywords (func=, stack=0x2ac2900015b8, nargs=4, kwnames=) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Objects/call.c:408
#27 0x000055dcccf774a0 in call_function (kwnames=0x0, oparg=, pp_stack=) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Python/ceval.c:4616
#28 _PyEval_EvalFrameDefault (f=, throwflag=) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Python/ceval.c:3110
#29 0x000055dcccf0e75b in function_code_fastcall (globals=, nargs=2, args=, co=) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Objects/call.c:283
#30 _PyFunction_FastCallKeywords (func=, stack=0x55dcd02952e0, nargs=2, kwnames=) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Objects/call.c:408
#31 0x000055dcccf774a0 in call_function (kwnames=0x0, oparg=, pp_stack=) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Python/ceval.c:4616
#32 _PyEval_EvalFrameDefault (f=, throwflag=) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Python/ceval.c:3110
#33 0x000055dcccebf8f9 in _PyEval_EvalCodeWithName (_co=, globals=, locals=, args=, argcount=, kwnames=0x0, kwargs=0x0,
kwcount=, kwstep=2, defs=0x2ac27b6c7968, defcount=1, kwdefs=0x0, closure=0x0, name='fit_transform', qualname='PCA.fit_transform')
at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Python/ceval.c:3930
#34 0x000055dcccec0a35 in _PyFunction_FastCallDict (func=, args=0x2ac278f9bf50, nargs=3, kwargs=)
at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Objects/call.c:376
#35 0x000055dcccedee03 in _PyObject_Call_Prepend (callable=, obj=, args=(, ),
kwargs={}) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Objects/call.c:908
#36 0x000055dccced175e in PyObject_Call (callable=, args=, kwargs=)
at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Objects/call.c:245
#37 0x000055dcccf78d6a in do_call_core (kwdict={}, callargs=(, ), func=)
at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Python/ceval.c:4645
#38 _PyEval_EvalFrameDefault (f=, throwflag=) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Python/ceval.c:3191
#39 0x000055dcccebf8f9 in _PyEval_EvalCodeWithName (_co=, globals=, locals=, args=, argcount=, kwnames=0x0, kwargs=0x0,
kwcount=, kwstep=2, defs=0x2ac29d66d0c8, defcount=4, kwdefs=0x0, closure=0x0, name='fit_transform', qualname='fit_transform')
at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Python/ceval.c:3930
#40 0x000055dcccec0a35 in _PyFunction_FastCallDict (func=, args=0x2ac27b726cc8, nargs=7, kwargs=)
at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Objects/call.c:376
#41 0x000055dcccf78d6a in do_call_core (kwdict=0x0,
callargs=(, iterated_power='auto', random_state=None) at remote 0x2ac278b6b250>, , , 'raise', ['n_components'], (90,), None), func=)
at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Python/ceval.c:4645
#42 _PyEval_EvalFrameDefault (f=, throwflag=) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Python/ceval.c:3191
#43 0x000055dcccec096b in function_code_fastcall (globals=, nargs=1, args=, co=0x2ac2780a19c0) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Objects/call.c:283
#44 _PyFunction_FastCallDict (func=, args=0x2ac278b5d728, nargs=1, kwargs=) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Objects/call.c:322
#45 0x000055dcccf78d6a in do_call_core (kwdict={},
callargs=((, , iterated_power='auto', random_state=None) at remote 0x2ac278b6b250>, (, , ), (, ), (, )], pairwise=False, cache={(0, True, True): , (0, False, True): , (1, True, True): , (1, False, True): }, num_train_samples=1200) at remote 0x2ac27b73e0d0>, , , True, True, 1), (, <...>, ) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Python/ceval.c:4645
#46 _PyEval_EvalFrameDefault (f=, throwflag=) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Python/ceval.c:3191
#47 0x000055dcccec096b in function_code_fastcall (globals=, nargs=8, args=, co=0x2ac2780a1d20) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Objects/call.c:283
#48 _PyFunction_FastCallDict (func=, args=0x2ac29d6814e8, nargs=8, kwargs=) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Objects/call.c:322
#49 0x000055dcccf78d6a in do_call_core (kwdict={},
callargs=(, ((, , iterated_power='auto', random_state=None) at remote 0x2ac278b6b250>, (, , ), (, ), (, )], pairwise=False, cache={(0, True, True): , (0, False, True): , (1, True, True): , (1, False, True): }, num_train_samples=1200) at remote 0x2ac27b73e0d0>, , , True, True, 1), (, <...>, ...(truncated),
func=) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Python/ceval.c:4645
#50 _PyEval_EvalFrameDefault (f=, throwflag=) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Python/ceval.c:3191
#51 0x000055dcccf0e75b in function_code_fastcall (globals=, nargs=1, args=, co=) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Objects/call.c:283
#52 _PyFunction_FastCallKeywords (func=, stack=0x2ac27b72d5e8, nargs=1, kwnames=) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Objects/call.c:408
#53 0x000055dcccf774a0 in call_function (kwnames=0x0, oparg=, pp_stack=) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Python/ceval.c:4616
#54 _PyEval_EvalFrameDefault (f=, throwflag=) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Python/ceval.c:3110
#55 0x000055dcccec096b in function_code_fastcall (globals=, nargs=2, args=, co=0x2ac277d266f0) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Objects/call.c:283
#56 _PyFunction_FastCallDict (func=, args=0x2ac278940f18, nargs=2, kwargs=) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Objects/call.c:322
#57 0x000055dcccf78d6a in do_call_core (kwdict={},
callargs=(, mutex=<_thread.lock at remote 0x2ac2781f6fc0>, not_empty=, acquire=, release=, _waiters=) at remote 0x2ac278461f90>, not_full=, acquire=, release=, _waiters=) at remote 0x2ac278461410>, all_tasks_done=, acquire=, release=) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Python/ceval.c:4645
#58 _PyEval_EvalFrameDefault (f=, throwflag=) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Python/ceval.c:3191
#59 0x000055dcccf0e75b in function_code_fastcall (globals=, nargs=1, args=, co=) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Objects/call.c:283
#60 _PyFunction_FastCallKeywords (func=, stack=0x2ac290000cd0, nargs=1, kwnames=) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Objects/call.c:408
#61 0x000055dcccf774a0 in call_function (kwnames=0x0, oparg=, pp_stack=) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Python/ceval.c:4616
#62 _PyEval_EvalFrameDefault (f=, throwflag=) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Python/ceval.c:3110
#63 0x000055dcccf0e75b in function_code_fastcall (globals=, nargs=1, args=, co=) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Objects/call.c:283
#64 _PyFunction_FastCallKeywords (func=, stack=0x2ac27b5c9ad8, nargs=1, kwnames=) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Objects/call.c:408
#65 0x000055dcccf774a0 in call_function (kwnames=0x0, oparg=, pp_stack=) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Python/ceval.c:4616
#66 _PyEval_EvalFrameDefault (f=, throwflag=) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Python/ceval.c:3110
#67 0x000055dcccec096b in function_code_fastcall (globals=, nargs=1, args=, co=0x2ac26c17a390) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Objects/call.c:283
#68 _PyFunction_FastCallDict (func=, args=0x2ac278f9ce00, nargs=1, kwargs=) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Objects/call.c:322
#69 0x000055dcccedee03 in _PyObject_Call_Prepend (callable=, obj=, args=(), kwargs=0x0)
at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Objects/call.c:908
#70 0x000055dccced175e in PyObject_Call (callable=, args=, kwargs=)
at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Objects/call.c:245
#71 0x000055dcccfcf6a7 in t_bootstrap (boot_raw=0x2ac27894f2a0) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Modules/_threadmodule.c:994
#72 0x000055dcccf8a418 in pythread_wrapper (arg=) at /home/conda/feedstock_root/build_artifacts/python_1585001848288/work/Python/thread_pthread.h:174
#73 0x00002ac26add36db in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#74 0x00002ac26b10c88f in clone () from /lib/x86_64-linux-gnu/libc.so.6
```
Running `py-bt` gives the Python-level view of the same stack.

Thanks for the additional debugging, but it unfortunately doesn't give me any new guesses :/
There seems to be an issue with sklearn PCA + pipeline and dask-ml GridSearchCV. Please see my example below. Apologies if I am totally missing something.
Relevant Versions:
```
dask:    2.12.0
dask_ml: 1.2.0
sklearn: 0.22.2.post1
```
Minimal Example:
Running the example produces several core dump files and the following error:
```
0.18358307544807187
0.16009973534039232
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
('score-f44a4381cb4779b9d45ba2c0ba7c2a72', 15, 1) has failed... retrying
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
```
Dask Distributed worker / scheduler logs
```
{'Scheduler': 'distributed.scheduler - INFO - Clear task state\n'
              'distributed.scheduler - INFO - Scheduler at: tcp://127.0.0.1:41727\n'
              'distributed.scheduler - INFO - dashboard at: 127.0.0.1:8787\n'
              'distributed.scheduler - INFO - Register worker
```
Results of gdb on core dump file:
```
Core was generated by `/opt/conda/envs/py_geo/bin/python -c from multiprocessing.forkserver import mai'.
Program terminated with signal 11, Segmentation fault.
```
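For segfaults like this, Python's standard-library `faulthandler` can print the Python stack of every thread at the moment the process receives SIGSEGV, which is often easier to read than the gdb backtrace. Enabling it in the script (or via the `PYTHONFAULTHANDLER=1` environment variable) costs almost nothing:

```python
import faulthandler
import sys

# Install handlers for SIGSEGV, SIGFPE, SIGABRT and SIGBUS that dump every
# thread's Python traceback to stderr before the process dies.
faulthandler.enable(file=sys.stderr, all_threads=True)
print("faulthandler enabled:", faulthandler.is_enabled())
```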