dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License
1.56k stars 715 forks source link

Low-level profiler issue on OSX #2608

Open pentschev opened 5 years ago

pentschev commented 5 years ago

@mrocklin encountered the following issue on OSX when attempting to use the low-level profiler:

In [1]: from dask.distributed import Client

In [2]: client = Client(low_level_profiler=True)

In [3]: import dask.array as da

In [4]: x = da.random.random((10000, 10000), chunks=(1000, 1000))

In [5]: (x + x.T - x.mean(axis=0)).sum().compute()
libunwind: malformed __unwind_info at 0x116CF648C bad second level page  # <--- here
Out[5]: 50003173.58741013

The error message comes from LLVM's libunwind, this is possible to fail also on Linux if we attempt using LLVM.

It's not clear at the moment if this is a libunwind bug, stacktrace or from Dask distributed.

mrocklin commented 5 years ago

@sklam do you have any thoughts on how best to proceed with this?

sklam commented 5 years ago

I believe the error comes from https://github.com/llvm/llvm-project/blob/release/4.x/libunwind/src/UnwindCursor.hpp#L1121 and it should be harmless. We can check if there's a way to turn off the debug log from libunwind. It's also possible to try newer libunwind version. Currently, we are getting libunwind=4.0.1 from conda's defaults.

In additional, stacktrace is guarding any segmentation fault from libunwind due to misinterpretion of the unwind info. See https://github.com/numba/stacktrace/blob/e472bb124f266a64568326756033f23b5a6205b6/stacktrace/bt.c#L166-L167. So, even if libunwind fails to read the unwind info, stacktrace will not crash. It will just return truncated stack info.

sklam commented 5 years ago

I cannot replicate the log message. Can you share your conda-environment listing? conda list --export --explicit

pentschev commented 5 years ago

Sure, here's mine:

# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: osx-64
@EXPLICIT
https://repo.anaconda.com/pkgs/main/osx-64/blas-1.0-mkl.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/bzip2-1.0.6-h1de35cc_5.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/ca-certificates-2019.1.23-0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/intel-openmp-2019.3-199.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/jpeg-9b-he5867d9_2.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/libcxxabi-4.0.1-hcfea43d_1.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/libgfortran-3.0.1-h93005f0_2.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/xz-5.2.4-h1de35cc_4.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/yaml-0.1.7-hc338f04_2.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/zlib-1.2.11-h1de35cc_3.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/libcxx-4.0.1-hcfea43d_1.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/libpng-1.6.36-ha441bb4_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/mkl-2019.3-199.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/openssl-1.1.1b-h1de35cc_1.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/tk-8.6.8-ha441bb4_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/zstd-1.3.7-h5bba6e5_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/expat-2.2.6-h0a44026_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/freetype-2.9.1-hb4e5f40_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/libffi-3.2.1-h475c297_4.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/libssh2-1.8.0-ha12b0ac_4.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/libtiff-4.0.10-hcb84e12_2.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/libunwind-4.0.1-hcfea43d_1.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/ncurses-6.1-h0a44026_1.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/rhash-1.3.8-ha12b0ac_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/libedit-3.1.20181209-hb402a30_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/readline-7.0-h1de35cc_5.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/krb5-1.16.1-hddcf347_7.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/sqlite-3.27.2-ha441bb4_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/libcurl-7.64.0-h051b688_2.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/python-3.7.3-h359304d_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/atomicwrites-1.3.0-py37_1.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/attrs-19.1.0-py37_1.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/certifi-2019.3.9-py37_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/click-7.0-py37_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/cloudpickle-0.8.0-py37_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/cmake-3.14.0-haff7e42_0.tar.bz2
https://repo.anaconda.com/pkgs/main/noarch/dask-core-1.1.5-py_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/heapdict-1.0.0-py37_2.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/llvmlite-0.28.0-py37h8c7ce04_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/locket-0.2.0-py37_1.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/markupsafe-1.1.1-py37h1de35cc_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/more-itertools-6.0.0-py37_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/msgpack-python-0.6.1-py37h04f5b5a_1.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/olefile-0.46-py37_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/pluggy-0.9.0-py37_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/psutil-5.6.1-py37h1de35cc_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/py-1.8.0-py37_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/pyparsing-2.3.1-py37_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/pytz-2018.9-py37_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/pyyaml-5.1-py37h1de35cc_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/six-1.12.0-py37_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/sortedcontainers-2.1.0-py37_0.tar.bz2
https://conda.anaconda.org/numba/osx-64/stacktrace-0.1dev0-py37h1de35cc_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/tblib-1.3.2-py37_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/toolz-0.9.0-py37_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/tornado-6.0.2-py37h1de35cc_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/cytoolz-0.9.0.1-py37h1de35cc_1.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/packaging-19.0-py37_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/partd-0.3.10-py37_1.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/pillow-5.4.1-py37hb68e598_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/python-dateutil-2.8.0-py37_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/setuptools-40.8.0-py37_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/zict-0.1.4-py37_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/distributed-1.26.1-py37_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/jinja2-2.10-py37_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/pytest-4.3.1-py37_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/wheel-0.33.1-py37_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/pip-19.0.3-py37_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/bokeh-1.1.0-py37_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/mkl_fft-1.0.10-py37h5e564d8_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/mkl_random-1.0.2-py37h27c97d8_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/numba-0.43.1-py37h6440ff4_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/numpy-1.16.2-py37hacdab7b_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/numpy-base-1.16.2-py37h6575580_0.tar.bz2
https://repo.anaconda.com/pkgs/main/osx-64/pandas-0.24.2-py37h0a44026_0.tar.bz2
https://repo.anaconda.com/pkgs/main/noarch/dask-1.1.5-py_0.tar.bz2
pentschev commented 5 years ago

I also noticed that this issue is not deterministic, it's more likely to get it after a few (~5) iterations.

This issue may be harmless as you said @sklam, but it always freezes for me after some iterations of the (x + x.T - x.mean(axis=0)).sum().compute() line from @mrocklin's example.

pentschev commented 5 years ago

I've just compiled libunwind from llvm's master branch and then installed stacktrace linking to that, but I still get the same issues:

libunwind: malformed __unwind_info at 0x1045ABE88 bad second level page
49994648.18507994
libunwind: malformed __unwind_info at 0x7FFF78135240 bad second level page
49994648.18507994
libunwind: malformed __unwind_info at 0x7FFF78135240 bad second level page
49994648.18507994
49994648.18507994
49994648.18507994
49994648.18507994
49994648.18507994
49994648.18507994
49994648.18507994
49994648.18507994
49994648.18507994
libunwind: malformed __unwind_info at 0x1045ABE88 bad second level page
49994648.18507994
49994648.18507994
49994648.18507994
libunwind: malformed __unwind_info at 0x7FFF78135240 bad second level page

The process got stuck at this point, it was supposed to run 100 iterations, but froze after about 15.

I decided to attach lldb to the process, this is what I see immediately:

(lldb) process attach --pid 6843

_bt.cpython-37m-darwin.so was compiled with optimization - stepping may behave oddly; variables may not be available.
Process 6843 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x0000000114b54ab0 _bt.cpython-37m-darwin.so`backtrace_thread [inlined] _wait_and_reset_signal at bt.c:182:35 [opt]
   179  static
   180  void _wait_and_reset_signal(struct sigaction *old_sa) {
   181    // spin and wait.
-> 182    while (TheSignalConfig.spinlock == 0);
   183    // reset signal handlers
   184    sigaction(SIG_STACKTRACE, old_sa, NULL);
   185  }
Target 0: (python) stopped.

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x0000000114b54ab0 _bt.cpython-37m-darwin.so`backtrace_thread [inlined] _wait_and_reset_signal at bt.c:182:35 [opt]
    frame #1: 0x0000000114b54aab _bt.cpython-37m-darwin.so`backtrace_thread(tid=0x00007000054dd000, cb=(0x0000000103022fc0), maxdepth=100) at bt.c:215 [opt]
    frame #2: 0x000000010301e884 libffi.6.dylib`ffi_call_unix64 + 76
    frame #3: 0x000000010301de8b libffi.6.dylib`ffi_call + 939
    frame #4: 0x0000000103002c48 _ctypes.cpython-37m-darwin.so`_ctypes_callproc + 632
    frame #5: 0x0000000102ffce9c _ctypes.cpython-37m-darwin.so`PyCFuncPtr_call + 268
    frame #6: 0x0000000101cabae3 python`_PyObject_FastCallKeywords + 179
    frame #7: 0x0000000101de8ed5 python`call_function + 453
    frame #8: 0x0000000101de6aec python`_PyEval_EvalFrameDefault + 46092
    frame #9: 0x0000000101dda49e python`_PyEval_EvalCodeWithName + 414
    frame #10: 0x0000000101cabfe3 python`_PyFunction_FastCallKeywords + 195
    frame #11: 0x0000000101de8dc7 python`call_function + 183
    frame #12: 0x0000000101de6be0 python`_PyEval_EvalFrameDefault + 46336
    frame #13: 0x0000000101cab8d5 python`function_code_fastcall + 117
    frame #14: 0x0000000101de8dc7 python`call_function + 183
    frame #15: 0x0000000101de6aec python`_PyEval_EvalFrameDefault + 46092
    frame #16: 0x0000000101dda49e python`_PyEval_EvalCodeWithName + 414
    frame #17: 0x0000000101cabfe3 python`_PyFunction_FastCallKeywords + 195
    frame #18: 0x0000000101de8dc7 python`call_function + 183
    frame #19: 0x0000000101de6b27 python`_PyEval_EvalFrameDefault + 46151
    frame #20: 0x0000000101dda49e python`_PyEval_EvalCodeWithName + 414
    frame #21: 0x0000000101cabfe3 python`_PyFunction_FastCallKeywords + 195
    frame #22: 0x0000000101de8dc7 python`call_function + 183
    frame #23: 0x0000000101de6aec python`_PyEval_EvalFrameDefault + 46092
    frame #24: 0x0000000101cab8d5 python`function_code_fastcall + 117
    frame #25: 0x0000000101caece2 python`method_call + 130
    frame #26: 0x0000000101cac752 python`PyObject_Call + 130
    frame #27: 0x0000000101ea937d python`partial_call + 285
    frame #28: 0x0000000101cabae3 python`_PyObject_FastCallKeywords + 179
    frame #29: 0x0000000101de8ed5 python`call_function + 453
    frame #30: 0x0000000101de6b27 python`_PyEval_EvalFrameDefault + 46151
    frame #31: 0x0000000101cab8d5 python`function_code_fastcall + 117
    frame #32: 0x0000000101caece2 python`method_call + 130
    frame #33: 0x0000000101cabae3 python`_PyObject_FastCallKeywords + 179
    frame #34: 0x0000000101e353aa python`context_run + 106
    frame #35: 0x0000000101cab27f python`_PyMethodDef_RawFastCallDict + 335
    frame #36: 0x0000000101cac8fd python`PyCFunction_Call + 61
    frame #37: 0x0000000101de6cd7 python`_PyEval_EvalFrameDefault + 46583
    frame #38: 0x0000000101cab8d5 python`function_code_fastcall + 117
    frame #39: 0x0000000101de8dc7 python`call_function + 183
    frame #40: 0x0000000101de6a56 python`_PyEval_EvalFrameDefault + 45942
    frame #41: 0x0000000101cab8d5 python`function_code_fastcall + 117
    frame #42: 0x0000000101de8dc7 python`call_function + 183
    frame #43: 0x0000000101de6a56 python`_PyEval_EvalFrameDefault + 45942
    frame #44: 0x0000000101cab8d5 python`function_code_fastcall + 117
    frame #45: 0x0000000101de8dc7 python`call_function + 183
    frame #46: 0x0000000101de6a56 python`_PyEval_EvalFrameDefault + 45942
    frame #47: 0x0000000101cab8d5 python`function_code_fastcall + 117
    frame #48: 0x0000000101de8dc7 python`call_function + 183
    frame #49: 0x0000000101de6a56 python`_PyEval_EvalFrameDefault + 45942
    frame #50: 0x0000000101dda49e python`_PyEval_EvalCodeWithName + 414
    frame #51: 0x0000000101cabfe3 python`_PyFunction_FastCallKeywords + 195
    frame #52: 0x0000000101de8dc7 python`call_function + 183
    frame #53: 0x0000000101de6a56 python`_PyEval_EvalFrameDefault + 45942
    frame #54: 0x0000000101dda49e python`_PyEval_EvalCodeWithName + 414
    frame #55: 0x0000000101caade7 python`_PyFunction_FastCallDict + 231
    frame #56: 0x0000000101caece2 python`method_call + 130
    frame #57: 0x0000000101cac752 python`PyObject_Call + 130
    frame #58: 0x0000000101de6d58 python`_PyEval_EvalFrameDefault + 46712
    frame #59: 0x0000000101cab8d5 python`function_code_fastcall + 117
    frame #60: 0x0000000101caece2 python`method_call + 130
    frame #61: 0x0000000101cac752 python`PyObject_Call + 130
    frame #62: 0x0000000101de6d58 python`_PyEval_EvalFrameDefault + 46712
    frame #63: 0x0000000101cab8d5 python`function_code_fastcall + 117
    frame #64: 0x0000000101de8dc7 python`call_function + 183
    frame #65: 0x0000000101de6a56 python`_PyEval_EvalFrameDefault + 45942
    frame #66: 0x0000000101cab8d5 python`function_code_fastcall + 117
    frame #67: 0x0000000101de8dc7 python`call_function + 183
    frame #68: 0x0000000101de6a56 python`_PyEval_EvalFrameDefault + 45942
    frame #69: 0x0000000101cab8d5 python`function_code_fastcall + 117
    frame #70: 0x0000000101de8dc7 python`call_function + 183
    frame #71: 0x0000000101de6aec python`_PyEval_EvalFrameDefault + 46092
    frame #72: 0x0000000101cab8d5 python`function_code_fastcall + 117
    frame #73: 0x0000000101de8dc7 python`call_function + 183
    frame #74: 0x0000000101de6b27 python`_PyEval_EvalFrameDefault + 46151
    frame #75: 0x0000000101dda49e python`_PyEval_EvalCodeWithName + 414
    frame #76: 0x0000000101caade7 python`_PyFunction_FastCallDict + 231
    frame #77: 0x0000000101de6d58 python`_PyEval_EvalFrameDefault + 46712
    frame #78: 0x0000000101dda49e python`_PyEval_EvalCodeWithName + 414
    frame #79: 0x0000000101e40699 python`PyRun_StringFlags + 185
    frame #80: 0x0000000101e4052f python`PyRun_SimpleStringFlags + 143
    frame #81: 0x0000000101e6a2c8 python`pymain_main + 6984
    frame #82: 0x0000000101c7e66d python`main + 125
    frame #83: 0x00007fff77f4c3d5 libdyld.dylib`start + 1
    frame #84: 0x00007fff77f4c3d5 libdyld.dylib`start + 1
sklam commented 5 years ago

If it is freezing, it is bad.

Were there other threads running. The mainthread is waiting (spinlocked) for the "profiled" thread to finish.

pentschev commented 5 years ago

Yes, there are other threads. Please see https://gist.github.com/pentschev/0e13fa1fef01565c1c63101db1e22d68. I think the only threads running some stacktrace code at that moment are threads 1, 4 and 6.