bloomberg / memray

Memray is a memory profiler for Python
https://bloomberg.github.io/memray/
Apache License 2.0

Possible deadlock caught with strace (double FUTEX_WAIT_PRIVATE) #669

Closed: hdante closed this issue 1 month ago

hdante commented 2 months ago

Is there an existing issue for this?

Current Behavior

Hello, I'm hitting a possible deadlock when executing memray, and it seems to show up in the strace output log. The relevant snippet is here:

(...)
41784 clone(child_stack=0x7fd4d91fefb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fd4d91ff9d0, tls=0x7fd4d91ff700, child_tidptr=0x7fd4d91ff9d0) = 41805
41784 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
41805 set_robust_list(0x7fd4d91ff9e0, 24 <unfinished ...>
41784 getrandom( <unfinished ...>
41805 <... set_robust_list resumed>)    = 0
41784 <... getrandom resumed>"\x06\x80\x35\x11\x1d\x77\x9c\xb0\x59\xb8\x96\x47\x4b\xc7\xe5\xe7\x66\xe8\x13\x26\xf6\xf9\x77\x27\x08\x1f\xe7\x9d\x28\x60\x2d\xa7", 32, GRND_NONBLOCK) = 32
41805 futex(0x7fd50c85a908, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
41784 open("/proc/sys/vm/overcommit_memory", O_RDONLY) = 5
41784 read(5, "0\n", 32)                = 2
41784 close(5)                          = 0
41784 futex(0x1a49990, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
41793 <... futex resumed>)              = -1 ETIMEDOUT (Connection timed out)
41793 futex(0x20c64a8, FUTEX_WAKE_PRIVATE, 1) = 0
41793 read(4, "290925 23905 8178 599 0 164167 0"..., 8191) = 33
41793 read(4, "", 8191)                 = 0
41793 lseek(4, 0, SEEK_SET)             = 0
41793 futex(0x1a49990, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
(...)

The actual program being run is rail-estimate, inside the load_slow_modules() function, from here:

https://github.com/linea-it/pz-compute/blob/main/rail_scripts/rail-estimate#L114

That function recursively loads a large number of libraries. If necessary, I can still find out more precisely the point where it happened, and whether some other library could be causing the deadlock instead.
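
Roughly speaking, the function boils down to importing a long list of heavy packages one by one; a minimal sketch of that kind of loader (illustrative only, the module names below are placeholders rather than the real list from rail-estimate) would be:

import importlib

# Hypothetical sketch of a load_slow_modules()-style helper: it imports a
# long list of heavy packages in sequence. Printing each name before the
# import makes it easier to see where the hang occurs under memray.
SLOW_MODULES = [
    "numpy",
    "scipy",
    "rail.estimation",
]

def load_slow_modules():
    for name in SLOW_MODULES:
        print("importing", name, flush=True)
        importlib.import_module(name)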

Expected Behavior

Without running memray, the load_slow_modules() function works normally and the Python modules are loaded successfully.

Steps To Reproduce

memray run -f -o memray.bin $ALTHOME/bin/rail-estimate -a lephare \
    $SCRATCH/input/hdf5/ref/objectTable_tract_5065_DC2_2_2i_runs_DP0_2_v23_0_1_PREOPS-905_step3_29_20220314T204515Z-part2.hdf5 output1.hdf5

Memray Version

1.13.4

Python Version

3.11

Operating System

Linux

Anything else?

Full strace log attached:

trace.gz

hdante commented 2 months ago

Correcting the source code branch:

https://github.com/linea-it/pz-compute/blob/lephare/rail_scripts/rail-estimate#L114

pablogsal commented 2 months ago

I cannot reproduce this:

python -m memray run -f -o memray.bin pz-compute/rail_scripts/rail-estimate -a lephare $SCRATCH/input/hdf5/ref/objectTable_tract_5065_DC2_2_2i_runs_DP0_2_v23_0_1_PREOPS-905_step3_29_20220314T204515Z-part2.hdf5 output1.hdf5
Writing profile results into memray.bin
Estimator algorithm: lephare
Estimator calibration file: estimator_lephare.pkl
Bins: 301
HDF5 group name: ""
Column template for magnitude data: "mag_{band}"
Column template for error data: "magerr_{band}"
Extra parameter file: "None"
Starting setup.
Loading all program modules...
Error: unable to import program modules.
No module named 'rail.estimation.algos.lephare'
[memray] Successfully generated profile results.

You can now generate reports from the stored allocation records.
Some example commands to generate reports:

/Users/pgalindo3/.pyenv/versions/memray/bin/python -m memray flamegraph memray.bin

memray finishes correctly here without problems. I don't have access to the $SCRATCH/input/hdf5/ref/objectTable_tract_5065_DC2_2_2i_runs_DP0_2_v23_0_1_PREOPS-905_step3_29_20220314T204515Z-part2.hdf5 file, but I'm not sure whether that matters...

pablogsal commented 2 months ago

I tried also after installing pz-rail-lephare:

memray run -f -o memray.bin pz-compute/rail_scripts/rail-estimate -a lephare $SCRATCH/input/hdf5/ref/objectTable_tract_5065_DC2_2_2i_runs_DP0_2_v23_0_1_PREOPS-905_step3_29_20220314T204515Z-part2.hdf5 output1.hdf5
Writing profile results into memray.bin
Estimator algorithm: lephare
Estimator calibration file: estimator_lephare.pkl
Bins: 301
HDF5 group name: ""
Column template for magnitude data: "mag_{band}"
Column template for error data: "magerr_{band}"
Extra parameter file: "None"
Starting setup.
Loading all program modules...
LEPHAREDIR is being set to the default cache directory:
/Users/pgalindo3/Library/Caches/lephare/data
More than 1Gb may be written there.
LEPHAREWORK is being set to the default cache directory:
/Users/pgalindo3/Library/Caches/lephare/work
Default work cache is already linked.
This is linked to the run directory:
/Users/pgalindo3/Library/Caches/lephare/runs/20240821T181721
Number of keywords read in the config file: 86
Number of keywords read in the config file: 86
[memray] Successfully generated profile results.

You can now generate reports from the stored allocation records.
Some example commands to generate reports:

/Users/pgalindo3/.pyenv/versions/3.11.6/envs/memray/bin/python3.11 -m memray flamegraph memray.bin

pablogsal commented 2 months ago

If there is more setup required, please give us all the instructions required to reproduce the problem, with all the package versions and additional files.

pablogsal commented 2 months ago

Can you also give us a stack trace using gdb of the place where the hang happens?

hdante commented 2 months ago

Hello, if you have a conda installation available, you may run the install-pz-rail script in a virtual environment to mirror my environment. If not, see if you can install these packages with pip:

pip install joblib
pip install scikit-learn
pip install scikit-learn-intelex
CFLAGS="-L $CONDA_PREFIX/lib -I $CONDA_PREFIX/include" pip install fitsio
CFLAGS="-O2 -DNDEBUG -std=gnu99" pip install pytdigest
pip install flexcode

# Install RAIL after all dependencies are installed
pip install pz-rail
pip install pz-rail-bpz
pip install pz-rail-flexzboost
pip install pz-rail-tpz
pip install pz-rail-gpz-v1
pip install pz-rail-lephare

I'll attach the input and calibration files here so that you may execute the program until the end, but it doesn't really matter for triggering the deadlock, which happens before the main estimation process.

objectTable_tract_5065_DC2_2_2i_runs_DP0_2_v23_0_1_PREOPS-905_step3_29_20220314T204515Z-part2.hdf5.gz estimator_lephare.pkl.gz

For the gdb stack trace, do you mean attaching to the process after it hangs?

hdante commented 2 months ago

For the lephare algorithm auxiliary data:

git clone https://github.com/lephare-photoz/lephare-data/ /Users/pgalindo3/Library/Caches/lephare/data

pablogsal commented 2 months ago

For the gdb stack trace, do you mean attaching to the process after it hangs?

Yeah, or run it under gdb, then Ctrl-C and type bt.
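
Either way works; something along these lines (the PID and script arguments are placeholders):

# Attach to the already-hung process:
gdb -p <pid-of-hung-python>
(gdb) bt

# Or run everything under gdb and interrupt it once it hangs:
gdb --args python -m memray run -f -o memray.bin <script> <args>
(gdb) run
... wait for the hang, then press Ctrl-C ...
(gdb) bt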

pablogsal commented 2 months ago

I still cannot reproduce. The function load_slow_modules finishes without crashing, but then there is some problem with your dependencies:

memray run -f -o memray.bin  pz-compute/rail_scripts/rail-estimate -a lephare \
    ./objectTable_tract_5065_DC2_2_2i_runs_DP0_2_v23_0_1_PREOPS-905_step3_29_20220314T204515Z-part2.hdf5.gz output1.hdf5
Writing profile results into memray.bin
Estimator algorithm: lephare
Estimator calibration file: estimator_lephare.pkl
Bins: 301
HDF5 group name: ""
Column template for magnitude data: "mag_{band}"
Column template for error data: "magerr_{band}"
Extra parameter file: "None"
Starting setup.
Loading all program modules...
LEPHAREDIR is being set to the default cache directory:
/home/pablogsal/.cache/lephare/data
More than 1Gb may be written there.
LEPHAREWORK is being set to the default cache directory:
/home/pablogsal/.cache/lephare/work
Default work cache is already linked.
This is linked to the run directory:
/home/pablogsal/.cache/lephare/runs/20240821T182524
Number of keywords read in the config file: 86
Number of keywords read in the config file: 86
Configuring estimator...
[memray] Successfully generated profile results.

You can now generate reports from the stored allocation records.
Some example commands to generate reports:

/home/pablogsal/.pyenv/versions/3.12.4/envs/memray/bin/python -m memray flamegraph memray.bin
Traceback (most recent call last):
  File "/home/pablogsal/.pyenv/versions/memray/bin/memray", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/pablogsal/github/memray/src/memray/commands/__init__.py", line 138, in main
    arg_values.entrypoint(arg_values, parser)
  File "/home/pablogsal/github/memray/src/memray/commands/run.py", line 339, in run
    _run_with_file_output(args)
  File "/home/pablogsal/github/memray/src/memray/commands/run.py", line 188, in _run_with_file_output
    _run_tracker(
  File "/home/pablogsal/github/memray/src/memray/commands/run.py", line 77, in _run_tracker
    runpy.run_path(args.script, run_name="__main__")
  File "<frozen runpy>", line 286, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "pz-compute/rail_scripts/rail-estimate", line 250, in <module>
    main()
  File "pz-compute/rail_scripts/rail-estimate", line 245, in main
    ctx = setup(cfg)
          ^^^^^^^^^^
  File "pz-compute/rail_scripts/rail-estimate", line 202, in setup
    estimator = setup_estimator(
                ^^^^^^^^^^^^^^^^
  File "pz-compute/rail_scripts/rail-estimate", line 172, in setup_estimator
    estimator = Estimator.make_stage(**all_args)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/pablogsal/.pyenv/versions/3.12.4/envs/memray/lib/python3.12/site-packages/ceci/stage.py", line 123, in make_stage
    stage = cls(kwcopy, comm=comm, aliases=aliases)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: LephareEstimator.__init__() got an unexpected keyword argument 'aliases'

but this happens after the point you mention, so I am not sure what can be wrong here. I'm going to need either a reproducer in a Dockerfile or your gdb trace.

hdante commented 2 months ago

I'm sorry, you're right: I installed lephare from the master branch. I have asked for a new release, but I think it's not out yet. The original issue is here:

https://github.com/lephare-photoz/lephare/issues/178#issuecomment-2297310118

So, confirming the exact lephare version used: 0.1.11.dev4+gff46ff8

Since your calls to load_slow_modules() always succeed, I think I'll need to isolate the exact point of the deadlock more precisely. Meanwhile, here is the gdb backtrace:

(base) [henrique.almeida@loginapl01 testlephare]$ gdb python
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/bin/python3.11...Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/bin/python3.11]
(no debugging symbols found)...done.
(gdb) run /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/bin/memray run -f -o memray.bin $ALTHOME/bin/rail-estimate -a lephare $SCRATCH/input/hdf5/ref/objectTable_tract_5065_DC2_2_2i_runs_DP0_2_v23_0_1_PREOPS-905_step3_29_20220314T204515Z-part2.hdf5 output1.hdf5
Starting program: /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/bin/python /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/bin/memray run -f -o memray.bin $ALTHOME/bin/rail-estimate -a lephare $SCRATCH/input/hdf5/ref/objectTable_tract_5065_DC2_2_2i_runs_DP0_2_v23_0_1_PREOPS-905_step3_29_20220314T204515Z-part2.hdf5 output1.hdf5
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_typing.cpython-311-x86_64-linux-gnu.so]
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/liblz4-af1653fb.so.1.8.3
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/02/bd76e3669fdcbd9723caa978eda65336967338.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libunwind-d5b5a318.so.8.0.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/ca/ae2c0f7bdf5eef40a96a35ea99f6555bf684e2.debug
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/bin/../lib/libstdc++.so.6]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/bin/../lib/libgcc_s.so.1]
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libssl-cd1d6220.so.1.0.2k
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/7c/bcb0322f585236d81b557ed95c708f98a20c33.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libcrypto-d3570994.so.1.0.2k
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/9b/03a5571e128c07bf955eb2bd8155311295ae50.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/liblzma-004595ca.so.5.2.2
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/f7/44796ea6fa4d8019414f5487ee85de74dc5adc.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libgssapi_krb5-497db0c6.so.2.2
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/0c/aec124d97114da40ddeb0fed1fad5d14c3d626.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libkrb5-fcafa220.so.3.3
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/52/c5b6279da9cd210e5d58d1b1e0e080e5c8b232.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libcom_err-2abe824b.so.2.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/2a/ddb65846a50ce45f0c9b62ea35dda62c6ad7a2.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libk5crypto-b1f99d5c.so.3.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/5f/f9d1075a8d5d62f77f5ce56c935fcd92c62efa.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libkrb5support-d0bcff84.so.0.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/77/9381063daecc27e8480c8f79f0651162586478.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libkeyutils-dfe70bd6.so.1.5
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/2e/01d5ac08c1280d013aab96b292ac58bc30a263.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libselinux-0922c95c.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/90/3a0bd0bfb4fee8c284f41beb9773ded94cbc52.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libpcre-9513aab5.so.1.2.0
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/9c/a3d11f018beeb719cdb34be800bf1641350d0a.debug
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/math.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_datetime.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_opcode.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/mmap.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/zlib.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/termios.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_bisect.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_random.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_sha512.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/array.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_decimal.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_bz2.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/../../libbz2.so.1.0]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_lzma.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_socket.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/select.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/fcntl.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_posixsubprocess.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_contextvars.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_heapq.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_ssl.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_struct.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/binascii.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_asyncio.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/unicodedata.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_uuid.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_pickle.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_hashlib.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_blake2.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_json.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_csv.cpython-311-x86_64-linux-gnu.so]
Writing profile results into memray.bin
[New Thread 0x7fffe152d700 (LWP 27879)]
Estimator algorithm: lephare
Estimator calibration file: estimator_lephare.pkl
Bins: 301
HDF5 group name: ""
Column template for magnitude data: "mag_{band}"
Column template for error data: "magerr_{band}"
Extra parameter file: "None"
Starting setup.
Loading all program modules...
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_lsprof.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/resource.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/pyexpat.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_queue.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/grp.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so]
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/mkl/../../../libiomp5.so
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/82/88a89ba4a984ed547d3ee90f99fda1d695461e.debug
[New Thread 0x7fffba3ff700 (LWP 27913)]
^C
Program received signal SIGINT, Interrupt.
0x00007fffed6d554d in __lll_lock_wait () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install glibc-2.17-326.el7_9.x86_64
(gdb) bt
#0  0x00007fffed6d554d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fffed6d0e9b in _L_lock_883 () from /lib64/libpthread.so.0
#2  0x00007fffed6d0d68 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007fffe6420c8a in __gthread_mutex_lock (__mutex=<optimized out>) at /opt/rh/devtoolset-10/root/usr/include/c++/10/x86_64-redhat-linux/bits/gthr-default.h:749
#4  lock (this=<optimized out>) at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/std_mutex.h:100
#5  lock (this=0x7fffffff7c90) at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/unique_lock.h:138
#6  unique_lock (__m=..., this=<optimized out>, this=<optimized out>, __m=...) at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/unique_lock.h:68
#7  trackAllocation (func=memray::hooks::CALLOC, size=1040, ptr=0x15ffe40) at src/memray/_memray/tracking_api.h:225
#8  memray::intercept::calloc(unsigned long, unsigned long) () at src/memray/_memray/hooks.cpp:229
#9  0x00007fffecc29ec4 in __new_exitfn () from /lib64/libc.so.6
#10 0x00007fffecc29f49 in __cxa_atexit_internal () from /lib64/libc.so.6
#11 0x00007fffed8f29c3 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#12 0x00007fffed8f759e in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#13 0x00007fffed8f27d4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#14 0x00007fffed8f6b8b in _dl_open () from /lib64/ld-linux-x86-64.so.2
#15 0x00007fffed4c3fab in dlopen_doit () from /lib64/libdl.so.2
#16 0x00007fffed8f27d4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#17 0x00007fffed4c45ad in _dlerror_run () from /lib64/libdl.so.2
#18 0x00007fffed4c4041 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#19 0x00000000005fab46 in _PyImport_FindSharedFuncptr ()
#20 0x00000000005f9624 in _imp_create_dynamic ()
#21 0x000000000052ef5b in cfunction_vectorcall_FASTCALL ()
#22 0x000000000043353c in _PyEval_EvalFrameDefault.cold ()
#23 0x0000000000538cc3 in _PyFunction_Vectorcall ()
#24 0x0000000000527157 in object_vacall ()
#25 0x000000000055659c in PyObject_CallMethodObjArgs ()
#26 0x00000000005560e7 in PyImport_ImportModuleLevelObject.localalias ()
#27 0x00000000005166e7 in _PyEval_EvalFrameDefault ()
#28 0x00000000005cbeda in _PyEval_Vector ()
#29 0x00000000005cb5af in PyEval_EvalCode ()
#30 0x00000000005e4d83 in builtin_exec ()
#31 0x000000000051e7b7 in cfunction_vectorcall_FASTCALL_KEYWORDS ()
#32 0x000000000043353c in _PyEval_EvalFrameDefault.cold ()
#33 0x0000000000538cc3 in _PyFunction_Vectorcall ()
#34 0x0000000000527157 in object_vacall ()
#35 0x000000000055659c in PyObject_CallMethodObjArgs ()
#36 0x00000000005560e7 in PyImport_ImportModuleLevelObject.localalias ()
#37 0x00000000005166e7 in _PyEval_EvalFrameDefault ()
#38 0x00000000005cbeda in _PyEval_Vector ()
#39 0x00000000005cb5af in PyEval_EvalCode ()
---Type <return> to continue, or q <return> to quit---
#40 0x00000000005e4d83 in builtin_exec ()
#41 0x000000000051e7b7 in cfunction_vectorcall_FASTCALL_KEYWORDS ()
#42 0x000000000043353c in _PyEval_EvalFrameDefault.cold ()
#43 0x0000000000538cc3 in _PyFunction_Vectorcall ()
#44 0x0000000000527157 in object_vacall ()
#45 0x000000000055659c in PyObject_CallMethodObjArgs ()
#46 0x00000000005560e7 in PyImport_ImportModuleLevelObject.localalias ()
#47 0x000000000056a800 in builtin___import__ ()
#48 0x000000000051e7b7 in cfunction_vectorcall_FASTCALL_KEYWORDS ()
#49 0x000000000043353c in _PyEval_EvalFrameDefault.cold ()
#50 0x0000000000538cc3 in _PyFunction_Vectorcall ()
#51 0x0000000000527157 in object_vacall ()
#52 0x000000000055659c in PyObject_CallMethodObjArgs ()
#53 0x0000000000556301 in PyImport_ImportModuleLevelObject.localalias ()
#54 0x00000000005166e7 in _PyEval_EvalFrameDefault ()
#55 0x00000000005cbeda in _PyEval_Vector ()
#56 0x00000000005cb5af in PyEval_EvalCode ()
#57 0x00000000005e4d83 in builtin_exec ()
#58 0x000000000051e7b7 in cfunction_vectorcall_FASTCALL_KEYWORDS ()
#59 0x000000000043353c in _PyEval_EvalFrameDefault.cold ()
#60 0x0000000000538cc3 in _PyFunction_Vectorcall ()
#61 0x0000000000527157 in object_vacall ()
#62 0x000000000055659c in PyObject_CallMethodObjArgs ()
#63 0x00000000005560e7 in PyImport_ImportModuleLevelObject.localalias ()
#64 0x00000000005166e7 in _PyEval_EvalFrameDefault ()
#65 0x00000000005cbeda in _PyEval_Vector ()
#66 0x00000000005cb5af in PyEval_EvalCode ()
#67 0x00000000005e4d83 in builtin_exec ()
#68 0x000000000051e7b7 in cfunction_vectorcall_FASTCALL_KEYWORDS ()
#69 0x000000000043353c in _PyEval_EvalFrameDefault.cold ()
#70 0x0000000000538cc3 in _PyFunction_Vectorcall ()
#71 0x0000000000527157 in object_vacall ()
#72 0x000000000055659c in PyObject_CallMethodObjArgs ()
#73 0x00000000005560e7 in PyImport_ImportModuleLevelObject.localalias ()
#74 0x00000000005166e7 in _PyEval_EvalFrameDefault ()
#75 0x00000000005cbeda in _PyEval_Vector ()
#76 0x00000000005cb5af in PyEval_EvalCode ()
#77 0x00000000005e4d83 in builtin_exec ()
#78 0x000000000051e7b7 in cfunction_vectorcall_FASTCALL_KEYWORDS ()
#79 0x000000000043353c in _PyEval_EvalFrameDefault.cold ()
---Type <return> to continue, or q <return> to quit---
#80 0x0000000000538cc3 in _PyFunction_Vectorcall ()
#81 0x0000000000527157 in object_vacall ()
#82 0x000000000055659c in PyObject_CallMethodObjArgs ()
#83 0x00000000005560e7 in PyImport_ImportModuleLevelObject.localalias ()
#84 0x00000000005166e7 in _PyEval_EvalFrameDefault ()
#85 0x00000000005cbeda in _PyEval_Vector ()
#86 0x00000000005cb5af in PyEval_EvalCode ()
#87 0x00000000005e4d83 in builtin_exec ()
#88 0x000000000051e7b7 in cfunction_vectorcall_FASTCALL_KEYWORDS ()
#89 0x000000000043353c in _PyEval_EvalFrameDefault.cold ()
#90 0x0000000000538cc3 in _PyFunction_Vectorcall ()
#91 0x0000000000527157 in object_vacall ()
#92 0x000000000055659c in PyObject_CallMethodObjArgs ()
#93 0x00000000005560e7 in PyImport_ImportModuleLevelObject.localalias ()
#94 0x000000000056a800 in builtin___import__ ()
#95 0x000000000051e7b7 in cfunction_vectorcall_FASTCALL_KEYWORDS ()
#96 0x000000000043353c in _PyEval_EvalFrameDefault.cold ()
#97 0x0000000000538cc3 in _PyFunction_Vectorcall ()
#98 0x0000000000527157 in object_vacall ()
#99 0x000000000055659c in PyObject_CallMethodObjArgs ()
#100 0x00000000005560e7 in PyImport_ImportModuleLevelObject.localalias ()
#101 0x00000000005166e7 in _PyEval_EvalFrameDefault ()
#102 0x00000000005cbeda in _PyEval_Vector ()
#103 0x00000000005cb5af in PyEval_EvalCode ()
#104 0x00000000005e4d83 in builtin_exec ()
#105 0x000000000051e7b7 in cfunction_vectorcall_FASTCALL_KEYWORDS ()
#106 0x000000000051e6a1 in PyObject_Vectorcall ()
#107 0x00000000004315b3 in _PyEval_EvalFrameDefault.cold ()
#108 0x00000000005cbeda in _PyEval_Vector ()
#109 0x00000000005cb5af in PyEval_EvalCode ()
#110 0x00000000005ec6a7 in run_eval_code_obj ()
#111 0x00000000005e8240 in run_mod ()
#112 0x00000000005fd192 in pyrun_file ()
#113 0x00000000005fc55f in _PyRun_SimpleFileObject.localalias ()
#114 0x00000000005fc283 in _PyRun_AnyFileObject.localalias ()
#115 0x00000000005f6efe in Py_RunMain.localalias ()
#116 0x00000000005bbc79 in Py_BytesMain ()
#117 0x00007fffecc12555 in __libc_start_main () from /lib64/libc.so.6
#118 0x00000000005bbac3 in _start ()
(gdb) info threads 
  Id   Target Id         Frame 
  3    Thread 0x7fffba3ff700 (LWP 27913) "python" 0x00007fffed6d554d in __lll_lock_wait () from /lib64/libpthread.so.0
  2    Thread 0x7fffe152d700 (LWP 27879) "python" 0x00007fffed6d554d in __lll_lock_wait () from /lib64/libpthread.so.0
* 1    Thread 0x7fffedaee740 (LWP 27867) "python" 0x00007fffed6d554d in __lll_lock_wait () from /lib64/libpthread.so.0

pablogsal commented 2 months ago

There are 3 threads in your program; can you print the stack for all 3 of them?

You can switch from thread to thread using the “thread X” command and then type “bt” for every thread.
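
For example (thread numbers as reported by info threads):

(gdb) info threads
(gdb) thread 2
(gdb) bt
(gdb) thread 3
(gdb) bt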

godlygeek commented 2 months ago

You can actually just run thread apply all bt to have gdb print the stack for every thread, one after another.

godlygeek commented 2 months ago

I've tried reproducing this on a RHEL 7.9 machine, and wasn't able to, either. I get to the same point as @pablogsal reached:

$ memray run -f -o memray.bin pz-compute/rail_scripts/rail-estimate -a lephare ./objectTable_tract_5065_DC2_2_2i_runs_DP0_2_v23_0_1_PREOPS-905_step3_29_20220314T204515Z-part2.hdf5 output1.hdf5
Writing profile results into memray.bin
Estimator algorithm: lephare
Estimator calibration file: estimator_lephare.pkl
Bins: 301
HDF5 group name: ""
Column template for magnitude data: "mag_{band}"
Column template for error data: "magerr_{band}"
Extra parameter file: "None"
Starting setup.
Loading all program modules...
LEPHAREDIR is being set to the default cache directory:
/home/mwoznisk/.cache/lephare/data
More than 1Gb may be written there.
LEPHAREWORK is being set to the default cache directory:
/home/mwoznisk/.cache/lephare/work
Default work cache is already linked.
This is linked to the run directory:
/home/mwoznisk/.cache/lephare/runs/20240822T134206
Number of keywords read in the config file: 86
Number of keywords read in the config file: 86
Configuring estimator...
[memray] Successfully generated profile results.

You can now generate reports from the stored allocation records.
Some example commands to generate reports:

/tmp/deadlock-repro/venv/bin/python3.11 -m memray flamegraph memray.bin
Traceback (most recent call last):
  File "/tmp/deadlock-repro/venv/bin/memray", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/tmp/deadlock-repro/venv/lib/python3.11/site-packages/memray/commands/__init__.py", line 138, in main
    arg_values.entrypoint(arg_values, parser)
  File "/tmp/deadlock-repro/venv/lib/python3.11/site-packages/memray/commands/run.py", line 339, in run
    _run_with_file_output(args)
  File "/tmp/deadlock-repro/venv/lib/python3.11/site-packages/memray/commands/run.py", line 188, in _run_with_file_output
    _run_tracker(
  File "/tmp/deadlock-repro/venv/lib/python3.11/site-packages/memray/commands/run.py", line 77, in _run_tracker
    runpy.run_path(args.script, run_name="__main__")
  File "<frozen runpy>", line 291, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "pz-compute/rail_scripts/rail-estimate", line 250, in <module>
    main()
  File "pz-compute/rail_scripts/rail-estimate", line 245, in main
    ctx = setup(cfg)
          ^^^^^^^^^^
  File "pz-compute/rail_scripts/rail-estimate", line 202, in setup
    estimator = setup_estimator(
                ^^^^^^^^^^^^^^^^
  File "pz-compute/rail_scripts/rail-estimate", line 172, in setup_estimator
    estimator = Estimator.make_stage(**all_args)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/deadlock-repro/venv/lib/python3.11/site-packages/ceci/stage.py", line 123, in make_stage
    stage = cls(kwcopy, comm=comm, aliases=aliases)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: LephareEstimator.__init__() got an unexpected keyword argument 'aliases'

godlygeek commented 2 months ago

@hdante FWIW, the strace log isn't very useful. strace shows only system calls, and the entire idea behind futexes ("fast userspace mutexes") is that they typically don't need to make any system calls and can do all their work in userspace. Since there's no system call when there's no contention, the strace output can't tell us which locks are held by which threads at any given point in time.

I think we really do need a thread apply all bt showing all 3 threads' stacks at the point where the deadlock has occurred in order to figure out what's going on. Or a reliable reproducer - maybe you can set up a Dockerfile that's able to reproduce the issue?

hdante commented 2 months ago

Hello everyone, here is another gdb backtrace, this time with all 3 threads:

(base) [henrique.almeida@loginapl01 testlephare]$ gdb python
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/bin/python3.11...Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/bin/python3.11]
(no debugging symbols found)...done.
(reverse-i-search)`': ^CQuit
(gdb) 
(gdb) run /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/bin/memray run -f -o memray.bin $ALTHOME/bin/rail-estimate -a lephare $SCRATCH/input/hdf5/ref/objectTable_tract_5065_DC2_2_2i_runs_DP0_2_v23_0_1_PREOPS-905_step3_29_20220314T204515Z-part2.hdf5 output1.hdf5
Starting program: /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/bin/python /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/bin/memray run -f -o memray.bin $ALTHOME/bin/rail-estimate -a lephare $SCRATCH/input/hdf5/ref/objectTable_tract_5065_DC2_2_2i_runs_DP0_2_v23_0_1_PREOPS-905_step3_29_20220314T204515Z-part2.hdf5 output1.hdf5
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_typing.cpython-311-x86_64-linux-gnu.so]
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/liblz4-af1653fb.so.1.8.3
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/02/bd76e3669fdcbd9723caa978eda65336967338.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libunwind-d5b5a318.so.8.0.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/ca/ae2c0f7bdf5eef40a96a35ea99f6555bf684e2.debug
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/bin/../lib/libstdc++.so.6]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/bin/../lib/libgcc_s.so.1]
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libssl-cd1d6220.so.1.0.2k
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/7c/bcb0322f585236d81b557ed95c708f98a20c33.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libcrypto-d3570994.so.1.0.2k
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/9b/03a5571e128c07bf955eb2bd8155311295ae50.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/liblzma-004595ca.so.5.2.2
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/f7/44796ea6fa4d8019414f5487ee85de74dc5adc.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libgssapi_krb5-497db0c6.so.2.2
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/0c/aec124d97114da40ddeb0fed1fad5d14c3d626.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libkrb5-fcafa220.so.3.3
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/52/c5b6279da9cd210e5d58d1b1e0e080e5c8b232.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libcom_err-2abe824b.so.2.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/2a/ddb65846a50ce45f0c9b62ea35dda62c6ad7a2.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libk5crypto-b1f99d5c.so.3.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/5f/f9d1075a8d5d62f77f5ce56c935fcd92c62efa.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libkrb5support-d0bcff84.so.0.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/77/9381063daecc27e8480c8f79f0651162586478.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libkeyutils-dfe70bd6.so.1.5
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/2e/01d5ac08c1280d013aab96b292ac58bc30a263.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libselinux-0922c95c.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/90/3a0bd0bfb4fee8c284f41beb9773ded94cbc52.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libpcre-9513aab5.so.1.2.0
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/9c/a3d11f018beeb719cdb34be800bf1641350d0a.debug
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/math.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_datetime.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_opcode.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/mmap.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/zlib.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/termios.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_bisect.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_random.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_sha512.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/array.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_decimal.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_bz2.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/../../libbz2.so.1.0]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_lzma.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_socket.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/select.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/fcntl.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_posixsubprocess.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_contextvars.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_heapq.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_ssl.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_struct.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/binascii.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_asyncio.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/unicodedata.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_uuid.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_pickle.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_hashlib.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_blake2.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_json.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_csv.cpython-311-x86_64-linux-gnu.so]
Writing profile results into memray.bin
[New Thread 0x7fffe152d700 (LWP 54672)]
Estimator algorithm: lephare
Estimator calibration file: estimator_lephare.pkl
Bins: 301
HDF5 group name: ""
Column template for magnitude data: "mag_{band}"
Column template for error data: "magerr_{band}"
Extra parameter file: "None"
Starting setup.
Loading all program modules...
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_lsprof.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/resource.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/pyexpat.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_queue.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/grp.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so]
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/mkl/../../../libiomp5.so
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/82/88a89ba4a984ed547d3ee90f99fda1d695461e.debug
[New Thread 0x7fffba3ff700 (LWP 54714)]
^C
Program received signal SIGINT, Interrupt.
0x00007fffed6d554d in __lll_lock_wait () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install glibc-2.17-326.el7_9.x86_64
(gdb) thread apply all bt

Thread 3 (Thread 0x7fffba3ff700 (LWP 54714)):
#0  0x00007fffed6d554d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fffed6d0eb6 in _L_lock_941 () from /lib64/libpthread.so.0
#2  0x00007fffed6d0daf in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007fffed8f50cb in tls_get_addr_tail () from /lib64/ld-linux-x86-64.so.2
#4  0x00007fffed8fa9f8 in __tls_get_addr () from /lib64/ld-linux-x86-64.so.2
#5  0x00007fffe643593b in thread_id () at src/memray/_memray/tracking_api.cpp:70
#6  memray::tracking_api::Tracker::registerThreadNameImpl(char const*) ()
    at src/memray/_memray/tracking_api.cpp:965
#7  0x00007fffe6422a4f in registerThreadName (name=0x7fffbd5f5454 "jemalloc_bg_thd")
    at src/memray/_memray/tracking_api.h:289
#8  memray::intercept::prctl(int, ...) () at src/memray/_memray/hooks.cpp:464
#9  0x00007fffed6d77bf in pthread_setname_np () from /lib64/libpthread.so.0
#10 0x00007fffbcf977b1 in background_thread_entry () at src/background_thread.c:432
#11 0x00007fffed6ceea5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007fffecceeb0d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fffe152d700 (LWP 54672)):
#0  0x00007fffed6d554d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fffed6d0e9b in _L_lock_883 () from /lib64/libpthread.so.0
#2  0x00007fffed6d0d68 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007fffe642e433 in __gthread_mutex_lock (__mutex=0x97b990)
    at /opt/rh/devtoolset-10/root/usr/include/c++/10/x86_64-redhat-linux/bits/gthr-default.h:749
#4  lock (this=0x97b990) at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/std_mutex.h:100
#5  lock_guard (__m=..., this=<optimized out>, this=<optimized out>, __m=...)
    at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/std_mutex.h:159
#6  memray::tracking_api::Tracker::BackgroundThread::captureMemorySnapshot() ()
    at src/memray/_memray/tracking_api.cpp:692
#7  0x00007fffe642ec12 in operator() (__closure=<optimized out>) at src/memray/_memray/tracking_api.cpp:722
#8  __invoke_impl<void, memray::tracking_api::Tracker::BackgroundThread::start()::<lambda()> > (
    __f=<optimized out>) at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/invoke.h:60
#9  __invoke<memray::tracking_api::Tracker::BackgroundThread::start()::<lambda()> > (__fn=<optimized out>)
    at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/invoke.h:95
#10 _M_invoke<0> (this=<optimized out>) at /opt/rh/devtoolset-10/root/usr/include/c++/10/thread:264
#11 operator() (this=<optimized out>) at /opt/rh/devtoolset-10/root/usr/include/c++/10/thread:271
#12 std::thread::_State_impl<std::thread::_Invoker<std::tuple<memray::tracking_api::Tracker::BackgroundThread::start()::{lambda()#1}> > >::_M_run() () at /opt/rh/devtoolset-10/root/usr/include/c++/10/thread:215
#13 0x00007fffe6461d40 in execute_native_thread_routine ()
   from /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/_memray.cpython-311-x86_64-linux-gnu.so
#14 0x00007fffed6ceea5 in start_thread () from /lib64/libpthread.so.0
#15 0x00007fffecceeb0d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fffedaee740 (LWP 54636)):
#0  0x00007fffed6d554d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fffed6d0e9b in _L_lock_883 () from /lib64/libpthread.so.0
#2  0x00007fffed6d0d68 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007fffe6420c8a in __gthread_mutex_lock (__mutex=<optimized out>)
    at /opt/rh/devtoolset-10/root/usr/include/c++/10/x86_64-redhat-linux/bits/gthr-default.h:749
#4  lock (this=<optimized out>) at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/std_mutex.h:100
#5  lock (this=0x7fffffff7c90) at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/unique_lock.h:138
#6  unique_lock (__m=..., this=<optimized out>, this=<optimized out>, __m=...)
    at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/unique_lock.h:68
#7  trackAllocation (func=memray::hooks::CALLOC, size=1040, ptr=0x15ff9b0)
    at src/memray/_memray/tracking_api.h:225
#8  memray::intercept::calloc(unsigned long, unsigned long) () at src/memray/_memray/hooks.cpp:229
#9  0x00007fffecc29ec4 in __new_exitfn () from /lib64/libc.so.6
#10 0x00007fffecc29f49 in __cxa_atexit_internal () from /lib64/libc.so.6
#11 0x00007fffed8f29c3 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#12 0x00007fffed8f759e in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#13 0x00007fffed8f27d4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
---Type <return> to continue, or q <return> to quit---
#14 0x00007fffed8f6b8b in _dl_open () from /lib64/ld-linux-x86-64.so.2
#15 0x00007fffed4c3fab in dlopen_doit () from /lib64/libdl.so.2
#16 0x00007fffed8f27d4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#17 0x00007fffed4c45ad in _dlerror_run () from /lib64/libdl.so.2
#18 0x00007fffed4c4041 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#19 0x00000000005fab46 in _PyImport_FindSharedFuncptr ()
#20 0x00000000005f9624 in _imp_create_dynamic ()
#21 0x000000000052ef5b in cfunction_vectorcall_FASTCALL ()
#22 0x000000000043353c in _PyEval_EvalFrameDefault.cold ()
#23 0x0000000000538cc3 in _PyFunction_Vectorcall ()
#24 0x0000000000527157 in object_vacall ()
#25 0x000000000055659c in PyObject_CallMethodObjArgs ()
#26 0x00000000005560e7 in PyImport_ImportModuleLevelObject.localalias ()
#27 0x00000000005166e7 in _PyEval_EvalFrameDefault ()
#28 0x00000000005cbeda in _PyEval_Vector ()
#29 0x00000000005cb5af in PyEval_EvalCode ()
#30 0x00000000005e4d83 in builtin_exec ()
#31 0x000000000051e7b7 in cfunction_vectorcall_FASTCALL_KEYWORDS ()
#32 0x000000000043353c in _PyEval_EvalFrameDefault.cold ()
#33 0x0000000000538cc3 in _PyFunction_Vectorcall ()
#34 0x0000000000527157 in object_vacall ()
#35 0x000000000055659c in PyObject_CallMethodObjArgs ()
#36 0x00000000005560e7 in PyImport_ImportModuleLevelObject.localalias ()
#37 0x00000000005166e7 in _PyEval_EvalFrameDefault ()
#38 0x00000000005cbeda in _PyEval_Vector ()
#39 0x00000000005cb5af in PyEval_EvalCode ()
#40 0x00000000005e4d83 in builtin_exec ()
#41 0x000000000051e7b7 in cfunction_vectorcall_FASTCALL_KEYWORDS ()
#42 0x000000000043353c in _PyEval_EvalFrameDefault.cold ()
#43 0x0000000000538cc3 in _PyFunction_Vectorcall ()
#44 0x0000000000527157 in object_vacall ()
#45 0x000000000055659c in PyObject_CallMethodObjArgs ()
#46 0x00000000005560e7 in PyImport_ImportModuleLevelObject.localalias ()
#47 0x000000000056a800 in builtin___import__ ()
#48 0x000000000051e7b7 in cfunction_vectorcall_FASTCALL_KEYWORDS ()
#49 0x000000000043353c in _PyEval_EvalFrameDefault.cold ()
#50 0x0000000000538cc3 in _PyFunction_Vectorcall ()
#51 0x0000000000527157 in object_vacall ()
#52 0x000000000055659c in PyObject_CallMethodObjArgs ()
#53 0x0000000000556301 in PyImport_ImportModuleLevelObject.localalias ()
#54 0x00000000005166e7 in _PyEval_EvalFrameDefault ()
#55 0x00000000005cbeda in _PyEval_Vector ()
#56 0x00000000005cb5af in PyEval_EvalCode ()
#57 0x00000000005e4d83 in builtin_exec ()
#58 0x000000000051e7b7 in cfunction_vectorcall_FASTCALL_KEYWORDS ()
#59 0x000000000043353c in _PyEval_EvalFrameDefault.cold ()
#60 0x0000000000538cc3 in _PyFunction_Vectorcall ()
#61 0x0000000000527157 in object_vacall ()
#62 0x000000000055659c in PyObject_CallMethodObjArgs ()
#63 0x00000000005560e7 in PyImport_ImportModuleLevelObject.localalias ()
#64 0x00000000005166e7 in _PyEval_EvalFrameDefault ()
#65 0x00000000005cbeda in _PyEval_Vector ()
#66 0x00000000005cb5af in PyEval_EvalCode ()
#67 0x00000000005e4d83 in builtin_exec ()
#68 0x000000000051e7b7 in cfunction_vectorcall_FASTCALL_KEYWORDS ()
#69 0x000000000043353c in _PyEval_EvalFrameDefault.cold ()
#70 0x0000000000538cc3 in _PyFunction_Vectorcall ()
#71 0x0000000000527157 in object_vacall ()
#72 0x000000000055659c in PyObject_CallMethodObjArgs ()
#73 0x00000000005560e7 in PyImport_ImportModuleLevelObject.localalias ()
#74 0x00000000005166e7 in _PyEval_EvalFrameDefault ()
#75 0x00000000005cbeda in _PyEval_Vector ()
#76 0x00000000005cb5af in PyEval_EvalCode ()
#77 0x00000000005e4d83 in builtin_exec ()
#78 0x000000000051e7b7 in cfunction_vectorcall_FASTCALL_KEYWORDS ()
#79 0x000000000043353c in _PyEval_EvalFrameDefault.cold ()
#80 0x0000000000538cc3 in _PyFunction_Vectorcall ()
#81 0x0000000000527157 in object_vacall ()
#82 0x000000000055659c in PyObject_CallMethodObjArgs ()
#83 0x00000000005560e7 in PyImport_ImportModuleLevelObject.localalias ()
#84 0x00000000005166e7 in _PyEval_EvalFrameDefault ()
#85 0x00000000005cbeda in _PyEval_Vector ()
#86 0x00000000005cb5af in PyEval_EvalCode ()
#87 0x00000000005e4d83 in builtin_exec ()
#88 0x000000000051e7b7 in cfunction_vectorcall_FASTCALL_KEYWORDS ()
#89 0x000000000043353c in _PyEval_EvalFrameDefault.cold ()
#90 0x0000000000538cc3 in _PyFunction_Vectorcall ()
#91 0x0000000000527157 in object_vacall ()
#92 0x000000000055659c in PyObject_CallMethodObjArgs ()
#93 0x00000000005560e7 in PyImport_ImportModuleLevelObject.localalias ()
#94 0x000000000056a800 in builtin___import__ ()
#95 0x000000000051e7b7 in cfunction_vectorcall_FASTCALL_KEYWORDS ()
#96 0x000000000043353c in _PyEval_EvalFrameDefault.cold ()
#97 0x0000000000538cc3 in _PyFunction_Vectorcall ()
#98 0x0000000000527157 in object_vacall ()
#99 0x000000000055659c in PyObject_CallMethodObjArgs ()
#100 0x00000000005560e7 in PyImport_ImportModuleLevelObject.localalias ()
#101 0x00000000005166e7 in _PyEval_EvalFrameDefault ()
#102 0x00000000005cbeda in _PyEval_Vector ()
#103 0x00000000005cb5af in PyEval_EvalCode ()
#104 0x00000000005e4d83 in builtin_exec ()
#105 0x000000000051e7b7 in cfunction_vectorcall_FASTCALL_KEYWORDS ()
#106 0x000000000051e6a1 in PyObject_Vectorcall ()
#107 0x00000000004315b3 in _PyEval_EvalFrameDefault.cold ()
#108 0x00000000005cbeda in _PyEval_Vector ()
#109 0x00000000005cb5af in PyEval_EvalCode ()
#110 0x00000000005ec6a7 in run_eval_code_obj ()
#111 0x00000000005e8240 in run_mod ()
#112 0x00000000005fd192 in pyrun_file ()
#113 0x00000000005fc55f in _PyRun_SimpleFileObject.localalias ()
#114 0x00000000005fc283 in _PyRun_AnyFileObject.localalias ()
#115 0x00000000005f6efe in Py_RunMain.localalias ()
#116 0x00000000005bbc79 in Py_BytesMain ()
#117 0x00007fffecc12555 in __libc_start_main () from /lib64/libc.so.6
#118 0x00000000005bbac3 in _start ()
hdante commented 2 months ago

> I've tried reproducing this on a RHEL 7.9 machine, and wasn't able to, either. I get to the same point as @pablogsal reached:

Hello, Matt, yes, I'm afraid you'll need lephare from the master branch for that to work.

hdante commented 2 months ago

> @hdante FWIW, the strace log isn't very useful. strace shows only system calls, and the entire idea behind futexes - "fast userspace mutexes" - is that they typically don't need to make any system calls, and can get all the work they need to do done in userspace. Since there's no system call when there's no contention, we can't figure anything out about what locks are held by what threads at any given point in time from the strace output.
>
> I think we really do need a thread apply all bt showing all 3 threads' stacks at the point where the deadlock has occurred in order to figure out what's going on. Or a reliable reproducer - maybe you can set up a Dockerfile that's able to reproduce the issue?

The given strace log is useful because I hit the case where the fast userspace path was not taken, and it's guaranteed by the 2 futex system calls on separate threads that they are the cause of the deadlock. Posting only the 2 relevant lines:

41784 futex(0x1a49990, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
(...)
41793 futex(0x1a49990, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>

I've posted the backtrace for the 3 threads; more library calls are visible now. Meanwhile, I'll try to find the line in the code where the deadlock happens. If the log is not enough, I can make the docker image.
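
A quick way to spot this pattern in a log like the attached one is to group the FUTEX_WAIT_PRIVATE calls by futex address. The sketch below is only for illustration, assuming the strace output is saved as a plain file named trace.txt in the format shown above; note that it only finds waiters, not the thread that actually holds the lock.

import re
from collections import defaultdict

# Matches lines like:
# 41784 futex(0x1a49990, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
FUTEX_WAIT = re.compile(r"^(\d+)\s+futex\((0x[0-9a-f]+), FUTEX_WAIT_PRIVATE")

def find_contended_futexes(path):
    # Group the thread IDs seen in FUTEX_WAIT_PRIVATE calls by futex address.
    waiters = defaultdict(set)
    with open(path) as f:
        for line in f:
            m = FUTEX_WAIT.match(line)
            if m:
                tid, addr = m.groups()
                waiters[addr].add(tid)
    # An address with more than one distinct waiter deserves a closer look.
    return {addr: tids for addr, tids in waiters.items() if len(tids) > 1}

if __name__ == "__main__":
    for addr, tids in find_contended_futexes("trace.txt").items():
        print(addr, "waited on by threads:", sorted(tids))

Against the snippet above, this reports 41784 and 41793 both waiting on 0x1a49990.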

godlygeek commented 2 months ago

> it's guaranteed by the 2 futex system calls on separate threads that they are the cause of the deadlock

I don't think so - I'm pretty sure only one of those two threads is involved in causing the deadlock. I believe these two futex calls correspond to the main thread and the memray::tracking_api::Tracker::BackgroundThread thread, but the BackgroundThread hasn't done anything wrong: it's a victim of a deadlock between the main thread and the jemalloc_bg_thd thread. In other words, these 2 futex calls show the 2 threads that are waiting for the lock to be released, but not the thread that actually acquired it.

> here follows another gdb backtrace, this time with all 3 threads:

This was enough to figure out what's going on, though.

It's almost certainly related to the changes we made in https://github.com/bloomberg/memray/pull/525 - if that theory is right, I believe the problem will go away if you downgrade memray to 1.11.0 - can you try that for me and confirm if it fixes the issue for you?

And, if it doesn't, please grab a thread apply all bt again - the worst case scenario would be if it still deadlocks, but deadlocks in a different place...

hdante commented 2 months ago

Ok, so to make sure I understood, the proposed explanation is that there are 2 futexes being simultaneously grabbed in reversed order by 2 threads?

I'll check with the older version and return. Which version did you use when the program worked?

godlygeek commented 2 months ago

> Which version did you use when the program worked?

The program has been working for Pablo and me with the latest version of Memray. There must be something different between your environment and ours - maybe the glibc version, or something? - that's stopping us from reproducing it. I was optimistic that using glibc 2.17 would do the trick, but I wasn't able to reproduce it even with that, so there's still some variable we haven't identified. It may even be a plain old race condition, and you're just consistently managing to nail the right timing and we're not...

> the proposed explanation is that there are 2 futexes being simultaneously grabbed in reversed order by 2 threads?

Yep, it's a lock ordering deadlock - one thread is grabbing a lock inside glibc and then trying to acquire a lock inside memray, and the other thread is grabbing the lock inside memray and then trying to acquire the lock inside glibc.
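
Sketched in pure Python just to illustrate the shape of the problem - the two locks here are generic stand-ins for the glibc loader lock and memray's tracker lock, not the real ones, and timeouts are used so the example reports the inversion instead of hanging forever:

import threading
import time

loader_lock = threading.Lock()   # stand-in for the lock inside glibc's dynamic loader
tracker_lock = threading.Lock()  # stand-in for the lock inside memray's tracker

def thread_a():
    # Order 1: take the loader lock first, then ask for the tracker lock.
    with loader_lock:
        time.sleep(0.1)  # widen the race window so both threads hold their first lock
        if tracker_lock.acquire(timeout=1):
            tracker_lock.release()
        else:
            print("thread A: gave up waiting for the tracker lock (ordering inversion)")

def thread_b():
    # Order 2: take the tracker lock first, then ask for the loader lock.
    with tracker_lock:
        time.sleep(0.1)
        if loader_lock.acquire(timeout=1):
            loader_lock.release()
        else:
            print("thread B: gave up waiting for the loader lock (ordering inversion)")

a = threading.Thread(target=thread_a)
b = threading.Thread(target=thread_b)
a.start()
b.start()
a.join()
b.join()

With real locks and no timeout, each thread would wait forever on the lock the other one already holds, which is exactly what the two FUTEX_WAIT_PRIVATE calls show.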

hdante commented 2 months ago

Yes, memray 1.11.0 works: the function finishes, lephare starts estimation, no deadlocks.

godlygeek commented 2 months ago

OK, thanks for the confirmation. In that case, you've got a workaround, and we'll figure out how to fix this going forward. Most likely we'll wind up reverting #525 and resurrecting #549 instead - we waffled between those two approaches, but it seems like the latter might be safer, albeit a bit hackier - it forces us to do part of the loader's job...

Thanks for your patience in helping us track this down!

hdante commented 2 months ago

Thank you all for the help !

godlygeek commented 1 month ago

We've released a fix in Memray 1.14 - if you get a chance, @hdante, please make sure that version works for you!

hdante commented 1 month ago

Ok, thank you, I'll test it later this week.

hdante commented 1 month ago

Hello, Pablo, memray 1.14 is working fine with lephare without deadlocks, thank you very much!

pablogsal commented 1 month ago

Awesome! Thanks for opening the issue!