hdante closed this issue 1 month ago.
Correcting the source code branch:
https://github.com/linea-it/pz-compute/blob/lephare/rail_scripts/rail-estimate#L114
I cannot reproduce this:
python -m memray run -f -o memray.bin pz-compute/rail_scripts/rail-estimate -a lephare $SCRATCH/input/hdf5/ref/objectTable_tract_5065_DC2_2_2i_runs_DP0_2_v23_0_1_PREOPS-905_step3_29_20220314T204515Z-part2.hdf5 output1.hdf5
Writing profile results into memray.bin
Estimator algorithm: lephare
Estimator calibration file: estimator_lephare.pkl
Bins: 301
HDF5 group name: ""
Column template for magnitude data: "mag_{band}"
Column template for error data: "magerr_{band}"
Extra parameter file: "None"
Starting setup.
Loading all program modules...
Error: unable to import program modules.
No module named 'rail.estimation.algos.lephare'
[memray] Successfully generated profile results.
You can now generate reports from the stored allocation records.
Some example commands to generate reports:
/Users/pgalindo3/.pyenv/versions/memray/bin/python -m memray flamegraph memray.bin
memray finishes correctly here without problems. I don't have access to the $SCRATCH/input/hdf5/ref/objectTable_tract_5065_DC2_2_2i_runs_DP0_2_v23_0_1_PREOPS-905_step3_29_20220314T204515Z-part2.hdf5 file, but I'm not sure if that matters...
I also tried after installing pz-rail-lephare:
memray run -f -o memray.bin pz-compute/rail_scripts/rail-estimate -a lephare $SCRATCH/input/hdf5/ref/objectTable_tract_5065_DC2_2_2i_runs_DP0_2_v23_0_1_PREOPS-905_step3_29_20220314T204515Z-part2.hdf5 output1.hdf5
Writing profile results into memray.bin
Estimator algorithm: lephare
Estimator calibration file: estimator_lephare.pkl
Bins: 301
HDF5 group name: ""
Column template for magnitude data: "mag_{band}"
Column template for error data: "magerr_{band}"
Extra parameter file: "None"
Starting setup.
Loading all program modules...
LEPHAREDIR is being set to the default cache directory:
/Users/pgalindo3/Library/Caches/lephare/data
More than 1Gb may be written there.
LEPHAREWORK is being set to the default cache directory:
/Users/pgalindo3/Library/Caches/lephare/work
Default work cache is already linked.
This is linked to the run directory:
/Users/pgalindo3/Library/Caches/lephare/runs/20240821T181721
Number of keywords read in the config file: 86
Number of keywords read in the config file: 86
[memray] Successfully generated profile results.
You can now generate reports from the stored allocation records.
Some example commands to generate reports:
/Users/pgalindo3/.pyenv/versions/3.11.6/envs/memray/bin/python3.11 -m memray flamegraph memray.bin
If there is more setup required, please give us all the instructions needed to reproduce the problem, with all the package versions and additional files.
Can you also give us a stack trace using gdb of the place where the hang happens?
Hello, if you have a conda installation available, you may run the install-pz-rail script in a virtual environment to mirror my environment. If not, see if you can install these packages with pip:
pip install joblib
pip install scikit-learn
pip install scikit-learn-intelex
CFLAGS="-L $CONDA_PREFIX/lib -I $CONDA_PREFIX/include" pip install fitsio
CFLAGS="-O2 -DNDEBUG -std=gnu99" pip install pytdigest
pip install flexcode
# Install RAIL after all dependencies are installed
pip install pz-rail
pip install pz-rail-bpz
pip install pz-rail-flexzboost
pip install pz-rail-tpz
pip install pz-rail-gpz-v1
pip install pz-rail-lephare
I'll attach the input and calibration files here so that you may execute the program until the end, but it doesn't really matter for triggering the deadlock, which happens before the main estimation process.
objectTable_tract_5065_DC2_2_2i_runs_DP0_2_v23_0_1_PREOPS-905_step3_29_20220314T204515Z-part2.hdf5.gz estimator_lephare.pkl.gz
For the gdb stack trace, do you mean attaching to the process after it hangs?
For the lephare algorithm auxiliary data:
git clone https://github.com/lephare-photoz/lephare-data/ /Users/pgalindo3/Library/Caches/lephare/data
Yeah, or run it under gdb, and then Ctrl-C and write bt.
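Concretely, both options look something like this (a sketch; the PID and file names are placeholders):

```
# Option 1: attach to the already-hung process (replace <pid> with the real one)
gdb -p <pid>
(gdb) bt
(gdb) thread apply all bt

# Option 2: run under gdb from the start
gdb --args python -m memray run -f -o memray.bin rail-estimate -a lephare input.hdf5 output1.hdf5
(gdb) run
# ... wait for the hang, press Ctrl-C ...
(gdb) bt
```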
I still cannot reproduce. The function load_slow_modules finishes without crashing, but then there is some problem with your dependencies:
memray run -f -o memray.bin pz-compute/rail_scripts/rail-estimate -a lephare \
./objectTable_tract_5065_DC2_2_2i_runs_DP0_2_v23_0_1_PREOPS-905_step3_29_20220314T204515Z-part2.hdf5.gz output1.hdf5
Writing profile results into memray.bin
Estimator algorithm: lephare
Estimator calibration file: estimator_lephare.pkl
Bins: 301
HDF5 group name: ""
Column template for magnitude data: "mag_{band}"
Column template for error data: "magerr_{band}"
Extra parameter file: "None"
Starting setup.
Loading all program modules...
LEPHAREDIR is being set to the default cache directory:
/home/pablogsal/.cache/lephare/data
More than 1Gb may be written there.
LEPHAREWORK is being set to the default cache directory:
/home/pablogsal/.cache/lephare/work
Default work cache is already linked.
This is linked to the run directory:
/home/pablogsal/.cache/lephare/runs/20240821T182524
Number of keywords read in the config file: 86
Number of keywords read in the config file: 86
Configuring estimator...
[memray] Successfully generated profile results.
You can now generate reports from the stored allocation records.
Some example commands to generate reports:
/home/pablogsal/.pyenv/versions/3.12.4/envs/memray/bin/python -m memray flamegraph memray.bin
Traceback (most recent call last):
File "/home/pablogsal/.pyenv/versions/memray/bin/memray", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/pablogsal/github/memray/src/memray/commands/__init__.py", line 138, in main
arg_values.entrypoint(arg_values, parser)
File "/home/pablogsal/github/memray/src/memray/commands/run.py", line 339, in run
_run_with_file_output(args)
File "/home/pablogsal/github/memray/src/memray/commands/run.py", line 188, in _run_with_file_output
_run_tracker(
File "/home/pablogsal/github/memray/src/memray/commands/run.py", line 77, in _run_tracker
runpy.run_path(args.script, run_name="__main__")
File "<frozen runpy>", line 286, in run_path
File "<frozen runpy>", line 98, in _run_module_code
File "<frozen runpy>", line 88, in _run_code
File "pz-compute/rail_scripts/rail-estimate", line 250, in <module>
main()
File "pz-compute/rail_scripts/rail-estimate", line 245, in main
ctx = setup(cfg)
^^^^^^^^^^
File "pz-compute/rail_scripts/rail-estimate", line 202, in setup
estimator = setup_estimator(
^^^^^^^^^^^^^^^^
File "pz-compute/rail_scripts/rail-estimate", line 172, in setup_estimator
estimator = Estimator.make_stage(**all_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/pablogsal/.pyenv/versions/3.12.4/envs/memray/lib/python3.12/site-packages/ceci/stage.py", line 123, in make_stage
stage = cls(kwcopy, comm=comm, aliases=aliases)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: LephareEstimator.__init__() got an unexpected keyword argument 'aliases'
but this happens after the point you mention, so I am not sure what can be wrong here. I'm going to need a reproducer in a Dockerfile or your gdb trace.
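For anyone following along, the shape of that TypeError can be reproduced in isolation. This is a minimal sketch with stand-in classes (they are not the real ceci/rail implementations): ceci's make_stage forwards an aliases keyword to the subclass constructor, which a LephareEstimator without that parameter cannot accept, while a version that takes aliases works fine.

```python
class Stage:
    """Stand-in for ceci's Stage; make_stage forwards aliases to __init__."""
    @classmethod
    def make_stage(cls, **kwargs):
        return cls(kwargs, comm=None, aliases={})

class ReleasedEstimator(Stage):
    """Like an estimator without aliases support: no aliases parameter."""
    def __init__(self, args, comm=None):
        self.args = args

class MasterEstimator(Stage):
    """Like an estimator with aliases support: accepts aliases."""
    def __init__(self, args, comm=None, aliases=None):
        self.args = args
        self.aliases = aliases

try:
    ReleasedEstimator.make_stage(algorithm="lephare")
except TypeError as exc:
    print(exc)  # ... got an unexpected keyword argument 'aliases'

stage = MasterEstimator.make_stage(algorithm="lephare")
print(stage.args)  # {'algorithm': 'lephare'}
```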
I'm sorry, you're right, I've installed lephare from the master branch. I have asked for a new release, but I think it's not there yet. The original issue is here:
https://github.com/lephare-photoz/lephare/issues/178#issuecomment-2297310118
So, confirming the exact lephare version used: 0.1.11.dev4+gff46ff8
Since your calls to load_slow_modules() are always succeeding, I think I'll need to better isolate the exact point of the deadlock. Meanwhile, here follows the gdb backtrace:
(base) [henrique.almeida@loginapl01 testlephare]$ gdb python
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/bin/python3.11...Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/bin/python3.11]
(no debugging symbols found)...done.
(gdb) run /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/bin/memray run -f -o memray.bin $ALTHOME/bin/rail-estimate -a lephare $SCRATCH/input/hdf5/ref/objectTable_tract_5065_DC2_2_2i_runs_DP0_2_v23_0_1_PREOPS-905_step3_29_20220314T204515Z-part2.hdf5 output1.hdf5
Starting program: /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/bin/python /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/bin/memray run -f -o memray.bin $ALTHOME/bin/rail-estimate -a lephare $SCRATCH/input/hdf5/ref/objectTable_tract_5065_DC2_2_2i_runs_DP0_2_v23_0_1_PREOPS-905_step3_29_20220314T204515Z-part2.hdf5 output1.hdf5
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_typing.cpython-311-x86_64-linux-gnu.so]
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/liblz4-af1653fb.so.1.8.3
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/02/bd76e3669fdcbd9723caa978eda65336967338.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libunwind-d5b5a318.so.8.0.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/ca/ae2c0f7bdf5eef40a96a35ea99f6555bf684e2.debug
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/bin/../lib/libstdc++.so.6]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/bin/../lib/libgcc_s.so.1]
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libssl-cd1d6220.so.1.0.2k
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/7c/bcb0322f585236d81b557ed95c708f98a20c33.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libcrypto-d3570994.so.1.0.2k
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/9b/03a5571e128c07bf955eb2bd8155311295ae50.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/liblzma-004595ca.so.5.2.2
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/f7/44796ea6fa4d8019414f5487ee85de74dc5adc.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libgssapi_krb5-497db0c6.so.2.2
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/0c/aec124d97114da40ddeb0fed1fad5d14c3d626.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libkrb5-fcafa220.so.3.3
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/52/c5b6279da9cd210e5d58d1b1e0e080e5c8b232.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libcom_err-2abe824b.so.2.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/2a/ddb65846a50ce45f0c9b62ea35dda62c6ad7a2.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libk5crypto-b1f99d5c.so.3.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/5f/f9d1075a8d5d62f77f5ce56c935fcd92c62efa.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libkrb5support-d0bcff84.so.0.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/77/9381063daecc27e8480c8f79f0651162586478.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libkeyutils-dfe70bd6.so.1.5
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/2e/01d5ac08c1280d013aab96b292ac58bc30a263.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libselinux-0922c95c.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/90/3a0bd0bfb4fee8c284f41beb9773ded94cbc52.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libpcre-9513aab5.so.1.2.0
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/9c/a3d11f018beeb719cdb34be800bf1641350d0a.debug
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/math.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_datetime.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_opcode.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/mmap.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/zlib.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/termios.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_bisect.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_random.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_sha512.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/array.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_decimal.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_bz2.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/../../libbz2.so.1.0]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_lzma.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_socket.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/select.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/fcntl.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_posixsubprocess.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_contextvars.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_heapq.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_ssl.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_struct.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/binascii.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_asyncio.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/unicodedata.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_uuid.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_pickle.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_hashlib.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_blake2.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_json.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_csv.cpython-311-x86_64-linux-gnu.so]
Writing profile results into memray.bin
[New Thread 0x7fffe152d700 (LWP 27879)]
Estimator algorithm: lephare
Estimator calibration file: estimator_lephare.pkl
Bins: 301
HDF5 group name: ""
Column template for magnitude data: "mag_{band}"
Column template for error data: "magerr_{band}"
Extra parameter file: "None"
Starting setup.
Loading all program modules...
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_lsprof.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/resource.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/pyexpat.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_queue.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/grp.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so]
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/mkl/../../../libiomp5.so
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/82/88a89ba4a984ed547d3ee90f99fda1d695461e.debug
[New Thread 0x7fffba3ff700 (LWP 27913)]
^C
Program received signal SIGINT, Interrupt.
0x00007fffed6d554d in __lll_lock_wait () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install glibc-2.17-326.el7_9.x86_64
(gdb) bt
#0 0x00007fffed6d554d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007fffed6d0e9b in _L_lock_883 () from /lib64/libpthread.so.0
#2 0x00007fffed6d0d68 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007fffe6420c8a in __gthread_mutex_lock (__mutex=<optimized out>) at /opt/rh/devtoolset-10/root/usr/include/c++/10/x86_64-redhat-linux/bits/gthr-default.h:749
#4 lock (this=<optimized out>) at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/std_mutex.h:100
#5 lock (this=0x7fffffff7c90) at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/unique_lock.h:138
#6 unique_lock (__m=..., this=<optimized out>, this=<optimized out>, __m=...) at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/unique_lock.h:68
#7 trackAllocation (func=memray::hooks::CALLOC, size=1040, ptr=0x15ffe40) at src/memray/_memray/tracking_api.h:225
#8 memray::intercept::calloc(unsigned long, unsigned long) () at src/memray/_memray/hooks.cpp:229
#9 0x00007fffecc29ec4 in __new_exitfn () from /lib64/libc.so.6
#10 0x00007fffecc29f49 in __cxa_atexit_internal () from /lib64/libc.so.6
#11 0x00007fffed8f29c3 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#12 0x00007fffed8f759e in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#13 0x00007fffed8f27d4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#14 0x00007fffed8f6b8b in _dl_open () from /lib64/ld-linux-x86-64.so.2
#15 0x00007fffed4c3fab in dlopen_doit () from /lib64/libdl.so.2
#16 0x00007fffed8f27d4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#17 0x00007fffed4c45ad in _dlerror_run () from /lib64/libdl.so.2
#18 0x00007fffed4c4041 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#19 0x00000000005fab46 in _PyImport_FindSharedFuncptr ()
#20 0x00000000005f9624 in _imp_create_dynamic ()
#21 0x000000000052ef5b in cfunction_vectorcall_FASTCALL ()
#22 0x000000000043353c in _PyEval_EvalFrameDefault.cold ()
#23 0x0000000000538cc3 in _PyFunction_Vectorcall ()
#24 0x0000000000527157 in object_vacall ()
#25 0x000000000055659c in PyObject_CallMethodObjArgs ()
#26 0x00000000005560e7 in PyImport_ImportModuleLevelObject.localalias ()
#27 0x00000000005166e7 in _PyEval_EvalFrameDefault ()
#28 0x00000000005cbeda in _PyEval_Vector ()
#29 0x00000000005cb5af in PyEval_EvalCode ()
#30 0x00000000005e4d83 in builtin_exec ()
#31 0x000000000051e7b7 in cfunction_vectorcall_FASTCALL_KEYWORDS ()
#32 0x000000000043353c in _PyEval_EvalFrameDefault.cold ()
#33 0x0000000000538cc3 in _PyFunction_Vectorcall ()
#34 0x0000000000527157 in object_vacall ()
#35 0x000000000055659c in PyObject_CallMethodObjArgs ()
#36 0x00000000005560e7 in PyImport_ImportModuleLevelObject.localalias ()
#37 0x00000000005166e7 in _PyEval_EvalFrameDefault ()
#38 0x00000000005cbeda in _PyEval_Vector ()
#39 0x00000000005cb5af in PyEval_EvalCode ()
---Type <return> to continue, or q <return> to quit---
#40 0x00000000005e4d83 in builtin_exec ()
#41 0x000000000051e7b7 in cfunction_vectorcall_FASTCALL_KEYWORDS ()
#42 0x000000000043353c in _PyEval_EvalFrameDefault.cold ()
#43 0x0000000000538cc3 in _PyFunction_Vectorcall ()
#44 0x0000000000527157 in object_vacall ()
#45 0x000000000055659c in PyObject_CallMethodObjArgs ()
#46 0x00000000005560e7 in PyImport_ImportModuleLevelObject.localalias ()
#47 0x000000000056a800 in builtin___import__ ()
#48 0x000000000051e7b7 in cfunction_vectorcall_FASTCALL_KEYWORDS ()
#49 0x000000000043353c in _PyEval_EvalFrameDefault.cold ()
#50 0x0000000000538cc3 in _PyFunction_Vectorcall ()
#51 0x0000000000527157 in object_vacall ()
#52 0x000000000055659c in PyObject_CallMethodObjArgs ()
#53 0x0000000000556301 in PyImport_ImportModuleLevelObject.localalias ()
#54 0x00000000005166e7 in _PyEval_EvalFrameDefault ()
#55 0x00000000005cbeda in _PyEval_Vector ()
#56 0x00000000005cb5af in PyEval_EvalCode ()
#57 0x00000000005e4d83 in builtin_exec ()
#58 0x000000000051e7b7 in cfunction_vectorcall_FASTCALL_KEYWORDS ()
#59 0x000000000043353c in _PyEval_EvalFrameDefault.cold ()
#60 0x0000000000538cc3 in _PyFunction_Vectorcall ()
#61 0x0000000000527157 in object_vacall ()
#62 0x000000000055659c in PyObject_CallMethodObjArgs ()
#63 0x00000000005560e7 in PyImport_ImportModuleLevelObject.localalias ()
#64 0x00000000005166e7 in _PyEval_EvalFrameDefault ()
#65 0x00000000005cbeda in _PyEval_Vector ()
#66 0x00000000005cb5af in PyEval_EvalCode ()
#67 0x00000000005e4d83 in builtin_exec ()
#68 0x000000000051e7b7 in cfunction_vectorcall_FASTCALL_KEYWORDS ()
#69 0x000000000043353c in _PyEval_EvalFrameDefault.cold ()
#70 0x0000000000538cc3 in _PyFunction_Vectorcall ()
#71 0x0000000000527157 in object_vacall ()
#72 0x000000000055659c in PyObject_CallMethodObjArgs ()
#73 0x00000000005560e7 in PyImport_ImportModuleLevelObject.localalias ()
#74 0x00000000005166e7 in _PyEval_EvalFrameDefault ()
#75 0x00000000005cbeda in _PyEval_Vector ()
#76 0x00000000005cb5af in PyEval_EvalCode ()
#77 0x00000000005e4d83 in builtin_exec ()
#78 0x000000000051e7b7 in cfunction_vectorcall_FASTCALL_KEYWORDS ()
#79 0x000000000043353c in _PyEval_EvalFrameDefault.cold ()
---Type <return> to continue, or q <return> to quit---
#80 0x0000000000538cc3 in _PyFunction_Vectorcall ()
#81 0x0000000000527157 in object_vacall ()
#82 0x000000000055659c in PyObject_CallMethodObjArgs ()
#83 0x00000000005560e7 in PyImport_ImportModuleLevelObject.localalias ()
#84 0x00000000005166e7 in _PyEval_EvalFrameDefault ()
#85 0x00000000005cbeda in _PyEval_Vector ()
#86 0x00000000005cb5af in PyEval_EvalCode ()
#87 0x00000000005e4d83 in builtin_exec ()
#88 0x000000000051e7b7 in cfunction_vectorcall_FASTCALL_KEYWORDS ()
#89 0x000000000043353c in _PyEval_EvalFrameDefault.cold ()
#90 0x0000000000538cc3 in _PyFunction_Vectorcall ()
#91 0x0000000000527157 in object_vacall ()
#92 0x000000000055659c in PyObject_CallMethodObjArgs ()
#93 0x00000000005560e7 in PyImport_ImportModuleLevelObject.localalias ()
#94 0x000000000056a800 in builtin___import__ ()
#95 0x000000000051e7b7 in cfunction_vectorcall_FASTCALL_KEYWORDS ()
#96 0x000000000043353c in _PyEval_EvalFrameDefault.cold ()
#97 0x0000000000538cc3 in _PyFunction_Vectorcall ()
#98 0x0000000000527157 in object_vacall ()
#99 0x000000000055659c in PyObject_CallMethodObjArgs ()
#100 0x00000000005560e7 in PyImport_ImportModuleLevelObject.localalias ()
#101 0x00000000005166e7 in _PyEval_EvalFrameDefault ()
#102 0x00000000005cbeda in _PyEval_Vector ()
#103 0x00000000005cb5af in PyEval_EvalCode ()
#104 0x00000000005e4d83 in builtin_exec ()
#105 0x000000000051e7b7 in cfunction_vectorcall_FASTCALL_KEYWORDS ()
#106 0x000000000051e6a1 in PyObject_Vectorcall ()
#107 0x00000000004315b3 in _PyEval_EvalFrameDefault.cold ()
#108 0x00000000005cbeda in _PyEval_Vector ()
#109 0x00000000005cb5af in PyEval_EvalCode ()
#110 0x00000000005ec6a7 in run_eval_code_obj ()
#111 0x00000000005e8240 in run_mod ()
#112 0x00000000005fd192 in pyrun_file ()
#113 0x00000000005fc55f in _PyRun_SimpleFileObject.localalias ()
#114 0x00000000005fc283 in _PyRun_AnyFileObject.localalias ()
#115 0x00000000005f6efe in Py_RunMain.localalias ()
#116 0x00000000005bbc79 in Py_BytesMain ()
#117 0x00007fffecc12555 in __libc_start_main () from /lib64/libc.so.6
#118 0x00000000005bbac3 in _start ()
(gdb) info threads
Id Target Id Frame
3 Thread 0x7fffba3ff700 (LWP 27913) "python" 0x00007fffed6d554d in __lll_lock_wait () from /lib64/libpthread.so.0
2 Thread 0x7fffe152d700 (LWP 27879) "python" 0x00007fffed6d554d in __lll_lock_wait () from /lib64/libpthread.so.0
* 1 Thread 0x7fffedaee740 (LWP 27867) "python" 0x00007fffed6d554d in __lll_lock_wait () from /lib64/libpthread.so.0
There are 3 threads in your program; can you print the stack for all 3 of them?
You can switch from thread to thread using the “thread X” command and then type “bt” for every thread.
You can actually just run thread apply all bt to have gdb print the stack for every thread, one after another.
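As a gdb-free complement (a sketch, not something required here): Python's faulthandler can dump the Python-level stacks of all threads from inside the process. It must be registered before the hang, and it will not show the C frames that gdb does, but it often narrows down which import or call each thread is stuck in.

```python
import faulthandler
import signal
import sys

# Register before starting the workload; afterwards, `kill -USR1 <pid>`
# makes the (possibly hung) process write every thread's Python stack
# to stderr without stopping it.
if hasattr(signal, "SIGUSR1"):  # not available on Windows
    faulthandler.register(signal.SIGUSR1, all_threads=True)

# The same dump can also be triggered directly from inside the process:
faulthandler.dump_traceback(all_threads=True, file=sys.stderr)
```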
I've tried reproducing this on a RHEL 7.9 machine, and wasn't able to, either. I get to the same point as @pablogsal reached:
$ memray run -f -o memray.bin pz-compute/rail_scripts/rail-estimate -a lephare ./objectTable_tract_5065_DC2_2_2i_runs_DP0_2_v23_0_1_PREOPS-905_step3_29_20220314T204515Z-part2.hdf5 output1.hdf5
Writing profile results into memray.bin
Estimator algorithm: lephare
Estimator calibration file: estimator_lephare.pkl
Bins: 301
HDF5 group name: ""
Column template for magnitude data: "mag_{band}"
Column template for error data: "magerr_{band}"
Extra parameter file: "None"
Starting setup.
Loading all program modules...
LEPHAREDIR is being set to the default cache directory:
/home/mwoznisk/.cache/lephare/data
More than 1Gb may be written there.
LEPHAREWORK is being set to the default cache directory:
/home/mwoznisk/.cache/lephare/work
Default work cache is already linked.
This is linked to the run directory:
/home/mwoznisk/.cache/lephare/runs/20240822T134206
Number of keywords read in the config file: 86
Number of keywords read in the config file: 86
Configuring estimator...
[memray] Successfully generated profile results.
You can now generate reports from the stored allocation records.
Some example commands to generate reports:
/tmp/deadlock-repro/venv/bin/python3.11 -m memray flamegraph memray.bin
Traceback (most recent call last):
File "/tmp/deadlock-repro/venv/bin/memray", line 8, in <module>
sys.exit(main())
^^^^^^
File "/tmp/deadlock-repro/venv/lib/python3.11/site-packages/memray/commands/__init__.py", line 138, in main
arg_values.entrypoint(arg_values, parser)
File "/tmp/deadlock-repro/venv/lib/python3.11/site-packages/memray/commands/run.py", line 339, in run
_run_with_file_output(args)
File "/tmp/deadlock-repro/venv/lib/python3.11/site-packages/memray/commands/run.py", line 188, in _run_with_file_output
_run_tracker(
File "/tmp/deadlock-repro/venv/lib/python3.11/site-packages/memray/commands/run.py", line 77, in _run_tracker
runpy.run_path(args.script, run_name="__main__")
File "<frozen runpy>", line 291, in run_path
File "<frozen runpy>", line 98, in _run_module_code
File "<frozen runpy>", line 88, in _run_code
File "pz-compute/rail_scripts/rail-estimate", line 250, in <module>
main()
File "pz-compute/rail_scripts/rail-estimate", line 245, in main
ctx = setup(cfg)
^^^^^^^^^^
File "pz-compute/rail_scripts/rail-estimate", line 202, in setup
estimator = setup_estimator(
^^^^^^^^^^^^^^^^
File "pz-compute/rail_scripts/rail-estimate", line 172, in setup_estimator
estimator = Estimator.make_stage(**all_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/deadlock-repro/venv/lib/python3.11/site-packages/ceci/stage.py", line 123, in make_stage
stage = cls(kwcopy, comm=comm, aliases=aliases)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: LephareEstimator.__init__() got an unexpected keyword argument 'aliases'
@hdante FWIW, the strace log isn't very useful. strace shows only system calls, and the entire idea behind futexes - "fast userspace mutexes" - is that they typically don't need to make any system calls, and can get all the work they need to do done in userspace. Since there's no system call when there's no contention, we can't figure anything out about what locks are held by what threads at any given point in time from the strace output.
I think we really do need a `thread apply all bt` showing all 3 threads' stacks at the point where the deadlock has occurred in order to figure out what's going on. Or a reliable reproducer - maybe you can set up a Dockerfile that's able to reproduce the issue?
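(As an aside, the all-threads-backtrace idea has a pure-Python analogue. It cannot capture a hang inside native code like this one - the interpreter never regains control - but it is a quick first check for Python-level deadlocks. A minimal sketch, not part of the original discussion:)

```python
import sys
import threading
import traceback

def dump_all_thread_stacks():
    # Python-level analogue of gdb's "thread apply all bt":
    # sys._current_frames() maps each thread id to its current stack frame.
    names = {t.ident: t.name for t in threading.enumerate()}
    for ident, frame in sys._current_frames().items():
        print(f"Thread {names.get(ident, '<unknown>')} ({ident}):")
        traceback.print_stack(frame)

dump_all_thread_stacks()
```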
Hello, everyone, here follows another gdb backtrace, this time with all 3 threads:
(base) [henrique.almeida@loginapl01 testlephare]$ gdb python
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/bin/python3.11...Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/bin/python3.11]
(no debugging symbols found)...done.
(reverse-i-search)`': ^CQuit
(gdb)
(gdb) run /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/bin/memray run -f -o memray.bin $ALTHOME/bin/rail-estimate -a lephare $SCRATCH/input/hdf5/ref/objectTable_tract_5065_DC2_2_2i_runs_DP0_2_v23_0_1_PREOPS-905_step3_29_20220314T204515Z-part2.hdf5 output1.hdf5
Starting program: /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/bin/python /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/bin/memray run -f -o memray.bin $ALTHOME/bin/rail-estimate -a lephare $SCRATCH/input/hdf5/ref/objectTable_tract_5065_DC2_2_2i_runs_DP0_2_v23_0_1_PREOPS-905_step3_29_20220314T204515Z-part2.hdf5 output1.hdf5
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_typing.cpython-311-x86_64-linux-gnu.so]
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/liblz4-af1653fb.so.1.8.3
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/02/bd76e3669fdcbd9723caa978eda65336967338.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libunwind-d5b5a318.so.8.0.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/ca/ae2c0f7bdf5eef40a96a35ea99f6555bf684e2.debug
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/bin/../lib/libstdc++.so.6]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/bin/../lib/libgcc_s.so.1]
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libssl-cd1d6220.so.1.0.2k
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/7c/bcb0322f585236d81b557ed95c708f98a20c33.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libcrypto-d3570994.so.1.0.2k
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/9b/03a5571e128c07bf955eb2bd8155311295ae50.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/liblzma-004595ca.so.5.2.2
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/f7/44796ea6fa4d8019414f5487ee85de74dc5adc.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libgssapi_krb5-497db0c6.so.2.2
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/0c/aec124d97114da40ddeb0fed1fad5d14c3d626.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libkrb5-fcafa220.so.3.3
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/52/c5b6279da9cd210e5d58d1b1e0e080e5c8b232.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libcom_err-2abe824b.so.2.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/2a/ddb65846a50ce45f0c9b62ea35dda62c6ad7a2.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libk5crypto-b1f99d5c.so.3.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/5f/f9d1075a8d5d62f77f5ce56c935fcd92c62efa.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libkrb5support-d0bcff84.so.0.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/77/9381063daecc27e8480c8f79f0651162586478.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libkeyutils-dfe70bd6.so.1.5
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/2e/01d5ac08c1280d013aab96b292ac58bc30a263.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libselinux-0922c95c.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/90/3a0bd0bfb4fee8c284f41beb9773ded94cbc52.debug
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/../memray.libs/libpcre-9513aab5.so.1.2.0
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/9c/a3d11f018beeb719cdb34be800bf1641350d0a.debug
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/math.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_datetime.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_opcode.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/mmap.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/zlib.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/termios.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_bisect.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_random.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_sha512.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/array.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_decimal.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_bz2.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/../../libbz2.so.1.0]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_lzma.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_socket.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/select.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/fcntl.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_posixsubprocess.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_contextvars.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_heapq.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_ssl.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_struct.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/binascii.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_asyncio.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/unicodedata.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_uuid.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_pickle.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_hashlib.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_blake2.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_json.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_csv.cpython-311-x86_64-linux-gnu.so]
Writing profile results into memray.bin
[New Thread 0x7fffe152d700 (LWP 54672)]
Estimator algorithm: lephare
Estimator calibration file: estimator_lephare.pkl
Bins: 301
HDF5 group name: ""
Column template for magnitude data: "mag_{band}"
Column template for error data: "magerr_{band}"
Extra parameter file: "None"
Starting setup.
Loading all program modules...
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_lsprof.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/resource.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/pyexpat.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_queue.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/grp.cpython-311-x86_64-linux-gnu.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so]
Missing separate debuginfo for /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/mkl/../../../libiomp5.so
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/82/88a89ba4a984ed547d3ee90f99fda1d695461e.debug
[New Thread 0x7fffba3ff700 (LWP 54714)]
^C
Program received signal SIGINT, Interrupt.
0x00007fffed6d554d in __lll_lock_wait () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install glibc-2.17-326.el7_9.x86_64
(gdb) thread apply all bt
Thread 3 (Thread 0x7fffba3ff700 (LWP 54714)):
#0 0x00007fffed6d554d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007fffed6d0eb6 in _L_lock_941 () from /lib64/libpthread.so.0
#2 0x00007fffed6d0daf in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007fffed8f50cb in tls_get_addr_tail () from /lib64/ld-linux-x86-64.so.2
#4 0x00007fffed8fa9f8 in __tls_get_addr () from /lib64/ld-linux-x86-64.so.2
#5 0x00007fffe643593b in thread_id () at src/memray/_memray/tracking_api.cpp:70
#6 memray::tracking_api::Tracker::registerThreadNameImpl(char const*) ()
at src/memray/_memray/tracking_api.cpp:965
#7 0x00007fffe6422a4f in registerThreadName (name=0x7fffbd5f5454 "jemalloc_bg_thd")
at src/memray/_memray/tracking_api.h:289
#8 memray::intercept::prctl(int, ...) () at src/memray/_memray/hooks.cpp:464
#9 0x00007fffed6d77bf in pthread_setname_np () from /lib64/libpthread.so.0
#10 0x00007fffbcf977b1 in background_thread_entry () at src/background_thread.c:432
#11 0x00007fffed6ceea5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007fffecceeb0d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7fffe152d700 (LWP 54672)):
#0 0x00007fffed6d554d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007fffed6d0e9b in _L_lock_883 () from /lib64/libpthread.so.0
#2 0x00007fffed6d0d68 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007fffe642e433 in __gthread_mutex_lock (__mutex=0x97b990)
at /opt/rh/devtoolset-10/root/usr/include/c++/10/x86_64-redhat-linux/bits/gthr-default.h:749
#4 lock (this=0x97b990) at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/std_mutex.h:100
#5 lock_guard (__m=..., this=<optimized out>, this=<optimized out>, __m=...)
at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/std_mutex.h:159
#6 memray::tracking_api::Tracker::BackgroundThread::captureMemorySnapshot() ()
at src/memray/_memray/tracking_api.cpp:692
#7 0x00007fffe642ec12 in operator() (__closure=<optimized out>) at src/memray/_memray/tracking_api.cpp:722
#8 __invoke_impl<void, memray::tracking_api::Tracker::BackgroundThread::start()::<lambda()> > (
__f=<optimized out>) at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/invoke.h:60
#9 __invoke<memray::tracking_api::Tracker::BackgroundThread::start()::<lambda()> > (__fn=<optimized out>)
at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/invoke.h:95
#10 _M_invoke<0> (this=<optimized out>) at /opt/rh/devtoolset-10/root/usr/include/c++/10/thread:264
#11 operator() (this=<optimized out>) at /opt/rh/devtoolset-10/root/usr/include/c++/10/thread:271
#12 std::thread::_State_impl<std::thread::_Invoker<std::tuple<memray::tracking_api::Tracker::BackgroundThread::start()::{lambda()#1}> > >::_M_run() () at /opt/rh/devtoolset-10/root/usr/include/c++/10/thread:215
#13 0x00007fffe6461d40 in execute_native_thread_routine ()
from /lustre/t1/cl/lsst/tmp/henrique.almeida/miniconda3/lib/python3.11/site-packages/memray/_memray.cpython-311-x86_64-linux-gnu.so
#14 0x00007fffed6ceea5 in start_thread () from /lib64/libpthread.so.0
#15 0x00007fffecceeb0d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7fffedaee740 (LWP 54636)):
#0 0x00007fffed6d554d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007fffed6d0e9b in _L_lock_883 () from /lib64/libpthread.so.0
#2 0x00007fffed6d0d68 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007fffe6420c8a in __gthread_mutex_lock (__mutex=<optimized out>)
at /opt/rh/devtoolset-10/root/usr/include/c++/10/x86_64-redhat-linux/bits/gthr-default.h:749
#4 lock (this=<optimized out>) at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/std_mutex.h:100
#5 lock (this=0x7fffffff7c90) at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/unique_lock.h:138
#6 unique_lock (__m=..., this=<optimized out>, this=<optimized out>, __m=...)
at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/unique_lock.h:68
#7 trackAllocation (func=memray::hooks::CALLOC, size=1040, ptr=0x15ff9b0)
at src/memray/_memray/tracking_api.h:225
#8 memray::intercept::calloc(unsigned long, unsigned long) () at src/memray/_memray/hooks.cpp:229
#9 0x00007fffecc29ec4 in __new_exitfn () from /lib64/libc.so.6
#10 0x00007fffecc29f49 in __cxa_atexit_internal () from /lib64/libc.so.6
#11 0x00007fffed8f29c3 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#12 0x00007fffed8f759e in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#13 0x00007fffed8f27d4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
---Type <return> to continue, or q <return> to quit---
#14 0x00007fffed8f6b8b in _dl_open () from /lib64/ld-linux-x86-64.so.2
#15 0x00007fffed4c3fab in dlopen_doit () from /lib64/libdl.so.2
#16 0x00007fffed8f27d4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#17 0x00007fffed4c45ad in _dlerror_run () from /lib64/libdl.so.2
#18 0x00007fffed4c4041 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#19 0x00000000005fab46 in _PyImport_FindSharedFuncptr ()
#20 0x00000000005f9624 in _imp_create_dynamic ()
#21 0x000000000052ef5b in cfunction_vectorcall_FASTCALL ()
#22 0x000000000043353c in _PyEval_EvalFrameDefault.cold ()
#23 0x0000000000538cc3 in _PyFunction_Vectorcall ()
#24 0x0000000000527157 in object_vacall ()
#25 0x000000000055659c in PyObject_CallMethodObjArgs ()
#26 0x00000000005560e7 in PyImport_ImportModuleLevelObject.localalias ()
#27 0x00000000005166e7 in _PyEval_EvalFrameDefault ()
#28 0x00000000005cbeda in _PyEval_Vector ()
#29 0x00000000005cb5af in PyEval_EvalCode ()
#30 0x00000000005e4d83 in builtin_exec ()
#31 0x000000000051e7b7 in cfunction_vectorcall_FASTCALL_KEYWORDS ()
#32 0x000000000043353c in _PyEval_EvalFrameDefault.cold ()
#33 0x0000000000538cc3 in _PyFunction_Vectorcall ()
#34 0x0000000000527157 in object_vacall ()
#35 0x000000000055659c in PyObject_CallMethodObjArgs ()
#36 0x00000000005560e7 in PyImport_ImportModuleLevelObject.localalias ()
#37 0x00000000005166e7 in _PyEval_EvalFrameDefault ()
#38 0x00000000005cbeda in _PyEval_Vector ()
#39 0x00000000005cb5af in PyEval_EvalCode ()
#40 0x00000000005e4d83 in builtin_exec ()
#41 0x000000000051e7b7 in cfunction_vectorcall_FASTCALL_KEYWORDS ()
#42 0x000000000043353c in _PyEval_EvalFrameDefault.cold ()
#43 0x0000000000538cc3 in _PyFunction_Vectorcall ()
#44 0x0000000000527157 in object_vacall ()
#45 0x000000000055659c in PyObject_CallMethodObjArgs ()
#46 0x00000000005560e7 in PyImport_ImportModuleLevelObject.localalias ()
#47 0x000000000056a800 in builtin___import__ ()
#48 0x000000000051e7b7 in cfunction_vectorcall_FASTCALL_KEYWORDS ()
#49 0x000000000043353c in _PyEval_EvalFrameDefault.cold ()
#50 0x0000000000538cc3 in _PyFunction_Vectorcall ()
#51 0x0000000000527157 in object_vacall ()
#52 0x000000000055659c in PyObject_CallMethodObjArgs ()
#53 0x0000000000556301 in PyImport_ImportModuleLevelObject.localalias ()
#54 0x00000000005166e7 in _PyEval_EvalFrameDefault ()
#55 0x00000000005cbeda in _PyEval_Vector ()
#56 0x00000000005cb5af in PyEval_EvalCode ()
#57 0x00000000005e4d83 in builtin_exec ()
#58 0x000000000051e7b7 in cfunction_vectorcall_FASTCALL_KEYWORDS ()
#59 0x000000000043353c in _PyEval_EvalFrameDefault.cold ()
#60 0x0000000000538cc3 in _PyFunction_Vectorcall ()
#61 0x0000000000527157 in object_vacall ()
#62 0x000000000055659c in PyObject_CallMethodObjArgs ()
#63 0x00000000005560e7 in PyImport_ImportModuleLevelObject.localalias ()
#64 0x00000000005166e7 in _PyEval_EvalFrameDefault ()
#65 0x00000000005cbeda in _PyEval_Vector ()
#66 0x00000000005cb5af in PyEval_EvalCode ()
#67 0x00000000005e4d83 in builtin_exec ()
#68 0x000000000051e7b7 in cfunction_vectorcall_FASTCALL_KEYWORDS ()
#69 0x000000000043353c in _PyEval_EvalFrameDefault.cold ()
#70 0x0000000000538cc3 in _PyFunction_Vectorcall ()
#71 0x0000000000527157 in object_vacall ()
#72 0x000000000055659c in PyObject_CallMethodObjArgs ()
#73 0x00000000005560e7 in PyImport_ImportModuleLevelObject.localalias ()
#74 0x00000000005166e7 in _PyEval_EvalFrameDefault ()
#75 0x00000000005cbeda in _PyEval_Vector ()
---Type <return> to continue, or q <return> to quit---
#76 0x00000000005cb5af in PyEval_EvalCode ()
#77 0x00000000005e4d83 in builtin_exec ()
#78 0x000000000051e7b7 in cfunction_vectorcall_FASTCALL_KEYWORDS ()
#79 0x000000000043353c in _PyEval_EvalFrameDefault.cold ()
#80 0x0000000000538cc3 in _PyFunction_Vectorcall ()
#81 0x0000000000527157 in object_vacall ()
#82 0x000000000055659c in PyObject_CallMethodObjArgs ()
#83 0x00000000005560e7 in PyImport_ImportModuleLevelObject.localalias ()
#84 0x00000000005166e7 in _PyEval_EvalFrameDefault ()
#85 0x00000000005cbeda in _PyEval_Vector ()
#86 0x00000000005cb5af in PyEval_EvalCode ()
#87 0x00000000005e4d83 in builtin_exec ()
#88 0x000000000051e7b7 in cfunction_vectorcall_FASTCALL_KEYWORDS ()
#89 0x000000000043353c in _PyEval_EvalFrameDefault.cold ()
#90 0x0000000000538cc3 in _PyFunction_Vectorcall ()
#91 0x0000000000527157 in object_vacall ()
#92 0x000000000055659c in PyObject_CallMethodObjArgs ()
#93 0x00000000005560e7 in PyImport_ImportModuleLevelObject.localalias ()
#94 0x000000000056a800 in builtin___import__ ()
#95 0x000000000051e7b7 in cfunction_vectorcall_FASTCALL_KEYWORDS ()
#96 0x000000000043353c in _PyEval_EvalFrameDefault.cold ()
#97 0x0000000000538cc3 in _PyFunction_Vectorcall ()
#98 0x0000000000527157 in object_vacall ()
#99 0x000000000055659c in PyObject_CallMethodObjArgs ()
#100 0x00000000005560e7 in PyImport_ImportModuleLevelObject.localalias ()
#101 0x00000000005166e7 in _PyEval_EvalFrameDefault ()
#102 0x00000000005cbeda in _PyEval_Vector ()
#103 0x00000000005cb5af in PyEval_EvalCode ()
#104 0x00000000005e4d83 in builtin_exec ()
#105 0x000000000051e7b7 in cfunction_vectorcall_FASTCALL_KEYWORDS ()
#106 0x000000000051e6a1 in PyObject_Vectorcall ()
#107 0x00000000004315b3 in _PyEval_EvalFrameDefault.cold ()
#108 0x00000000005cbeda in _PyEval_Vector ()
#109 0x00000000005cb5af in PyEval_EvalCode ()
#110 0x00000000005ec6a7 in run_eval_code_obj ()
#111 0x00000000005e8240 in run_mod ()
#112 0x00000000005fd192 in pyrun_file ()
#113 0x00000000005fc55f in _PyRun_SimpleFileObject.localalias ()
#114 0x00000000005fc283 in _PyRun_AnyFileObject.localalias ()
#115 0x00000000005f6efe in Py_RunMain.localalias ()
#116 0x00000000005bbc79 in Py_BytesMain ()
#117 0x00007fffecc12555 in __libc_start_main () from /lib64/libc.so.6
#118 0x00000000005bbac3 in _start ()
I've tried reproducing this on a RHEL 7.9 machine, and wasn't able to, either. I get to the same point as @pablogsal reached:
Hello, Matt, yes, I'm afraid you'll need lephare from the master branch for that to work.
@hdante FWIW, the strace log isn't very useful. strace shows only system calls, and the entire idea behind futexes - "fast userspace mutexes" - is that they typically don't need to make any system calls, and can get all the work they need to do done in userspace. Since there's no system call when there's no contention, we can't figure anything out about what locks are held by what threads at any given point in time from the strace output.
I think we really do need a `thread apply all bt` showing all 3 threads' stacks at the point where the deadlock has occurred in order to figure out what's going on. Or a reliable reproducer - maybe you can set up a Dockerfile that's able to reproduce the issue?
The given strace log is useful because I hit the case where the fast userspace path was not taken, and the 2 futex system calls on separate threads guarantee that they are the cause of the deadlock. Posting only the 2 relevant lines:
41784 futex(0x1a49990, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
(...)
41793 futex(0x1a49990, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
I've posted the backtrace for the 3 threads; more library calls are visible now. Meanwhile, I'll try to find the line in the code where the deadlock happens. If the log is not enough, I can make the Docker image.
it's guaranteed by the 2 futex system calls on separate threads that they are the cause of the deadlock
I don't think so - I'm pretty sure only one of those two threads is involved in causing the deadlock (I think these two futex calls correspond to the main thread and the `memray::tracking_api::Tracker::BackgroundThread` thread, but the `memray::tracking_api::Tracker::BackgroundThread` thread hasn't done anything wrong, it's a victim of a deadlock between the main thread and the `jemalloc_bg_thd` thread - I believe these 2 `futex` calls show the 2 threads that are waiting for the lock to be released, but not the thread that actually acquired it)
here follows another gdb backtrace, this time with all 3 threads:
This was enough to figure out what's going on, though.
`jemalloc_bg_thd` has acquired a lock used by Memray for serializing access to the output stream, and is trying to acquire a lock inside glibc guarding the set of thread-local variables in the program.

It's almost certainly related to the changes we made in https://github.com/bloomberg/memray/pull/525 - if that theory is right, I believe the problem will go away if you downgrade memray to 1.11.0 - can you try that for me and confirm if it fixes the issue for you?
And, if it doesn't, please grab a `thread apply all bt` again - the worst case scenario would be if it still deadlocks, but deadlocks in a different place...
Ok, so to make sure I understood, the proposed explanation is that there are 2 futexes being simultaneously grabbed in reversed order by 2 threads?
I'll check with the older version and return. Which version did you use when the program worked?
Which version did you use when the program worked?
The program has been working for Pablo and me with the latest version of Memray. There must be something else that's different about your environment than ours - maybe glibc version, or something? - that's stopping us from reproducing it. I was optimistic that using glibc 2.17 would do the trick, but I wasn't able to reproduce it even with that, so there's still some variable we haven't identified. It may even be a plain old race condition, and you're just consistently managing to nail the right timing and we're not...
the proposed explanation is that there are 2 futexes being simultaneously grabbed in reversed order by 2 threads?
Yep, it's a lock ordering deadlock - one thread is grabbing a lock inside glibc and then trying to acquire a lock inside memray, and the other thread is grabbing the lock inside memray and then trying to acquire the lock inside glibc.
Yes, memray 1.11.0 works, the function finishes, lephare starts estimation, no deadlocks.
OK, thanks for the confirmation. In that case, you've got a workaround, and we'll figure out how to fix this going forward. Most likely we'll wind up reverting #525 and resurrecting #549 instead - we waffled between those two approaches, but it seems like the latter might be safer, albeit a bit hackier - it forces us to do part of the loader's job...
Thanks for your patience in helping us track this down!
Thank you all for the help!
We've released a fix in Memray 1.14 - if you get a chance, @hdante, please make sure that version works for you!
Ok, thank you, I'll test it later this week.
Hello, Pablo, memray 1.14 is working fine with lephare without deadlocks, thank you very much!
Awesome! Thanks for opening the issue!
Is there an existing issue for this?
Current Behavior
Hello, I'm having a possible deadlock when executing memray and it seems to have appeared in the `strace` output log. The relevant snippet is here:

The actual program running is `rail-estimate`, in function `load_slow_modules()`, from here:

https://github.com/linea-it/pz-compute/blob/main/rail_scripts/rail-estimate#L114

That function recursively loads a large number of libraries; I can still find the precise point where it happened if necessary, and check whether some other library could be causing the deadlock instead.
Expected Behavior
Without running memray, the `load_slow_modules()` function works normally and the Python modules are successfully loaded.

Steps To Reproduce
Memray Version
1.13.4
Python Version
3.11
Operating System
Linux
Anything else?
Full strace log attached:
trace.gz