Open adrienbernede opened 1 month ago
@daboehme It appears that recent changes in the Caliper main branch fixed the issues we were seeing with the cce compilers. There remains an issue with rocm 6.2.0 that I would like you to look at: https://lc.llnl.gov/gitlab/radiuss/Caliper/-/jobs/2148980 Thank you.
@daboehme any idea what could be causing this?
5/5 Test #5: CI_app_tests .....................***Failed 45.26 sec
..................................Efree(): double free detected in tcache 2
Efree(): double free detected in tcache 2
E......................cali-query: Error reading stdin: Unknown/invalid record: __rec=n
E............EEEE....E.....E...
Hi @adrienbernede, where did you see this happening? Can't find it in any of the recent CI results.
@daboehme I think you just missed it. It's right after the test summary in the logs of the only failing job:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ 2024-10-08 10:15:13-07:00 ~ Testing Caliper
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Cannot find file: /dev/shm/tioga14-2161536/build_caliper-linux-rhel8-zen2-rocmcc@6.2.0/DartConfiguration.tcl
Site:
Build name: (empty)
Create new tag: 20241008-1715 - Experimental
Cannot find file: /dev/shm/tioga14-2161536/build_caliper-linux-rhel8-zen2-rocmcc@6.2.0/DartConfiguration.tcl
Test project /dev/shm/tioga14-2161536/build_caliper-linux-rhel8-zen2-rocmcc@6.2.0
Start 1: test-caliper-common
1/5 Test #1: test-caliper-common .............. Passed 0.01 sec
Start 2: test-caliper-reader
2/5 Test #2: test-caliper-reader .............. Passed 0.01 sec
Start 3: test-adiak-services
3/5 Test #3: test-adiak-services .............. Passed 1.13 sec
Start 4: test-caliper
4/5 Test #4: test-caliper ..................... Passed 0.75 sec
Start 5: CI_app_tests
5/5 Test #5: CI_app_tests .....................***Failed 45.26 sec
..................................Efree(): double free detected in tcache 2
Efree(): double free detected in tcache 2
E......................cali-query: Error reading stdin: Unknown/invalid record: __rec=n
E............EEEE....E.....E...
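For readers scanning a long job log like the one above, the failing test can be picked out with a small filter. This is only a minimal sketch: `failed_tests` is a hypothetical helper name, not part of Caliper or the CI scripts, and it assumes ctest summary lines of the form shown in the log (`N/M Test #N: name ...***Failed ...`).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Caliper or its CI): read a ctest log on
# stdin and print the name of every test whose summary line is marked failed.
# Assumes summary lines shaped like:
#   "5/5 Test #5: CI_app_tests .....................***Failed 45.26 sec"
failed_tests() {
    # Keep only lines containing the literal marker "***Failed", then strip
    # everything up to "Test #N: " and everything after the test name.
    grep -F '***Failed' | sed -E 's/^.*Test +#[0-9]+: +([^ ]+) .*$/\1/'
}
```

With the job output saved to a file (say `job.log`, an assumed name), `failed_tests < job.log` would print one failing test name per line.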
Hi @adrienbernede, thanks, I found it. I tried building Caliper with the same compiler and libraries, but I can't reproduce these issues: all tests run fine for me. It also doesn't seem like the CI has been running this particular configuration lately. Can we simply retry running this config? Maybe it was a hardware issue or something.
Hello @daboehme
I ran the job again and it failed the same way.
The easiest way to reproduce the issue is to use the in-log reproducer: each CI job is set to print a reproducer script. Here it looks like:
working_dir="/usr/workspace/${USER}/Caliper/2222036-$(date +%s)"
mkdir -p ${working_dir} && cd ${working_dir}
git clone https://github.com/LLNL/Caliper.git --single-branch --depth=1
cd Caliper
git fetch origin --depth=1 c634187441c3ad88420de7d00ca642b78dd14da5
git checkout c634187441c3ad88420de7d00ca642b78dd14da5
git submodule update --init --recursive
# Required variables
export MODULE_LIST=""
export SPEC="+tests +rocm amdgpu_target=gfx90a %rocmcc@=6.2.0 ^hip@6.2.0 "
# Allow to set job script for debugging (only this differs from CI)
export DEBUG_MODE=true
flux watch $(flux batch -o output.stdout.type=kvs --nodes=1 --begin-time=+5s ./scripts/gitlab/build-and-test.sh)
Please note that the failing job is new: we were previously testing with rocm@6.1.1 and this PR updates rocm to 6.2.0.
Summary
Supersedes #588
This PR :
⚠️ TODO Before Merge:
.uberenv-config.json