Leak Sanitizer Segfaulting in CI

wernerpe commented 2 days ago

What happened?

We ran into confusing behavior while looking into one of the CI tests that runs under the leak sanitizer. In the planning directory, the visibility graph test and the iris zo test (#22168) fail sporadically in CI when running 'linux-jammy-clang-bazel-experimental-leak-sanitizer'. To reproduce the error locally, we ran something like

bazel test --runs_per_test=10 --config=clang --compilation_mode=dbg --config=lsan //planning:visibility_graph_test

on Ubuntu 22.04 and found that typically 2-3 out of 10 runs would segfault. From what we could tell, the segfault is triggered before entering the test body, and only when more than a single thread was requested.
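
For anyone who wants to measure the failure rate outside of `--runs_per_test` (for example, by running the already-built test binary directly), here is a minimal bash sketch that runs a command N times and counts non-zero exits. The helper name `count_failures` is hypothetical and not part of Drake or Bazel, and it counts any non-zero exit status, not only segfaults.

```shell
#!/usr/bin/env bash
# count_failures N CMD [ARGS...]
# Hypothetical helper: runs CMD N times and prints "<failures>/<N>".
count_failures() {
  local runs="$1"
  shift
  local failures=0
  local i
  for ((i = 0; i < runs; i++)); do
    # Discard the command's output; only the exit status is inspected.
    if ! "$@" >/dev/null 2>&1; then
      failures=$((failures + 1))
    fi
  done
  echo "${failures}/${runs}"
}
```

A possible invocation, assuming the binary was built with the same flags and sits at Bazel's conventional output path, would be `count_failures 10 bazel-bin/planning/visibility_graph_test`. Running the binary directly bypasses any environment that `bazel test` sets up, so results may differ from the CI job.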

The commit SHA I have added below points to a commit on my fork of Drake (from which I opened PR #22168).

Version

34437bc4957b96eb7ebfdf5421646622c3aa7d56

What operating system are you using?

Ubuntu 22.04

What installation option are you using?

No response

Relevant log output

No response

calderpg-tri commented 2 days ago

Some additional information:

- Segfaults are reproducible both with and without OpenMP enabled in the build.
- Segfaults are reproducible with only one thread specified (e.g. DRAKE_NUM_THREADS=1).
- The only segfault backtraces @sammy-tri and I could reliably produce were deep in LSAN startup code, well before any log messages were printed (which is well before anything parallel gets run in either test).

calderpg-tri commented 2 days ago

I was able to reproduce the segfault on 24.04 using clang-15. After switching to clang-18 on 24.04, I was unable to reproduce the segfault in 1000 runs of the test. I am inclined to say this is an LSAN bug.
