meshcat_manual_test segfaults on exit on mac when a browser is still connected

RussTedrake commented 3 years ago

To reproduce, run bazel run //geometry:meshcat_manual_test, point a browser to the meshcat url, and then press enter through the rest of the program. It will end with...

Now we'll run the simulation (you should see the robot fall down).
I've added a button and a slider to the controls menu.
- Click the ButtonTest button a few times.
- Move SliderTest slider.
- Open a second browser (http://localhost:7004) and confirm that moving the slider in one updates the slider in the other.
[Press RETURN to continue].

Got 0 clicks on ButtonTest.
Got 0.5 value for SliderTest.
Exiting...
zsh: segmentation fault  bazel run //geometry:meshcat_manual_test

RussTedrake commented 3 years ago

I spent some time rolling back in my git commits, and find that I get this error even when I checkout the original commit 705d6da471ac423677cedaa400f1e63f2677b20b (where I know it didn't fail at the time, because @EricCousineau-TRI and I had iterated on it during review). It still segfaults on my mac.

Perhaps the mac (now big sur) and compiler have moved forward, causing this issue?
Perhaps git checkout + bazel run are not enough for my roll-backs? (running install_prereqs on the old branch, as expected, did not roll anything back.)

It's proving hard to debug since I can't run the memcheck on it, and it does not segfault on ubuntu (where I can run in debug mode).

rpoyner-tri commented 3 years ago

FWIW, I tried some weak version of this (imitate browser with wget) on macsim (which now has Xcode 13) and did not see the segfault. I can probably host multiple versions of Xcode if necessary; what one did you test against?

rpoyner-tri commented 3 years ago

Nm, it definitely segfaults if I connect a browser and leave it open; macsim with big sur 11.6 (iirc) and Xcode 13.

RussTedrake commented 3 years ago

@rpoyner-tri -- thanks for looking!

I finally found a way to debug, by running command-line lldb on mac (i don't typically debug on mac because I haven't successfully made lldb and vscode work together yet): bazel build --config=apple_debug //geometry:meshcat_manual_test && lldb bazel-bin/geometry/meshcat_manual_test

I wish I understood more deeply why, but when I call ws->close() it seems to destroy the iterator in my std::set<WebSocket*>. Will PR the fix now.

SeanCurtis-TRI commented 3 years ago

FTR My best guess is that it's not the act of calling ws->Close() that messes up your iterator. I believe it's a simple race condition. It all comes down to loop_->defer(). Your telling the library to iterate through a structure in another thread at some other time, but the structure gets deleted in this thread. I assume that this thread isn't blocking on the closure of the sockets and deletes the set and then the other thread trying to loop through blows up.

This may be completely wrong and there are certain artifacts that suggest this is incorrect (e.g., presence of a call to join()). I haven't fully grokked all of the moving parts to know this isn't the case, but a destructor that dispatches work for another thread to operate on the data destroyed by that destructor seems ripe for problems.

It's worth noting that in the code you have the following warning:

These pointers should only be accessed in the main thread, but the objects they are pointing to should be only used in the websocket thread.

applied to (among others) websockets_ and listen_socket_.

However, by passing the whole for loop into the loop_->defer() you are implicitly confounding that warning.

RobotLocomotion / drake

meshcat_manual_test segfaults on exit on mac when a browser is still connected #15821