Closed RussTedrake closed 3 years ago
I spent some time rolling back in my git commits, and find that I get this error even when I checkout the original commit 705d6da471ac423677cedaa400f1e63f2677b20b (where I know it didn't fail at the time, because @EricCousineau-TRI and I had iterated on it during review). It still segfaults on my mac.
install_prereqs
on the old branch, as expected, did not roll anything back.)It's proving hard to debug since I can't run the memcheck on it, and it does not segfault on ubuntu (where I can run in debug mode).
FWIW, I tried some weak version of this (imitate browser with wget) on macsim
(which now has Xcode 13) and did not see the segfault. I can probably host multiple versions of Xcode if necessary; what one did you test against?
Nm, it definitely segfaults if I connect a browser and leave it open; macsim
with big sur 11.6 (iirc) and Xcode 13.
@rpoyner-tri -- thanks for looking!
I finally found a way to debug, by running command-line lldb on mac (i don't typically debug on mac because I haven't successfully made lldb and vscode work together yet): bazel build --config=apple_debug //geometry:meshcat_manual_test && lldb bazel-bin/geometry/meshcat_manual_test
I wish I understood more deeply why, but when I call ws->close()
it seems to destroy the iterator in my std::set<WebSocket*>
. Will PR the fix now.
FTR My best guess is that it's not the act of calling ws->Close()
that messes up your iterator. I believe it's a simple race condition. It all comes down to loop_->defer()
. Your telling the library to iterate through a structure in another thread at some other time, but the structure gets deleted in this thread. I assume that this thread isn't blocking on the closure of the sockets and deletes the set and then the other thread trying to loop through blows up.
This may be completely wrong and there are certain artifacts that suggest this is incorrect (e.g., presence of a call to join()
). I haven't fully grokked all of the moving parts to know this isn't the case, but a destructor that dispatches work for another thread to operate on the data destroyed by that destructor seems ripe for problems.
It's worth noting that in the code you have the following warning:
These pointers should only be accessed in the main thread, but the objects they are pointing to should be only used in the websocket thread.
applied to (among others) websockets_
and listen_socket_
.
However, by passing the whole for loop into the loop_->defer()
you are implicitly confounding that warning.
To reproduce, run
bazel run //geometry:meshcat_manual_test
, point a browser to the meshcat url, and then press enter through the rest of the program. It will end with...