Open y3llowcake opened 3 years ago
Thanks for reporting the issue. Our team started to look at this issue. We will get back to you soon.
Curious if you have any intuition about what might be causing this. Based on other crashes, we think we might be victims of subtle memory corruption bugs in HHVM, but given the frequency and predictability of these stack traces, I am more inclined to think the bug is in squangle.
I am curious if the following commit is related to this issue: https://github.com/facebook/squangle/commit/9737cfd6ddf4bdfa3e1fac26659c03ba8d6d5e54
Yes, the commit you mentioned was meant to fix a rare segfault that happens during connection cleanup, which looks related to this issue.
Yep, cherry-picking that commit fixed this issue for Slack :)
We (Slack) have been seeing a very slow trickle of segfaults from code paths in the cleanup timer in our production environment. This issue is not new; it has been occurring for a while. I do not yet have a repro for these segfaults.
Relevant version of squangle we are running:
There are two unique stack traces we see. The first is more frequent and appears to occur on a call to std::unordered_map::erase():
The second one looks like use of an invalid map iterator ('ref_iter->second', where ref_iter is probably pointing to end()?):
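To illustrate the kind of hazard both traces hint at, here is a minimal sketch of the defensive pattern around find() and erase() on a std::unordered_map. It is not squangle's code: RefMap and release_ref are hypothetical names, and the reference-count table is only an assumed stand-in for whatever the cleanup path actually tracks.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a per-key reference-count table.
// This is NOT squangle's actual data structure.
using RefMap = std::unordered_map<std::string, std::size_t>;

// Decrements the count for `key` and erases the entry once it reaches zero.
// Returns false when the key is absent instead of dereferencing end().
bool release_ref(RefMap& refs, const std::string& key) {
    auto ref_iter = refs.find(key);
    if (ref_iter == refs.end()) {
        // Reading ref_iter->second here would be undefined behavior.
        return false;
    }
    if (ref_iter->second > 0) {
        --ref_iter->second;
    }
    if (ref_iter->second == 0) {
        // ref_iter is invalidated by erase() and must not be used afterwards.
        refs.erase(ref_iter);
    }
    return true;
}
```

A check like this only covers the single-threaded case. If the map is modified from another thread without synchronization, or the memory is already corrupted, find() and erase() can still crash regardless of the end() check.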
Additionally noteworthy is our non-standard hacklang pool configuration: