Closed ivoanjo closed 1 week ago
Benchmark execution time: 2024-10-25 15:27:13
Comparing candidate commit dd111b0 in PR branch ivoanjo/prof-10656-crash-address-in-report
with baseline commit 085a91a in branch main
.
Found 0 performance improvements and 0 performance regressions! Performance is the same for 51 metrics, 2 unstable metrics.
Omitted due to size.
Attention: Patch coverage is 12.50000%
with 35 lines
in your changes missing coverage. Please review.
Project coverage is 71.77%. Comparing base (
085a91a
) to head (dd111b0
).
Hmm... there's still two tests that hang and time out in CI:
test test_crash_tracking_bin_debug_stdin ... FAILED
test test_crash_tracking_bin_debug_unix_socket ... FAILED
I have no idea how to debug these tests :/ . Anyone has any hints on how to find what's failing?
Renamed crash_address
to faulting_address
in https://github.com/DataDog/libdatadog/pull/663/commits/0d4bb98e732dce251a1d06a1ba5fa35b116fff4a . There's still 2 tests failing, I'm still looking into them.
Unfortunately I wasn't able yet to figure out what's up with the failing tests... I've ping'd @danielsn to see if he has any idea on how to fix those.
Unfortunately I wasn't able yet to figure out what's up with the failing tests... I've ping'd @danielsn to see if he has any idea on how to fix those.
FWIW I tried those tests locally on an x86_64 Alpine image on the commit this PR is based on (67d4f7a) and main
(a76c5ad as of this writing) and they still fail. If I comment out these lines they pass. Strangely though, the tests seem to be passing on main
and a recent PR. So... I don't know why they fail, but I don't think this PR breaks them?
FWIW I tried those tests locally on an x86_64 Alpine image on the commit this PR is based on (https://github.com/DataDog/libdatadog/commit/67d4f7a19afd7f7301423e400008d42cc9284c04) and main (https://github.com/DataDog/libdatadog/commit/a76c5ade40c8ac83418fc7ed6bb2f319a874b327 as of this writing) and they still fail. If I comment out these lines they pass. Strangely though, the tests seem to be passing on main and a https://github.com/DataDog/libdatadog/pull/680. So... I don't know why they fail, but I don't think this PR breaks them?
Thanks for giving it a stab! I wonder if the issue is still related to some of the test assertions, but I've found these tests are a bit weird... when they fail they don't give a lot of info; some of the fixes in this PR were kinda trial-and-error. :eyes:
(I've also merged main into this PR, to make sure it doesn't fall behind)
I'm... still stuck on this one. I suspect it may be a real issue; specifically, I see the issue going away if I replace
if !sig_info.is_null() && (signum == libc::SIGSEGV || signum == libc::SIGBUS) {
unsafe { Some((*sig_info).si_addr() as usize) }
}
with something else that does not touch the sig_info. I suspect it may have to do something with chaining the signal handlers. :(
Ok so after investigation it turns out the failure in the tests was indeed a real issue -- that was caused by us being so close to the altstack limit in some cases that the small changes in this PR were the "straw that broke the camel's back".
Now that https://github.com/DataDog/libdatadog/pull/693 is merged, I've merged master into this PR and CI should finally be green.
What does this PR do?
While investigating a customer crash, I found that I dearly missed having the address that caused a segfault.
E.g. given this code:
...I want to know that the crash was triggered when 0x1234 was dereferenced, as this information may be key in debugging the issue.
We already get this information in our signal handler, so this PR just adds the needed piping to report it to datadog.
Motivation:
Improve debugging of crashes reported by crashtracker.
Additional Notes:
I added the
faulting_address
as a tag, similar tosiginfo
andsigname
. We may want to send it as some other thing; no strong opinions there.How to test the change?
I was able to test this with the Ruby profiler. I've been trying to update the tests but there's still some failing and I'm kinda struggling with that. Hopefully someone with a bit more rust-fu than me can help out there.
Here's a submitted crash report with a
crashing_address
tag (renamed since this test): https://app.datadoghq.com/dashboard/uwt-n4k-y9a?fromUser=true&refresh_mode=sliding&tpl_var_org_id%5B0%5D=197728&tpl_var_runtime%5B0%5D=ruby&from_ts=1727956296556&to_ts=1727959896556&live=true