Closed. ruflin closed this issue 5 years ago.
The Auditbeat log file (`auditbeat.log`) suddenly cuts off. The system test log shows the exit code was `-1`, implying SIGHUP. Beats only handles SIGINT and SIGTERM.
I tried sending SIGHUP locally - the log file looks exactly the same.
Question now is, why is it getting SIGHUP?
In any case, maybe we should treat SIGHUP the same as SIGINT and SIGTERM so Beats shuts down gracefully?
Forget what I said above. The return code is `1`, not `-1`. I think I know what is happening though, and it does not look pretty.
I can reproduce the error locally on CentOS 7. It currently happens every time for me; I'm surprised it didn't fail the PR build.
What seems to be happening is that librpm installs signal traps for various UNIX signals, including SIGINT and SIGTERM (here). These override the existing ones in Beats. When Auditbeat is terminated (the system test sends SIGTERM), librpm cleans up its open RPM transaction and calls `exit(1)` (that's why the exit code is `1`, here). The process exits, and Beats never gets a chance to run its shutdown code.
This has been a problem for other applications before, e.g. there is this bug report from gdb. Following that, librpm added a way to disable its signal traps altogether (here). Unfortunately, that is not yet available in the default librpm version on CentOS 7 (or 6, for that matter).
What we can do (and what gdb ended up doing) is disable the signal traps after they are set. I have this patch that does this and eliminates the test failures on my local system.
It's not ideal, for two reasons I think:

`rpmdbFreeIterator()`), but it's not ideal and potentially error-prone. I'm not sure at the moment what would happen if we ever failed to clean up. Would we prevent the package manager from running?

@tsg @andrewkroh - what do you think about this?
Would it be possible to use `rpmsqSetInterruptSafety` if it's available, and fall back to manually removing the signal handler otherwise?
> I'm not sure at the moment what would happen if we ever failed to clean up. Would we prevent the package manager from running?

That sounds scary, but if that were a possibility, it could also happen if you killed -9 the `rpm` process while it was running, right? Given how many users `rpm` has, I'd say the risk is very small. So I'm +1 on your patch, thanks for chasing this down!
> > I'm not sure at the moment what would happen if we ever failed to clean up. Would we prevent the package manager from running?
>
> That sounds scary, but if that were a possibility, it could also happen if you killed -9 the `rpm` process while it was running, right? Given how many users `rpm` has, I'd say the risk is very small.
I've tried to simulate this by inserting a long `time.Sleep()` after acquiring all RPM data structures, but before freeing any of them up. Even in that time, `yum` was happy installing and removing packages. So I think we're OK. PR coming.
Flaky Test
Saw this failing once in master. Artifacts are attached: `test_metricsets.Test.test_metricset_package.zip`