elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats
Other
103 stars 4.92k forks source link

[Auditbeat] Flaky test_metricsets.Test.test_metricset_package #10633

Closed ruflin closed 5 years ago

ruflin commented 5 years ago

Flaky Test

Saw this failing once in master. Artifacts are attached

test_metricsets.Test.test_metricset_package.zip

cwurm commented 5 years ago

The Auditbeat log file (auditbeat.log) suddenly cuts off. The system test log shows the exit code was -1, implying SIGHUP. Beats only handles SIGINT and SIGTERM:

https://github.com/elastic/beats/blob/97920c9e68794131894c8446b31ba20578456d24/libbeat/service/service.go#L47

I tried sending SIGHUP locally - the log file looks exactly the same.

Question now is, why is it getting SIGHUP?

In any case, maybe we should treat SIGHUP the same as SIGINT and SIGTERM so Beats shuts down gracefully?

cwurm commented 5 years ago

Forget what I said above. The return code is 1, not -1. I think I know what is happening though, and it does not look pretty.

I can reproduce the error locally on CentOS 7. It currently happens every time for me, I'm surprised it didn't fail the PR build.

What seems to be happening is that librpm installs signal traps for various UNIX signals, including SIGINT and SIGTERM (here). This overrides the existing ones in Beats. When Auditbeat is terminated (the system test sends SIGTERM), librpm cleans up its open RPM transaction and calls exit(1) (that's why the exit code is 1, here). The process exits, Beats never gets a chance to run its shutdown code.

This has been a problem for other applications before, e.g. there is this bug report from gdb. Following that, librpm added a way to disable its signal traps altogether (here). Unfortunately, that is not yet available in the default librpm version on CentOS 7 (or 6, for that matter).

What we can do (and what gdb ended up doing) is disable the signal traps after they are set. I have this patch that does this and eliminates the test failures on my local system.

It's not ideal, for two reasons I think:

  1. While we would disable the signal traps right after they are set, there is a very short moment where they are set. We're only talking maybe a few milliseconds, and it's only relevant if one of those signals is received (i.e. Beats is shutting down).
  2. Disabling the clean up code of a library is not a good idea in general. There is a reason librpm has this. We can do our own clean up (and we do, deferring rpmdbFreeIterator()), but it's not ideal and potentially error-prone. I'm not sure at the moment what would happen if we ever failed to clean up. Would we prevent the package manager from running?

@tsg @andrewkroh - what do you think about this?

andrewkroh commented 5 years ago

Would it possible to use rpmsqSetInterruptSafety if it's available then fall-back to the manually removing the signal handler?

tsg commented 5 years ago

I'm not sure at the moment what would happen if we ever failed to clean up. Would we prevent the package manager from running?

That sounds scary, but if that would be a possibility, it could also happen if you killed -9 the rpm process while running, right? Given how many users rpm has, I'd say the risk of it is overly small. So I'm +1 on your patch, thanks for chasing this down!

cwurm commented 5 years ago

I'm not sure at the moment what would happen if we ever failed to clean up. Would we prevent the package manager from running?

That sounds scary, but if that would be a possibility, it could also happen if you killed -9 the rpm process while running, right? Given how many users rpm has, I'd say the risk of it is overly small.

I've tried to simulate this by inserting a long time.Sleep() after acquiring all RPM data structures, but before freeing any of them up. Even in that time, yum was happy installing and removing packages. So I think we're ok. PR coming.