elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

Implement Longevity test for Elastic Agent and Beats #3833

Closed: jlind23 closed this 6 months ago

jlind23 commented 10 months ago

The team recently fixed a bug in the elastic-agent-system-metrics codebase where we were continuously increasing our memory consumption on Windows until BSOD. Here is the issue raised by the community: https://github.com/elastic/beats/issues/37142
Here is the fix made by @leehinman: https://github.com/elastic/elastic-agent-system-metrics/pull/115

One way to avoid this happening again would be to run a cloud stack with Windows, Mac & Linux agents for several days after each BC, with data continuously ingested from the system. Based on this data we should set up alerts for memory usage.

This test must make sure that the agent is doing representative work. To minimize the expected time to identify resource leaks, we should accelerate the natural rate of data ingestion and change in the system to stress the agent and accelerate any leaks that may be present.

We can do this by:

This will exclude the code paths that interact with Fleet, but we can address that as a follow up. Likely the policy reconfiguration can be driven with our fake fleet server as a future enhancement, with the agent configured to check in rapidly.

We can introduce the leak detection checks we need in phases, tracked in:

elasticmachine commented 10 months ago

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

jlind23 commented 10 months ago

@cmacknz @leehinman does the Elastic Agent testing framework have the necessary infrastructure for this to happen?

cmacknz commented 10 months ago

Hmm, it can orchestrate the test setup (installing agent on a VM, creating a stack deployment, etc.) but we don't have a concept of a test that runs for hours or days intentionally. This is probably possible though.

We could try writing dedicated soak integration tests triggered on a schedule, that run for a full day and periodically check the metrics and health of the agent doing representative tasks. We can start with the system integration installed and expand from there.

cmacknz commented 10 months ago

The alert would just be the test failing which is a nice way to get an alert about this.

cmacknz commented 10 months ago

One thing that jumped out from https://github.com/elastic/beats/issues/37142 that could allow us to write a more focused test for leaking handles on Windows: there is a metric for open process handles, and it increased over time as this problem was happening.

Here is a screenshot from the linked issue showing this:

[Screenshot from the linked issue: the agent's open process handles metric climbing steadily over time]

Likely we can write a much shorter test to detect leaking file and process handles by starting the system integration, running through at least one data collection interval, and verifying that the number of handles is stable.
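
As a rough sketch (not an existing helper in our test framework), reading a process's open handle count on Windows via the Win32 GetProcessHandleCount API could look like this in Go; the package and function names are illustrative, and the test would still need to resolve the PIDs of the agent/Beat processes:

```go
//go:build windows

package leakcheck

import (
	"fmt"
	"unsafe"

	"golang.org/x/sys/windows"
)

var (
	kernel32                  = windows.NewLazySystemDLL("kernel32.dll")
	procGetProcessHandleCount = kernel32.NewProc("GetProcessHandleCount")
)

// openHandleCount returns the number of handles currently open in the process
// with the given PID, using the Win32 GetProcessHandleCount API.
func openHandleCount(pid uint32) (uint32, error) {
	h, err := windows.OpenProcess(windows.PROCESS_QUERY_LIMITED_INFORMATION, false, pid)
	if err != nil {
		return 0, fmt.Errorf("OpenProcess(%d): %w", pid, err)
	}
	defer windows.CloseHandle(h)

	var count uint32
	r, _, callErr := procGetProcessHandleCount.Call(uintptr(h), uintptr(unsafe.Pointer(&count)))
	if r == 0 {
		return 0, fmt.Errorf("GetProcessHandleCount(%d): %w", pid, callErr)
	}
	return count, nil
}
```

A test could record this before and after at least one collection interval and fail if the count keeps climbing.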

pazone commented 10 months ago

We can probably use pprof from integration (or another type of) tests to get the metrics and produce comparable reports.

jlind23 commented 10 months ago

pprof from integration

How feasible would it be to generate a pprof every time integration tests are run? I guess they will be generated on the provisioned VM and then have to be moved/uploaded to a central place right?

cmacknz commented 10 months ago

How feasible would it be to generate a pprof every time integration tests are run?

Our diagnostics collect these at the end of every failing test already.

For determining if there is a leak we don't need pprof, pprof will tell us where and why we have a leak.

Instead we should be able to monitor the memory of the running agent processes. We can do this in a few ways:
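
For illustration, one of those ways could be to sample the resident set size of each agent process directly, e.g. with gopsutil (a sketch, not necessarily the approach we will pick; the agent's own monitoring output or elastic-agent-system-metrics could serve equally well):

```go
package monitor

import (
	"fmt"

	"github.com/shirou/gopsutil/v3/process"
)

// rssBytes returns the resident set size, in bytes, of the process with the
// given PID. A longevity test can sample this periodically for each agent
// component process and look for sustained growth.
func rssBytes(pid int32) (uint64, error) {
	p, err := process.NewProcess(pid)
	if err != nil {
		return 0, fmt.Errorf("looking up process %d: %w", pid, err)
	}
	mem, err := p.MemoryInfo()
	if err != nil {
		return 0, fmt.Errorf("reading memory info for %d: %w", pid, err)
	}
	return mem.RSS, nil
}
```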

mbudge commented 9 months ago

We're going to upgrade to the 4th maintenance release to mitigate this risk, as we're going to be deploying Elastic Agent to our Windows production network, which supports core banking operations, over the next few months.

You need to test for longer before a GA Elastic Agent release. If Elastic Agent causes an outage in business operations by crashing servers I'll get sacked 100%. We're already banned from using Elastic Defend because it maxed out CPU on 16-core production banking servers.

We need a strict policy where the 4th maintenance release is fully tested and stable, and every minor release has a 4th maintenance release.

Elastic also needs to add rollback and bug notification into Fleet. Fleet/Elastic agent might be GA but you’re still building the platform so there’s a high risk of more severe bugs in the future.

RicardoCst commented 9 months ago


Same here, we can't use Elastic Defend after we found that one of our production machines was overloaded.

cmacknz commented 9 months ago

Thanks for the feedback, we know we need to be better here.

Elastic also needs to add rollback and bug notification into Fleet. Fleet/Elastic agent might be GA but you’re still building the platform so there’s a high risk of more severe bugs in the future.

This is tracked in https://github.com/elastic/kibana/issues/172745. We know we need this, but we have to work through some of the backwards compatibility implications, since if we are going to enable it we want to make sure it works reliably.

We are separately working to enable releasing fixes for Elastic Agent much faster. Ideally we want to be able to ship the fix for problems like https://github.com/elastic/beats/issues/37142 within 24 hours of the fix being made. This is balanced against the need for adequate soak testing as noted.

cmacknz commented 9 months ago

Ideally we want to be able to ship the fix for problems like https://github.com/elastic/beats/issues/37142 within 24 hours of the fix being made. This is balanced against the need for adequate soak testing as noted.

I should specifically note we also know we shouldn't have released this in the first place. In addition to ensuring we can react to unexpected problems faster we are putting more tests, checks, and processes in place to avoid unexpected problems wherever we can.

fearful-symmetry commented 9 months ago

So, I'm assuming we want to start with a "base" configuration (system metrics, logs) and expand from there. Considering the nature of the test, we can and probably should expand to a variety of configs/use cases/stories/whatever.

For the base case, we still have a few questions:

  1. Can we just run this as a cron/scheduled job in Buildkite? Are there some timeouts in Buildkite if we try to run a 24h job? Buildkite's schedule feature is annoyingly limited. Guess we'll find out.
  2. Aside from the obvious "it blows up" what are the cases that should fail the test, if any? Do we want some threshold for memory/CPU usage as an absolute value or percentage?
  3. Do we have some mechanism for notifying people on failures? Considering how often Buildkite goes red, I'm a little worried that some blinking light in some dashboard somewhere is going to get ignored.
  4. Is there some kind of tooling or analysis we should rope in? If this was C/C++ I would say valgrind or something, but I guess @leehinman would know if there's some kind of dynamic analysis tool that would tell us if we're not releasing Windows file handles.

cmacknz commented 9 months ago

Are there some timeouts in Buildkite if we try to run a 24h job?

For the purposes of this test we don't need to run for 24h. We need to run for long enough to detect that resource usage is increasing beyond what we expect it to be. I expect that most true leaks will be observable much faster than that, for example over a 1h period memory usage should not be trending up continuously if the system is at steady state in terms of the work we have made agent do.
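
One way to make "not trending up continuously" concrete (a sketch, not necessarily the heuristic the test will use) is to fit a least-squares slope to periodic memory samples and require it to stay near zero at steady state:

```go
package leakcheck

// memSlope returns the least-squares slope of a series of evenly spaced
// memory samples (bytes per sample). At steady state the slope should hover
// around zero; a slope that stays clearly positive across the whole run is a
// hint that memory usage is trending up continuously.
func memSlope(samples []float64) float64 {
	n := float64(len(samples))
	if n < 2 {
		return 0
	}
	var sumX, sumY, sumXY, sumXX float64
	for i, y := range samples {
		x := float64(i)
		sumX += x
		sumY += y
		sumXY += x * y
		sumXX += x * x
	}
	denom := n*sumXX - sumX*sumX
	if denom == 0 {
		return 0
	}
	return (n*sumXY - sumX*sumY) / denom
}
```

A check could normalize the slope by the mean sample value and fail only when it exceeds a small tolerance, to tolerate normal jitter.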

Aside from the obvious "it blows up" what are the cases that should fail the test, if any? Do we want some threshold for memory/CPU usage as an absolute value or percentage?

The important thing to do is establish a baseline of resource usage and what variation it has. How long do we need to watch agent for resource usage to stabilize? We also need to make sure we are doing representative work, the system integration needs to be installed and continuously ingesting logs and metrics for example.

Once we establish a baseline of resource usage we could then unconditionally fail the test whenever it is exceeded, however that baseline ends up being defined; failing will automatically result in us capturing heap profiles and diagnostics.

A good exercise would be to create a branch which reverts the fix for https://github.com/elastic/beats/issues/37142 or otherwise introduces an obvious resource leak and make sure that whatever conditions you put in place detect it. It could also be valuable to take a closer look at how quickly the open handles and memory usage increases.

For problems related to metrics collection the rate at which we leak handles/memory/whatever is likely to correlate with the metrics collection interval, so we can accelerate that to get the system to fail faster. We can have agent collect metrics every 1s or faster for example.

Do we have some mechanism for notifying people on failures? Considering how often Buildkite goes red, I'm a little worried that some blinking light in some dashboard somewhere is going to get ignored.

Continue to use the existing build failure notifications. Cleaning those up is happening as a separate effort.

mbudge commented 9 months ago

Elastic Defend on pre-5.10.16 versions of the Linux kernel falls back to a debugger called tracefs instead of using eBPF. We found this by running Elastic Defend on 16-core production Linux servers which were built 2-3 years ago (CPU ran between 60-98%). They get patched but the kernel doesn't get updated. It was either that or monitoring auditd system calls. https://github.com/elastic/ebpf

Elastic-Defend would be good to include in longevity tests, using Windows/Linux servers running under load to replicate production (both new and old Linux kernels).

cmacknz commented 9 months ago

Elastic-Defend would be good to include in longevity tests, using Windows/Linux servers running under load to replicate production (both new and old Linux kernels).

The test suite we use for Elastic Agent here can and does run Elastic Defend. There is also an entirely separate team dedicated purely to Defend covering Defend-specific problems; we primarily focus on ensuring we can install/upgrade/monitor Defend through Elastic Agent.

We also have dedicated performance testing infrastructure using our BPF based whole system profiler, but BPF doesn't help on Windows. There is some overlap here with what that infrastructure does, which is more focused on throughput testing than cross-platform functional testing.

fearful-symmetry commented 9 months ago

@cmacknz

So, it sounds to me like we want to roll this out in phases:

1. Run the system integration for some period of a few hours, using https://github.com/elastic/beats/issues/37142 as a baseline for time (does anyone know what release that was off the top of their head?). Report CPU/memory usage.
2. Add conditional failures for exceeding some memory/CPU watermark.
3. Expand beyond system integration, perhaps breaking out into different tests so we can better measure resource usage.

cmacknz commented 9 months ago

Expand beyond system integration,

Yes. We will want this enabled by default for every test we have eventually. Ideally we can find a way to avoid just hard coding the pass/fail thresholds, and come up with something like "total memory usage at the end of the test needs to be within X% of the starting memory usage".

We should also have an absolute threshold as a sanity check: we don't want to be using 1 GB or 1.1 GB of memory doing something simple and not notice it, but if we depend on a finely tuned threshold it is going to end up being too flaky.
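
A minimal sketch of such a combined check, assuming we can sample memory at the start and end of the test (the function name and both thresholds are illustrative):

```go
package leakcheck

// withinBudget reports whether end-of-test memory usage is acceptable: it must
// be within maxGrowthPct percent of the starting usage, and it must also stay
// under an absolute cap so a "small" relative increase on an already large
// footprint still fails the sanity check.
func withinBudget(startBytes, endBytes uint64, maxGrowthPct float64, capBytes uint64) bool {
	if endBytes > capBytes {
		return false
	}
	if startBytes == 0 {
		return endBytes <= capBytes
	}
	growthPct := (float64(endBytes) - float64(startBytes)) / float64(startBytes) * 100
	return growthPct <= maxGrowthPct
}
```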

I would suggest starting with the open handles metric in https://github.com/elastic/elastic-agent/issues/3833#issuecomment-1833892360 and look at memory afterwards. There is no reason that should be increasing continuously. Files we are reading will stay open but they shouldn't be reopened continuously.

I hope that is an easy one to add and would immediately prevent handle leaks on Windows, you should be able to add this and prove it would have detected the previous snapshot handle leak.

fearful-symmetry commented 9 months ago

Yes. We will want this enabled by default for every test we have eventually.

What do you mean by this? Some kind of handle/memory metric?

cmacknz commented 9 months ago

What do you mean by this? Some kind of handle/memory metric?

Ideally every integration test we have should have a check at the end of it to ensure it didn't leak memory, handles, or any other resource leak we can feasibly detect.

fearful-symmetry commented 9 months ago

So, as far as I can tell, there's nothing in Metricbeat for Windows that reports the count of open file handles; the FD.Open value in elastic-agent-system-metrics isn't populated on Windows either. @leehinman it looks like netdata has that at least, do you know how difficult it would be to add it to the metrics reporter?

cmacknz commented 8 months ago

Spoke with @fearful-symmetry in our team meeting today, we need to make sure that the test forces the agent to do real work at an accelerated rate. This should help us identify leaks much faster. My ideal test here is primarily a stress test that after 15-30 minutes can detect increasing resource usage. This will make it possible to run this test directly in our PRs instead of separately on a schedule. We can do this by:

I've added these steps to the issue description.

fearful-symmetry commented 8 months ago

@cmacknz so, I'm a little worried about the first item on that list, and I wonder if perhaps it should be broken out to a separate unit test or something? Wouldn't any relevant policy updates force restarts of components or underlying modules in beats? If so, that might interrupt any kind of in-process resource leak.

cmacknz commented 8 months ago

Wouldn't any relevant policy updates force restarts of components or underlying modules in beats? If so, that might interrupt any kind of in-process resource leak.

The agent should not restart components that are unmodified. It would only be starting and stopping the input we are adding and removing.

There are some quirks to this, like if you start changing output parameters the Beats will restart, which is why I suggested only adding and removing a single input.

I am fine with breaking this up and treating this as a parent level meta issue. What I don't want is for this issue to be closed if we have only solved the problem for a narrow set of possible resource leaks.

fearful-symmetry commented 8 months ago

Yah, the first item is different enough that I feel like it should probably be spun out into a different issue, since so far this has been focused on the behavior of the integrations and resource leaks stemming from OS/external APIs. The PR has been updated to reduce the collection interval to 1s.

cmacknz commented 8 months ago

I had a thought on how we could "test the test" here as part of our exercise to prove this can actually detect resource leaks:

Can we implement an input that can be configured to leak memory and OS level resources in various ways? We could constantly leak goroutines at a configurable rate, leak Windows handles, file handles, open an increasing number of sockets, etc.
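
As a sketch of what the goroutine-leaking mode of such an input could look like (the package, function, and parameters are illustrative, not an existing input type):

```go
package leakyinput

import (
	"runtime"
	"time"
)

// leakGoroutines deliberately leaks one blocked goroutine (plus the buffer it
// pins) per interval until stop is closed. The rate and pinned size are
// configurable so a test can tune how quickly the leak becomes visible.
func leakGoroutines(interval time.Duration, pinBytes int, stop <-chan struct{}) {
	block := make(chan struct{}) // never closed: leaked goroutines park here forever
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			go func() {
				buf := make([]byte, pinBytes)
				<-block                // parks forever, so the goroutine never exits
				runtime.KeepAlive(buf) // keeps buf reachable while the goroutine is parked
			}()
		case <-stop:
			return
		}
	}
}
```

Windows handle, file handle, and socket leaks could be added as further modes along the same lines, each at a configurable rate.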

This would give us a way to evaluate the effectiveness and stability of the test. We are going to need to spend considerable time convincing ourselves the false positive rate is acceptable, and also that we can still detect true positives.

cmacknz commented 8 months ago

I've split the implementation here into pieces, from easiest (in the sense that we have already implemented in a PR) to more challenging. The goal is to get checks we think we can stabilize onto main sooner.

cmacknz commented 7 months ago

Something else to follow up on, we can likely do targeted versions of these tests in individual unit tests. For goroutines there is https://github.com/uber-go/goleak that could be used off the shelf, but I bet we could write a similar detection package for OS handle leaks that can be used directly in unit tests.
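
For reference, wiring goleak into a package's unit tests looks roughly like this (package and test names are placeholders):

```go
package somepackage_test

import (
	"testing"

	"go.uber.org/goleak"
)

// TestMain fails every test in the package if goroutines are still running
// once the tests finish.
func TestMain(m *testing.M) {
	goleak.VerifyTestMain(m)
}

// Alternatively, individual tests can opt in.
func TestDoesNotLeakGoroutines(t *testing.T) {
	defer goleak.VerifyNone(t)
	// ... exercise the code under test ...
}
```

A handle-leak equivalent could expose the same shape of API: snapshot the handle count at test start and verify it at the end.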

fearful-symmetry commented 7 months ago

@cmacknz yeah, I've been tinkering with ideas for some kind of handle leak detector that can be run on the unit-test level. I would argue it's sort of necessary, since there's no way we could test everything in the existing integration test. I'll bring it up at the next team meeting

pierrehilbert commented 6 months ago

After discussing with @fearful-symmetry, https://github.com/elastic/elastic-agent/issues/4208 isn't really part of this meta's scope, so I will close this one. We will continue directly on https://github.com/elastic/elastic-agent/issues/4208 for this last piece.