elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

Reduce the amount the agent logs by default #4252

Open strawgate opened 7 months ago

strawgate commented 7 months ago

Describe the enhancement:

We'd like an audit of the currently logged messages to make sure that the information at the default INFO log level is useful and valuable when shipped to Fleet.

Describe a specific use case for the enhancement or feature:

10-15% of the data produced by Agent is internal logs and metrics. We'd like to reduce the volume of Agent internal monitoring data, as excessive production of this data consumes significant disk space within customer environments.

What is the definition of done?

We have measurably reduced the amount of logs the agent generates by default. Ideally we will have reduced agent monitoring log volume by 80% in normal usage and have run out of obvious ideas or superfluous messages to remove.

elasticmachine commented 7 months ago

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

cmacknz commented 6 months ago

I worry about doing this until https://github.com/elastic/kibana/issues/158861 is implemented in case it turns out we forgot to convert some critical information to the warn level and it is needed for debugging. I don't want users with support cases having to manually revert to the info level on an agent by agent basis.

There are definitely a lot of low-value logs at the info level today, so what we could do is identify those and improve/remove/convert them to debug. I think the spirit of this issue is to reduce the cost of storing agent logs; we can likely make significant progress on that without having to actually change the default log level.
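For reference, the knobs that already exist in a standalone elastic-agent.yml look roughly like this (a sketch with illustrative values; Fleet-managed agents get the equivalent monitoring settings from their policy):

```yaml
# elastic-agent.yml (standalone) -- values illustrative, not a recommendation
agent.logging.level: info   # debug | info | warning | error
agent.monitoring:
  enabled: true
  logs: true      # collect and ship the agent's own logs to the monitoring output
  metrics: true   # collect and ship the agent's own metrics to the monitoring output
```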

pierrehilbert commented 6 months ago

Thx for raising this @cmacknz. I agree here and set this one as blocked for now.

nimarezainia commented 6 months ago

@pierrehilbert We still have time to get all this in before FF. I think it's important for our users to have these changes. I will work with the Fleet team to make the UI changes that are required. But I suggest this work is not treated as blocked and we move forward. This is not a big change (code-wise, though not in impact).

@cmacknz We have the policy override API available. Could we not use that in case there's an issue?

pierrehilbert commented 6 months ago

This was not part of the current sprint and the FF is on Tuesday. As you mentioned, this change is not a big one from a code perspective, so let's continue to consider it blocked, and if Fleet is able to have everything done before FF, we will rediscuss this.

cmacknz commented 6 months ago

> @cmacknz We have the policy override API available. Could we not use that in case there's an issue?

No I don't think agent respects log level changes from the policy right now.

I also think this issue is prescribing a solution instead of stating the problem. I think we want to send fewer logs by default at all levels. Changing the default level to warn is just a quick way to reduce log volume, but it does so by also removing critical information.

There is information at the info level that we constantly use for debugging and that makes no sense as a warning message; for example, the 30s output statistic summaries would be lost, making performance problems impossible to see in diagnostics. We could put these at the warning level, but they aren't really warnings.

amitkanfer commented 6 months ago

> There is information at the info level that we constantly use for debugging and that makes no sense as a warning message; for example, the 30s output statistic summaries would be lost, making performance problems impossible to see in diagnostics. We could put these at the warning level, but they aren't really warnings.

A possible solution for this is to introduce an ALWAYS log level, something I did successfully in the past. This level will log all those messages we always want to log (similar to ERROR, I believe, in terms of severity), but they are not errors nor warnings. We'll need to spend some time identifying such logs, of course, but this will unblock us from changing the default to WARN.

nimarezainia commented 6 months ago

> This was not part of the current sprint and the FF is on Tuesday. As you mentioned, this change is not a big one from a code perspective, so let's continue to consider it blocked, and if Fleet is able to have everything done before FF, we will rediscuss this.

@pierrehilbert the Fleet issue is scheduled for sp27. It is dependent on https://github.com/elastic/elastic-agent/issues/2851, so we have a circular situation. #2851 is indeed in sp27, so I'm asking for #2851 to be completed.

@cmacknz The Fleet issue and #2851 will set the default to Warn for every policy, which will address the concern. We can use this issue to work on readjusting which category various logs belong to, which is a longer process.

cmacknz commented 6 months ago

> @cmacknz https://github.com/elastic/kibana/issues/158861 and https://github.com/elastic/elastic-agent/issues/2851 will set the default to Warn for every policy.

This issue has to be done first, before we can change the default level to warn. We can add the ability to change the log level on a per-policy basis, and customers that care can opt in to that, but I don't want to change the default until we are sure the default has the information we want.

If we change the default log level to warn without adjusting what is logged at the warn level, then every single support case will have the additional step "adjust the log level back to info for the affected agents".

This adds friction and will increase how long it takes to resolve support escalations. I don't think this is a good thing.

nimarezainia commented 6 months ago

Thanks @cmacknz. Then I suggest we arrange the deck chairs, since the other two issues are currently slated for completion in sp27. This issue shouldn't have been marked as blocked by https://github.com/elastic/kibana/issues/158861 - I understand why it was done, as we need a way to quickly go back to the old default value. But either way, as you said, this issue has to be completed first.

cmacknz commented 6 months ago

Yes, I retitled this to "reduce the amount the agent logs" since I don't think it matters what the log level actually is. The point is we can log less than we currently do by default and we need to investigate how to fix that.

Whether we are at info or warn doesn't matter; if we can get info to remain useful and log only as much as warn does today, we don't need to change the default log level at all.

strawgate commented 5 months ago

> the 30s output statistic summaries would be lost, making performance problems impossible to see in diagnostics

could the 30s output statistics messages just be captured by internal agent metricbeat monitoring? or why is it a log line?

cmacknz commented 5 months ago

> or why is it a log line?

So that it can be in the diagnostics captured locally from the agent. Often that is all we get when debugging user issues, and we need the data these summaries include; specifically, we often need the history they provide over time.

There are other ways to solve the problem the 30s metric summaries in the logs are solving, but we can't just get rid of them without some replacement for the data they provide.

We could stop shipping those metrics summaries to Fleet, though, by adding a drop processor in the monitoring Filebeat.
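For illustration, the processor the agent could inject into its generated monitoring Filebeat configuration might look something like this (a sketch only; the matched text is an assumption and would need to be verified against the real message):

```yaml
# Sketch: drop the periodic metrics summaries before they are shipped to Fleet.
processors:
  - drop_event:
      when:
        contains:
          message: "Non-zero metrics in the last"   # assumed match string, verify against actual log line
```

That would keep the summaries in the local log files (and therefore in diagnostics) while keeping them out of the indexed monitoring logs.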

nimarezainia commented 5 months ago

As per discussions, moving this issue to sprint 28 so it can be done in the 8.14 time frame.

nimarezainia commented 5 months ago

@pchila thanks for working on the PR. Would it be possible to get a final benchmark on this and the metricset changes, to see what the percentage reduction is altogether for default use cases? Thank you.

pierrehilbert commented 5 months ago

@nimarezainia, @pchila asked the ES team for some advice on doing this kind of benchmark. For now, only the metricset change has been merged, so you won't get more than that one.

cmacknz commented 5 months ago

The data stream stats API can tell us the size the logs data streams currently take on disk: https://www.elastic.co/guide/en/elasticsearch/reference/current/data-stream-stats-api.html

The main problem will be in creating properly comparable data sets. We need to make sure we cover the same time range for the same operations to avoid doing size calculations where one of the datasets naturally contains more logs because it did more work, or covers a longer time period.

I think we almost want something like Filebeat's --once, or an --until command line parameter for the agent, that runs for exactly some amount of time and then stops. That might get us close enough; there is always the chance there are a couple of extra log lines from one of the agents, but if we run the test for long enough that hopefully won't have a large impact.

We could take a similar approach and run two agents on two Ubuntu VMs started at approximately the same time, pointed at two different ES clusters, then compare the data stream sizes after a period of time.

A third way would be to have the agent log to a file for some well-defined period of time, then truncate so the final timestamps align, and ingest the results into ES separately to compare the sizes.
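For the third approach, the file logging settings in elastic-agent.yml would look roughly like this (a sketch; key names and values are assumed from the libbeat-style logging configuration the agent uses, so treat them as illustrative):

```yaml
# Sketch: write agent logs to a dedicated local file for a fixed-duration run,
# then truncate and ingest the files into ES separately for comparison.
agent.logging:
  level: info
  to_files: true
  files:
    path: /var/log/elastic-agent-bench   # illustrative path
    name: elastic-agent-bench            # illustrative base file name
    rotateeverybytes: 104857600          # 100 MiB
    keepfiles: 7
```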

pchila commented 5 months ago

@cmacknz I am actually working on a fourth way (I call this the "salami slice index" method) :

Edit: forgot to mention, I am preparing a small script for setting up the templates, indices, etc. (if useful, I could commit it to the elastic-agent repo). While testing the method I noticed that removing the periodic Non-zero metrics... message reduces the index size by ~9.2% on a short agent run (~1 hour).

strawgate commented 5 months ago

Can we test enabling synthetic source in the index_template? That may be a ~50% win.

pchila commented 5 months ago

I redid my measurement of what we gain by removing the Non-zero metrics... log; see the related PR description: https://github.com/elastic/elastic-agent/pull/4633#issue-2269233560

elasticmachine commented 5 months ago

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

pchila commented 5 months ago

> Can we test enabling synthetic source in the index_template? That may be a ~50% win.

Did a quick check on synthetic source (according to the documentation it's GA only on time_series indices), and the elastic-agent metrics indices in 8.15.0 already have this setting (see the attached screenshots).

so not much to gain there, unfortunately

pchila commented 5 months ago

@strawgate #2851 has a PR open and it should be merged during the current sprint...

I was wondering if the definition of done for this issue is still the one quoted below

> What is the definition of done?
>
> Agent logs at Warn level by default or switch to warn level shortly after Agent startup.
>
> Start agent, notice no more Info level messages and an overall reduction in log volume.

According to @nimarezainia's comment on the Kibana issue https://github.com/elastic/kibana/issues/180778#issuecomment-2101867753, I think we want the initial log level to be INFO? Or do we still want to have warning as the initial log level?

Furthermore, since the last 8.14.0 BC has already started, I am not sure we can cram more code modifications in there, so this will likely be implemented in a future release.

strawgate commented 5 months ago

The definition of done is probably something like: reduce agent monitoring log volume by 80% in normal usage.

Initially, we thought this would be done by switching the default log level, as doing that is easily reversible by customers. Later, the decision was made to just trim the logs present at the Info level.

cmacknz commented 5 months ago

Yes, I just adjusted the description and definition of done.

We can still target changes to 8.14.1 by backporting them after the 8.14.0 release.

ycombinator commented 4 months ago

Next step, as discussed in today's team meeting:

ycombinator commented 4 months ago

Set up Agent with a typical/common scenario: system + defend + osquery integrations + Agent logs and metrics monitoring enabled.

@strawgate Are you good with using this particular setup/profile from here on out for measuring any improvements from reducing the amount of Agent logs?

strawgate commented 4 months ago

For logs-elastic_agent* we should target reduction, in this order:

  1. logs-elastic_agent.filebeat-*
  2. logs-elastic_agent.metricbeat-*
  3. logs-elastic_agent-*

We can assume all other logs are a rounding error.

For a sense of scale, we should assume:

We should pick the framework that can best allow us to measure total volume and then reductions in the above data streams.

ycombinator commented 4 months ago

Followed up with Bill off-issue re: my question about coming up with a standard Agent profile we should use for benchmarking log reduction improvements. Bringing our conversation back into this issue:

> @ycombinator: Are you good with using this particular setup/profile from here on out for measuring any improvements from reducing the amount of Agent logs?
>
> @strawgate: I don't think we need defend or osquery, and I think we should find whatever integrations leverage filebeat and metricbeat and enable as many as we can reasonably enable.

@cmacknz IIRC, you'd suggested in last week's team meeting that we enable defend and osquery. Could you and @strawgate work out whether we should or should not enable these as part of a standard profile for benchmarking Agent? Similarly, we need a list of the specific integrations that leverage Filebeat and Metricbeat, because "as many as we can" could be pretty much every integration, so we need the specific narrowing criteria implied by "reasonably enable". We will resume work on this issue once we have this specific Agent profile defined.

cmacknz commented 4 months ago

Installing the system integration is all you need to get a single instance of Filebeat and Metricbeat running. System is also great because the data source is the machine the agent is running on.

Running more integrations would be great but we need them to do real work to find out if they log too much, and that is much more complicated than system.

Let's just start with a basic baseline of Elastic Agent with the System integration installed to keep things as simple as possible. Most of the improvements we have brainstormed so far are testable with just that.
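For reference, the standalone equivalent of that baseline is roughly the following (structure as in the reference elastic-agent.yml; the ids and the exact set of datasets are illustrative, not a prescribed profile):

```yaml
# Sketch of a minimal baseline: System integration only
# (one Metricbeat instance for system metrics, one Filebeat instance for system logs)
inputs:
  - type: system/metrics
    id: baseline-system-metrics        # illustrative id
    data_stream.namespace: default
    use_output: default
    streams:
      - metricset: cpu
        data_stream.dataset: system.cpu
      - metricset: memory
        data_stream.dataset: system.memory
  - type: logfile
    id: baseline-system-logs           # illustrative id
    data_stream.namespace: default
    use_output: default
    streams:
      - data_stream.dataset: system.syslog
        paths:
          - /var/log/syslog
```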

strawgate commented 4 months ago

> reasonably enable

Meant without really doing anything :)

I'll reach out to see if we have any customers that might be willing to run log data analysis and provide us a result like:

[screenshot: example log data analysis showing agent log volume broken down by message]

I'm also wondering if we might want to set up some integrations in blank states (nothing configured) as a one-off, to see if there are any producing large volumes of errors (like the nginx "file has no content" error in the screenshot).