elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

Support the Beats disk queue in Elastic Agent #3490

Open cmacknz opened 11 months ago

cmacknz commented 11 months ago

Beats support a disk queue that has been GA for some time, but it cannot be used with Elastic Agent. Part of the reason is that Elastic Agent does not allow configuring the queue at all, but this will change after https://github.com/elastic/beats/pull/36693 is merged.

Those changes would allow a user to enable the Beats disk queue, which, with no other changes, would instruct each Beat to create a disk queue in the same directory. That is, the disk queue is not shared between processes: there is one disk queue per process, and the per-process disk queues would conflict by attempting to use the same files in the same directory.

For the disk queue to work properly when running under the Elastic Agent without a dedicated shipper process we need to orchestrate the queue directories correctly in the agent itself. Specifically we need to:

  1. Create a dedicated directory in the agent installation path for the disk queue files. The natural choice for the disk queue location would be the per-component run directory in the versioned data path, but this would require the entire queue to be copied on upgrade. I think we should avoid this because the disk queue can be large (100+ MB depending on configuration and usage), and instead create a dedicated directory outside of the versioned data path that is shared between versions of the Elastic Agent. We will likely need a file lock in the directory to ensure only one version can read from this directory at a time.

  2. In the dedicated queue directory, provision a unique disk queue sub-directory for each component since queues cannot be shared between processes. The disk queue for a component should be removed when the component is removed from the agent policy.

  3. Allow the user to configure the dedicated disk queue directory. Users may want the disk queue to reside on a dedicated volume, which will be particularly important when the Elastic Agent is running on Kubernetes and the user wishes for the disk queues to be stored on a persistent volume claim.

We will also need to performance test the Elastic Agent running with the disk queue, and compare it to the Elastic Agent without the disk queue. The disk queue has a performance penalty because events must be serialized before being written to disk. We should quantify what this penalty is, particularly when the Elastic Agent is supervising multiple Beats each with their own disk queue.

The final caveat to this implementation is that the disk queue will only be supported for inputs which are based on Beats. We should add the ability for agent specification files to declare whether they support the disk queue configuration. The one special case to consider is endpoint-security which always uses a disk queue that is different from the one implemented in Beats. We will need to make this obvious to users.

elasticmachine commented 11 months ago

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

blakerouse commented 11 months ago

Elastic Agent spawns each component with its own consistent work directory. Why can't the disk queue just write to the work directory that is given to the process when it is started by the Elastic Agent?

cmacknz commented 11 months ago

> Why can't the disk queue just write to the work directory that is given to the process when it is started by the Elastic Agent?

That works as long as we don't have to copy the queue on upgrade, for the reasons mentioned in https://github.com/elastic/beats/issues/35615#issuecomment-1745151235

blakerouse commented 11 months ago

> > Why can't the disk queue just write to the work directory that is given to the process when it is started by the Elastic Agent?
>
> That works as long as we don't have to copy the queue on upgrade, for the reasons mentioned in elastic/beats#35615 (comment)

It is placed in the run directory which is copied.

https://github.com/elastic/elastic-agent/blob/main/internal/pkg/agent/application/upgrade/upgrade.go#L166

mbudge commented 11 months ago

Disk queues reduce the risk of data loss, but at the same time I can see disk-queues hammering some of our busy production servers.

Use in-memory queue and fall back to disk queue when there is a network issue.

rdrgporto commented 2 months ago

Hi,

Any updates on this topic? I think it would be a good feature for Elastic Agent in case of network outages that could last for hours.

Regards

nimarezainia commented 2 months ago

> Hi,
>
> Any updates on this topic? I think it would be a good feature for Elastic Agent in case of network outages that could last for hours.
>
> Regards

Hi, yes, this is in our plans, and we realize that it is an important feature to have. Unfortunately, it is prioritized behind other features on our backlog.

mbudge commented 2 months ago

The best way to do this is to use the in-memory queue and fall back to the disk queue when there's a network issue.

This reduces performance impact on production servers with high log throughput.

itay-ct commented 1 month ago

+1. A customer asked about this. We would be happy to know when this is planned, or to get any workarounds that we can provide (not sure whether "use in memory and fall back to disk" is something a customer can implement).