elastic / elastic-agent-shipper

Data shipper for the Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

[Meta][Feature] Enable filebeat and metricbeat to publish data to the shipper #8

Closed by cmacknz 2 years ago

cmacknz commented 2 years ago

This is a feature meta issue to allow filebeat and metricbeat to publish data to the shipper when run under Elastic agent. All other beats are out of scope.

An output for existing beats should be implemented that publishes to the shipper gRPC interface. When the shipper gRPC output is used, the beat output pipeline should be configured to be as simple as possible. A per-beat disk queue must not be used with the shipper. A memory queue may be used with the shipper output, but how users should configure it will require careful consideration. Ideally, any necessary queue configuration can be made automatic.

Removing processors from beats is out of scope for this issue. Processors will be removed in a later issue.

[image: diagram of the beat pipeline connected to the shipper over gRPC, with a memory queue on both sides and processors remaining on the beat side]

This feature is considered complete when at least the following criteria are satisfied for both filebeat and metricbeat:

The assignee of this issue is expected to create the development plan with all child issues for this feature. The following set of tasks should be included in the initial issues at a minimum:

UPD by @rdner

I split this into the following steps:

rdner commented 2 years ago

Creating a beats output that publishes to the shipper gRPC interface.

@cmacknz I'm a bit confused about this sentence.

When we talked 1 on 1, we agreed that during the very first iteration the gRPC server will be one of the output options along with Elasticsearch, File output, Kafka, etc.

Later, at the team call, I asked the same question to widen the discussion circle, but you answered something different about having a feature flag and switching some logic in the code.

I think we have some miscommunication about this.

I see 2 options for how to approach this task:

Option 1

We add it as a new experimental output type, which we could configure like this:

output:
  shipper:
    server: "localhost:50051" # The server address in the format of host:port
    tls: true # Connection uses TLS if true, else plain TCP
    ca_file: "/home/cert" # The file containing the CA root cert file
    server_host_override: "x.test.example.com" # The server name used to verify the hostname returned by the TLS handshake
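
For illustration, here is a rough Go sketch of how the new output could turn these settings into a gRPC connection. It assumes the standard google.golang.org/grpc packages and the struct-tag convention used by go-ucfg in beats; the package layout and names are illustrative, not the final implementation.

package shipper

import (
    "fmt"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials"
    "google.golang.org/grpc/credentials/insecure"
)

// Config mirrors the YAML options shown above.
type Config struct {
    Server             string `config:"server"`
    TLS                bool   `config:"tls"`
    CAFile             string `config:"ca_file"`
    ServerHostOverride string `config:"server_host_override"`
}

// dial opens the gRPC connection to the shipper described by the config.
func dial(c Config) (*grpc.ClientConn, error) {
    var creds credentials.TransportCredentials
    if c.TLS {
        var err error
        creds, err = credentials.NewClientTLSFromFile(c.CAFile, c.ServerHostOverride)
        if err != nil {
            return nil, fmt.Errorf("loading CA file %q: %w", c.CAFile, err)
        }
    } else {
        creds = insecure.NewCredentials()
    }
    return grpc.Dial(c.Server, grpc.WithTransportCredentials(creds))
}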

This can be achieved with the following steps:

  1. We create a new package shipper here: https://github.com/elastic/beats/tree/main/libbeat/outputs (or perhaps in elastic-agent-libs)
  2. We implement the Client interface (see the sketch after this list).
  3. We implement the new shipper output type factory.
  4. We use the existing pipeline without any changes.
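
To make steps 2 and 3 more concrete, here is a minimal sketch of what the Client implementation and its registration could look like. It assumes libbeat's outputs.Client interface (Publish/Close/String) and publisher.Batch semantics; the generated shipper client (pb.ProducerClient with a PublishEvents RPC), the toShipperEvents helper, and the makeShipper factory are hypothetical placeholders, and the exact signatures should be copied from the existing outputs in libbeat/outputs.

package shipper

import (
    "context"

    "github.com/elastic/beats/v7/libbeat/outputs"
    "github.com/elastic/beats/v7/libbeat/publisher"

    "google.golang.org/grpc"
    // pb "..." // placeholder import for the generated shipper gRPC client package
)

// shipperClient forwards published batches to the shipper over gRPC.
type shipperClient struct {
    conn   *grpc.ClientConn
    client pb.ProducerClient // hypothetical generated client for the shipper service
}

// Publish converts the batch to shipper events and sends them. On failure the
// batch is handed back to the queue for retry; otherwise it is acknowledged.
func (c *shipperClient) Publish(ctx context.Context, batch publisher.Batch) error {
    events := batch.Events()
    if _, err := c.client.PublishEvents(ctx, toShipperEvents(events)); err != nil {
        batch.Retry()
        return err
    }
    batch.ACK()
    return nil
}

func (c *shipperClient) Close() error   { return c.conn.Close() }
func (c *shipperClient) String() string { return "shipper" }

func init() {
    // Register the new output type so that `output.shipper` can be configured.
    // makeShipper would parse Config, dial the shipper, and wrap shipperClient
    // into an outputs.Group, following the pattern of the existing outputs.
    outputs.RegisterType("shipper", makeShipper)
}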

In this case, changes to the existing code are minimal or non-existent, and we can start working with the new setup, debugging, and running tests. The new output type can be excluded from the documentation if needed. Later, we can replace the whole pipeline implementation once we feel the shipper is ready.

Option 2

We add a feature flag that switches the pipeline to a separate implementation, which sends events to the shipper instead of the configured outputs.

This would require the following:

  1. Refactor the current pipeline implementation so that it's an interface that can have 2 different implementations instead of a struct
  2. Add support for a new configuration section at the root level where we can configure the shipper, e.g.:
    shipper:
      server: "localhost:50051" # The server address in the format of host:port
      tls: true # Connection uses TLS if true, else plain TCP
      ca_file: "/home/cert" # The file containing the CA root cert file
      server_host_override: "x.test.example.com" # The server name used to verify the hostname returned by the TLS handshake
  3. If the configuration section exists, the pipeline implementation is switched to the ShipperPipeline and the beat's output configuration is ignored

The major drawback here is that it would take more time and require a lot of changes to the existing code, which can affect stability, instead of just adding new code. On the other hand, we would need to do that at some point anyway.
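
For comparison, a very rough sketch of how the switch in option 2 could look. All names here are illustrative; the real types live in libbeat/beat and libbeat/publisher/pipeline, and newShipperPipeline / newStandardPipeline are hypothetical constructors.

package pipeline

import (
    "github.com/elastic/beats/v7/libbeat/beat"
    "github.com/elastic/elastic-agent-libs/config"
)

// newPipeline picks the pipeline implementation based on whether a top-level
// `shipper` section is present in the configuration.
func newPipeline(cfg *config.C) (beat.Pipeline, error) {
    if cfg.HasField("shipper") {
        // Hypothetical gRPC-backed pipeline that forwards events straight to
        // the shipper instead of the local queue and configured outputs.
        return newShipperPipeline(cfg)
    }
    // Existing implementation: queue plus the configured outputs.
    return newStandardPipeline(cfg)
}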

cmacknz commented 2 years ago

I recommend option 1 as it will be simpler to implement and maintain in the long term. It follows the model currently used by Elastic agent to configure outputs for beats.

ph commented 2 years ago

I also prefer option 1, so we don't have a special case or transformation to do.

faec commented 2 years ago

I'm not sure how option 1 fits with the other pending pieces. I think perhaps there's been some confusion with the "output" language that is being used for two different stages of processing: (1) sending data from the input to the processor / shipper before it enters the queue, and (2) sending final event data from the shipper to the upstream target (elasticsearch, logstash etc) after it exits the queue.

So I'm not sure how option 1 would fit right now -- the Client interface is the final link of the Beats pipeline that hands off to the upstream, so if we connect this output there, then events would go through the whole current pipeline (including processors and the memory queue) before being sent to the shipper, which is also supposed to handle the memory queue. So to me, option 2 makes more sense, since it diverts to the shipper before hitting the queue.

I wonder if the confusion about approaches comes from the use of "output" to refer to both of those components? Because option 1 sounds to me like a reasonable sketch of the output of the shipper, but as I understand it in the first pass we're just handling that with a placeholder raw-file output.

cmacknz commented 2 years ago

Yes, the language isn't precise enough, and it doesn't help that the beat pipeline and the shipper will have overlapping functionality.

My view is that the development needs to be an iterative process where we start with some duplication between the beat and shipper just to get them connected to each other, and then slowly migrate functionality from the beat side into the shipper when run under agent.

I think initially we start with option 1, where we just make it possible for a beat to communicate with the shipper over gRPC. Both the beat and the shipper at this stage have a memory queue, and the processors only exist on the beat side. This is what the diagram in the issue description is trying to show :)

Once we have that, we next work on trying to remove the queuing from the beat side, followed by processing. At this point we may need to consider something like option 2 to try to strip down what the beat/input needs to run.

I like starting with Denis' option 1 to get a faster end-to-end prototype. Once we have that and can test the interaction between the beats and the shipper, we will likely need to consider something like option 2. I think we'll be better positioned to make design adjustments after we have a quick prototype than if we pursue larger changes from the beginning. I could be convinced otherwise though.

faec commented 2 years ago

Ah ok, so the redundancy in the memory queue is an intentional temporary workaround? In that case fair enough, let's continue :-)

kvch commented 2 years ago

Does adding a feature flag make sense in beats? It is basically just a setting that enables or disables features. How is that different from setting output.elasticsearch instead of output.shipper (by Agent) if we want to fall back to the old way of sending events?

rdner commented 2 years ago

I've updated the description and added a checklist for tracking the progress.

One thing that is not 100% clear to me is the input and data stream options. I could not find a simple way to propagate these parameters through the event batches, so I'm going to address this in a separate issue after the initial implementation is there; this way it doesn't block any experiments with the new shipper architecture.

The same goes for the integration tests; they will be implemented separately.

cmacknz commented 2 years ago

Thanks! I have a separate issue already for returning acknowledgements from the shipper: https://github.com/elastic/elastic-agent-shipper/issues/9. I expected that would be too much work to fold into this issue.

The input and data stream will have to be propagated from the agent policy, which we may not do yet. We may not need the data stream until we implement processors in the shipper, at which point we'll need a way to apply the correct processors to events based on the input and data stream.

cmacknz commented 2 years ago

Added https://github.com/elastic/elastic-agent-shipper/issues/34 as part of this work.

cmacknz commented 2 years ago

All tasks complete, closing.