elastic / logstash-filter-elastic_integration

The Elastic Integrations filter for Logstash, which enables running Elastic Integrations inside of Logstash pipelines

Allow logstash to output to any ES #75

Open ThorbenJ opened 1 year ago

ThorbenJ commented 1 year ago

Adding as per: https://elastic.slack.com/archives/C05B0S4BANN/p1688658625436089

Currently we assume Logstash will output to the same ES that is managing Fleet/Agent. In that case the data streams (templates) are already installed/initialised.

However, there are scenarios where this would not be the case:

1. A CCS architecture with a central Fleet manager cluster (as one might see at MSSPs)
2. Sending the data to multiple ES clusters, as might be the case for an MSSP customer

It would be great if this plugin could also grab data stream definitions such as templates, which could then be used by the elasticsearch output plugin.

I am told this would require the output plugin to track data streams already seen, and for new data streams to do a template check when first seen instead of at startup.
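A minimal sketch of that first-seen idea, with hypothetical names rather than the output plugin's actual API: a worker-safe tracker remembers which data streams it has already handled and invokes a template-install callback only the first time each stream appears.

```ruby
require 'set'

# Hypothetical sketch: install the index template for a data stream the
# first time that stream is seen, instead of once at startup. The class
# and method names are illustrative, not the plugin's real API.
class DataStreamTemplateTracker
  def initialize(&template_installer)
    @seen = Set.new
    @mutex = Mutex.new            # guard the set across pipeline workers
    @installer = template_installer
  end

  # Call before indexing an event; triggers the installer once per stream.
  def ensure_template!(data_stream)
    first_time = @mutex.synchronize { @seen.add?(data_stream) } # nil if already seen
    @installer.call(data_stream) if first_time
  end
end

installed = []
tracker = DataStreamTemplateTracker.new { |ds| installed << ds }
streams = %w[logs-nginx.access-default logs-nginx.access-default metrics-system.cpu-default]
streams.each { |ds| tracker.ensure_template!(ds) }
installed  # => ["logs-nginx.access-default", "metrics-system.cpu-default"]
```

`Set#add?` returns `nil` when the element is already present, which makes the first-seen check a single atomic operation under the mutex.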

yaauie commented 9 months ago

While I understand the high-level ask here, there are some pretty enormous architectural challenges that stand in the way of making this a reality, especially from this plugin's perspective.

I am told this would require the output plugin to track data streams already seen, and for new data streams to do a template check when first seen instead of at startup.

This is likely the largest, for several reasons.

Even if we were able to efficiently embed index templates (composable or legacy) inside events (such as referencing shared immutable objects, which would be a challenge to solve in Logstash core), the ES output would need to inspect each batch for a delta from its own state and push that delta to Elasticsearch before pushing the batch of events. This would introduce shared mutable state to the ES output plugin (what happens if two batches, running concurrently in separate workers, include events with templates that conflict or overlap?), which would likely require locking or synchronization that decimates throughput.

I believe that if we back up to those two use-cases above, tooling to keep the integration-powered templates in-sync from a primary to one or more secondaries would be an easier lift (but that is far outside the scope of this plugin).
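The out-of-band syncing described above could be sketched roughly as follows, using Elasticsearch's standard composable index template APIs (`GET /_index_template/<name>` on the primary, `PUT /_index_template/<name>` on each secondary). The `primary_get`/`secondary_puts` callables stand in for a real HTTP client; everything here is an illustrative assumption, not existing tooling.

```ruby
require 'json'

# Hedged sketch: copy integration-managed composable templates from a
# primary cluster to one or more secondaries. The GET response wraps the
# template body under index_templates[0].index_template, which is the
# shape the PUT endpoint expects back.
def sync_templates(names, primary_get:, secondary_puts:)
  names.each do |name|
    body = primary_get.call("/_index_template/#{name}")
    template = JSON.parse(body).fetch('index_templates').first.fetch('index_template')
    secondary_puts.each { |put| put.call("/_index_template/#{name}", JSON.generate(template)) }
  end
end

# Stubbed usage: one template on the primary, two secondary clusters.
primary = lambda do |_path|
  JSON.generate('index_templates' => [
    { 'name' => 'logs-nginx.access',
      'index_template' => { 'index_patterns' => ['logs-nginx.access-*'] } }
  ])
end
received = []
secondaries = Array.new(2) { ->(path, body) { received << [path, body] } }
sync_templates(['logs-nginx.access'], primary_get: primary, secondary_puts: secondaries)
received.size  # => 2
```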

Would you be willing to file an ER so this can be addressed at a wider scope, so we can close this plugin-targeted issue?

ThorbenJ commented 9 months ago

Could we not reduce the scope from the fully general case to just specific cases?

e.g. only for datasets covered by the EPR (https://github.com/elastic/package-registry): per worker, note whether a dataset (`event.dataset`) has been seen before; if not, fetch its definition from the EPR and push it. All workers would use the same EPR, and each would only do this once per dataset for the lifetime of a Logstash instance. True, if there are 6 workers it will likely push the templates 6 times, but does that matter?
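The per-worker variant proposed here could be sketched like this (illustrative names, not real plugin code): each worker keeps its own seen-set with no shared state or locking, so with N workers a template may be fetched and pushed up to N times, which is harmless as long as the template PUT is idempotent.

```ruby
require 'set'

# Hypothetical sketch of the per-worker approach: worker-local seen-sets,
# no Mutex and no cross-worker coordination. Duplicate pushes across
# workers are accepted as a cost of avoiding shared mutable state.
class PerWorkerDatasetCache
  def initialize(&fetch_and_push)  # callable hitting the EPR, then ES
    @seen = Set.new                # local to this worker
    @fetch_and_push = fetch_and_push
  end

  def on_event(dataset)
    @fetch_and_push.call(dataset) if @seen.add?(dataset)
  end
end

pushes = Hash.new(0)
workers = Array.new(2) { PerWorkerDatasetCache.new { |ds| pushes[ds] += 1 } }
events = %w[nginx.access nginx.access system.cpu]
workers.each { |w| events.each { |ds| w.on_event(ds) } }
pushes  # => {"nginx.access"=>2, "system.cpu"=>2}  (once per worker, not per event)
```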

I agree this could be a feature entirely independent of this specific plugin, perhaps we can move this issue to the data streams / elasticsearch output plugin.