
[Enhancement] Scaling options for each integration/data stream #11195

Open lucabelluccini opened 2 months ago

lucabelluccini commented 2 months ago

Problem

Users are interested in knowing the scaling model of integrations / data streams. Examples:

Possible proposal (mitigation)

We manually specify the scaling model of the integration in the manifest. We expose the scaling model in the docs and, if possible, in the Fleet Integrations UI.
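For illustration, such a declaration could live in an integration's `manifest.yml` (a hypothetical sketch; `scaling_model` is not a field in the current package-spec, and the names and values here are illustrative only):

```yaml
# Hypothetical addition to an integration's manifest.yml.
# The scaling_model field does not exist in the current package-spec;
# field names and values are illustrative only.
format_version: 3.0.0
name: aws
title: AWS
scaling_model:
  horizontal: conditional   # e.g. safe only for SQS-based collection
  vertical: true            # throughput can be tuned per Agent
  notes: "See the integration docs for per-data-stream details."
```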

Possible proposal (long term)

Each input should have a metadata/spec where it claims its scaling model.

The integration's package manifest would then check the inputs to automatically derive the scaling model of the integration / data stream.


Example:
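A hypothetical sketch of what such an input-level claim might look like (the file layout and field names are illustrative, not part of any existing spec):

```yaml
# Illustrative metadata kept alongside the input's source code.
# A package build step could read this to derive the scaling model
# of each integration / data stream automatically.
input: aws-s3
scaling:
  vertical: true
  horizontal: conditional
  conditions:
    - when: queue_url is set        # SQS notification mode
      horizontal: true
    - when: bucket_arn is set       # S3 polling mode
      horizontal: false
```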

FYI @lalit-satapathy @jsoriano @zmoog

kpollich commented 1 month ago

It seems like the scaling model is generally tied to the inputs (Beats) in use by a given integration, and not so much the integration itself.

For instance, the S3 input is mentioned as not being horizontally scalable when the `queue_url` parameter is provided. Could we make an assumption that any input of type `aws/s3` with a non-null `queue_url` variable should result in a warning about the scalability model being displayed?
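For reference, such a check would presumably key on the input type and variables of the package policy, e.g. a fragment shaped roughly like this (structure and values illustrative):

```yaml
# Illustrative package policy fragment that a Fleet-side check could inspect.
inputs:
  - type: aws-s3
    enabled: true
    streams:
      - data_stream:
          dataset: aws.s3access
        vars:
          # A non-null queue_url is the signal the warning would key on.
          queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"
```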

I'm sure there are other cases of this with "pull" based inputs like httpjson, but it's not necessarily my area of expertise.

> We manually specify the scaling model of the integration in the manifest. We expose the scaling model in the docs and, if possible, in the Fleet Integrations UI.

While I think this makes sense as a path forward, getting broad adoption across integrations to source data like this is usually a long-lived challenge. Each integration maintainer would have to provide the information and publish a new version of their integration (including an updated `format_version` to pick up the new package-spec fields), and users would then have to upgrade to that new version before the scalability data could be presented to them.

This seems like a lot of churn, so I wonder if we should consider a less involved approach, such as detecting these scalability concerns directly in the Fleet codebase based on a "hardcoded" mapping of metadata about specific input types or variables.
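A hypothetical shape for that hardcoded mapping (names and structure are illustrative only, not an existing Fleet data structure):

```yaml
# Illustrative mapping Fleet could ship with, keyed by input type.
# A notice is shown when a package policy matches an entry.
scalability_notices:
  - input_type: aws-s3
    when_var_set: queue_url
    message: "Review this input's scaling model before adding Agents."
  - input_type: httpjson
    message: "Pull-based input; running multiple Agents may duplicate events."
```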

cc @nimarezainia I am going to assign this to you as it's in "Needs PM Prio" as well.

andrewkroh commented 1 month ago

> It seems like the scaling model is generally tied to the inputs (Beats) in use by a given integration, and not so much the integration itself.

I agree. It would be simplest if we could identify the scaling model based solely on the input (without other caveats or special cases).

I think the configuration options we present to users, the agent handlebar config templates, and identifying the scaling model would all be easier if we could treat the two aws-s3 input use cases as independent inputs. Perhaps we add two alias names to the aws-s3 input in the spec, like aws-s3-polling and aws-s3-sqs. This would make it possible for a package developer to define separate inputs for each S3 use case (see the sketch below). This addresses a few issues we have:
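A hypothetical sketch of how such aliases might be declared in the spec (the `aliases` field does not currently exist; the names follow the comment above):

```yaml
# Illustrative only: exposing one underlying Beats input under two names
# so packages can model the two S3 use cases independently.
inputs:
  - name: aws-s3
    aliases:
      - name: aws-s3-sqs       # SQS notification mode; horizontally scalable
      - name: aws-s3-polling   # direct bucket polling; not horizontally scalable
```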

nimarezainia commented 1 month ago

Does the package spec need to be modified at all? There are only a handful of integrations/inputs that we would need to consider here, mainly pub/sub ones where we are either fed events by a conduit or read directly via polling.

@lucabelluccini Could we not just document the scaling model for the majority of these integrations?

I think separating aws-s3-polling and aws-s3-sqs is worth tracking as a separate issue.

lucabelluccini commented 1 month ago

Hello @nimarezainia. A first step might be documenting the scaling model; that would already be of great help. The problem is that docs often go stale, and the current integration docs would need a dedicated section for such a topic.

My manifest proposal was geared more towards a declarative approach by integration developers.

I am OK with either option, declaring the scalability at the input level or at the integration level. The important thing is to solve the problem of knowing the scalability model.

My suggestion of doing it at the integration / data stream level was to "hide" the implementation detail (for example, the input backing an integration / data stream might change in the future), since the end user rarely knows which input is used by each one.

If we're able to expose the scaling model based on the input used, then that is fine for me.

lucabelluccini commented 1 month ago

Discussed with @nimarezainia yesterday.

As this topic is related to integrations, I'm also looping in @daniela-elastic for the O11y-owned inputs.

andrewkroh commented 1 month ago

I think we should try to lean into automation so that these classifications for each integration don't require much work to maintain. I would like to see attributes like horizontal/vertical scaling, stateful/stateless, and e2e acknowledgement support tracked as metadata about the inputs we have (and kept near the input source). Then the reference docs for the inputs (e.g. the Filebeat docs) and the integrations docs could be derived from this metadata.

As an example, the simple tags that Vector adds to their input docs convey a lot of useful information.
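Along those lines, a metadata file kept next to the input source might look like this (hypothetical file layout and attribute names; the attribute values shown are illustrative):

```yaml
# Illustrative metadata living next to the input source in Beats.
# Both the Filebeat reference docs and the integration docs could be
# generated from it.
input: aws-s3
attributes:
  horizontal_scaling: conditional   # depends on SQS vs. polling mode
  vertical_scaling: true
  stateful: true                    # polling mode tracks object state
  e2e_acknowledgement: true
```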


> gcp-pubsub -> vertical scaling (num_goroutines) + horizontal scaling (subscription)

gcp-pubsub has the same scaling characteristics as aws-s3 (sqs) (horizontal). So whatever we list for s3 should be the same for pubsub.
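For reference, the two knobs map onto the input configuration roughly like this (a minimal sketch with placeholder values; `subscription.num_goroutines` tunes per-instance concurrency, while additional instances sharing the same subscription scale horizontally):

```yaml
# Minimal Filebeat gcp-pubsub input sketch; values are placeholders.
filebeat.inputs:
  - type: gcp-pubsub
    project_id: my-project
    topic: my-topic
    subscription.name: my-subscription   # shared by all instances -> horizontal
    subscription.num_goroutines: 4       # per-instance concurrency -> vertical
```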

nimarezainia commented 1 month ago

As a starter, let's modify the package spec to allow this information to be set by the package owner, and have it included in the auto-generated integration docs and the integrations plugin.