lucabelluccini opened 2 months ago
It seems like the scaling model is generally tied to the inputs (Beats) in use by a given integration, and not so much the integration itself.
For instance, the S3 input is mentioned as not being horizontally scalable when the `queue_url` parameter is provided. Could we make an assumption that any input of type `aws/s3` with a non-null `queue_url` variable should result in a warning about the scalability model being displayed?
I'm sure there are other cases of this with "pull"-based inputs like `httpjson`, but it's not necessarily my area of expertise.
> We manually specify in the manifest what is the scaling model of the integration. We expose the scaling model in the docs and, if possible, in the Fleet Integrations UI.
While I think this makes sense as a path forward, getting broad adoption across integrations in order to source data like this is usually a long-lived challenge. It would require each integration maintainer to provide this information, produce a new version of their integration including an updated `format_version` to use the new `package-spec` fields, and then for users to upgrade to the new version of the integration in order to be presented with the new scalability data.
This seems like a lot of churn to get something like this done, so I wonder if we should consider a less involved approach, such as adding specific detection in the Fleet codebase where we detect these scalability concerns based on a "hardcoded" mapping of metadata around specific input types or variables.
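To make the "hardcoded mapping" idea concrete, here is a minimal sketch of what such detection could look like. This is purely illustrative: Fleet is written in TypeScript, and the `VERTICAL_ONLY` table and `scalability_warning` function are hypothetical names, not existing Fleet code. The mapping assumes, per the per-input list later in this thread, that `aws-s3` without `queue_url` (polling mode) is the vertical-only case, and treats `httpjson` as vertical-only based on the earlier comment about "pull"-based inputs.

```python
from typing import Callable, Dict, Optional

# Hypothetical hardcoded mapping: input type -> predicate over the input's
# variables that is True when the configuration only scales vertically.
# (aws-s3 polling mode = no queue_url; httpjson assumed cursor-based/pull.)
VERTICAL_ONLY: Dict[str, Callable[[dict], bool]] = {
    "aws-s3": lambda variables: variables.get("queue_url") is None,
    "httpjson": lambda variables: True,
}


def scalability_warning(input_type: str, variables: dict) -> Optional[str]:
    """Return a warning string if this input configuration scales vertically only."""
    predicate = VERTICAL_ONLY.get(input_type)
    if predicate is not None and predicate(variables):
        return f"input {input_type!r} is not horizontally scalable with this configuration"
    return None
```

The point of keeping this as a table plus predicate is that adding a new problematic input/variable combination is a one-line change, without touching the package spec.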
cc @nimarezainia I am going to assign this to you as it's in "Needs PM Prio" as well.
> It seems like the scaling model is generally tied to the inputs (Beats) in use by a given integration, and not so much the integration itself.
I agree. It would be simplest if we could identify the scaling model based solely on the input (without other caveats or special cases).
I think the configuration options we present to users, the agent handlebar config templates, and identifying the scaling model would all be easier if we could treat the two aws-s3 input use cases as independent inputs. Perhaps we add two alias names to the aws-s3 input in the spec, like `aws-s3-polling` and `aws-s3-sqs`. This would then make it possible for a package developer to have separate inputs for each S3 use case. This addresses a few issues we have:
Does the package spec need to be modified at all? There are only a handful of integrations/inputs that we would need to consider here: mainly pub/sub ones, where we are faced with a conduit that feeds us the events, and/or ones that read directly via polling.
@lucabelluccini Could we not just document the scaling model for the majority of these integrations?
I think separating `aws-s3-polling` and `aws-s3-sqs` is a warranted separate issue to deal with.
Hello @nimarezainia. A first step might be documenting the scaling model; it would already be of great help. The problem is that docs often go stale, and currently integration docs would need a dedicated section for such a topic.
My manifest proposal was more towards taking a declarative approach from integration developers.
For declaring the scalability at input level or integration level, I am ok with both options. The important thing is to solve the problem of knowing the scalability model.
My suggestion of doing it at the integration/data stream level was to "hide" the implementation detail (for example, the input used by an integration/data stream might change in the future), since the final user rarely knows which input each one uses.
If we're able to expose the scaling model based on the input used, then it is fine for me.
Discussed with @nimarezainia yesterday:

- (baseline) Introduce scalability documentation for all inputs:
  - `aws-cloudwatch` -> vertical scaling (local cursor stored in registry)
  - `aws-s3` (polling) -> vertical scaling (local cursor stored in registry, no way to sync concurrent consumers)
  - `aws-s3` (SQS) -> horizontal scaling (notification based, consumers ack the consumed events back to AWS)
  - `azure-eventhub` -> horizontal scaling (storage account + storage account container to store the consumers' state/cursor)
  - Azure Blob Storage input -> vertical scaling (workers)
  - `gcp-pubsub` -> vertical scaling (`num_goroutines`) + horizontal scaling (subscription)
  - Google Cloud Storage input -> vertical scaling (workers)
  - `salesforce` -> vertical scaling (local cursor stored in registry)
- (baseline) Expose to the user the documentation of the inputs used at the integration/data stream level within the Fleet UI.
- (enhancement) Introduce a non-mandatory scalability model attribute for inputs, e.g. `scalability_model`, so that it can be programmatically used by Fleet. It could contain tags such as `vertical`, `horizontal`, `vertical (num_goroutines)` or `horizontal (storage_account)`.
- (enhancement) Make use of the scalability model attribute to warn users that deploying a policy containing an integration/data stream that uses a vertically-scalable input to N > 1 agents is likely going to waste resources.
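As a sketch of the last enhancement: assuming a hypothetical `scalability_model` attribute holding tags like those listed above (the attribute does not exist yet, and `deployment_warnings` is an illustrative name, not Fleet code), the warning logic could look like this.

```python
from typing import List


def deployment_warnings(policy_inputs: List[dict], agent_count: int) -> List[str]:
    """Warn when a policy containing vertically-scalable inputs targets N > 1 agents."""
    warnings: List[str] = []
    if agent_count <= 1:
        return warnings
    for inp in policy_inputs:
        tags = inp.get("scalability_model", [])
        # Tags look like "vertical (num_goroutines)" or "horizontal (storage_account)";
        # an input with no "horizontal" tag cannot share work across agents.
        if tags and not any(tag.startswith("horizontal") for tag in tags):
            warnings.append(
                f"input {inp['type']!r} scales vertically only; deploying this policy "
                f"to {agent_count} agents is likely going to waste resources"
            )
    return warnings
```

Because the attribute is non-mandatory, inputs without it simply produce no warning, so existing packages keep working unchanged.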
As this topic is related to integrations, I'm also putting @daniela-elastic in the loop for the O11y-owned inputs.
I think we should try to lean into automation so that these classifications for each integration don't require much work to maintain. I would like to see attributes like horizontal/vertical scaling, stateful/stateless, and e2e acknowledgement support being tracked as metadata about the inputs we have (and kept near the input source). Then the reference docs for the inputs (e.g. Filebeat docs) and the integrations docs can derive from this metadata.
As an example, the simple tags that Vector adds to their input docs convey a lot of useful information.
> gcp-pubsub -> vertical scaling (num_goroutines) + horizontal scaling (subscription)

gcp-pubsub has the same scaling characteristics as `aws-s3` (SQS) (horizontal). So whatever we list for s3 should be the same for pubsub.
As a starter let's modify the package spec to allow for this information to be set by the package owner. And for it to be included in the auto-generated integrations docs/integrations plugin.
Problem
Users are interested in knowing the scaling model of integrations / data streams. Examples:
Possible proposal (mitigation)
We manually specify in the manifest what is the scaling model of the integration. We expose the scaling model in the docs and, if possible, in the Fleet Integrations UI.
Possible proposal (long term)
Each input should have a metadata/spec where it claims its scaling model.
The integration package manifest checks the inputs in order to automatically generate the scaling model of the integration / data stream.
Example:
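A minimal sketch of the long-term proposal, with each input claiming its scaling model in its own spec and the data stream's model derived automatically from the inputs it uses. All names here are illustrative assumptions (`INPUT_SPECS`, `data_stream_scaling_model`, and the `aws-s3-sqs` alias from the earlier comment), not existing package-spec fields.

```python
from typing import Dict, List

# Hypothetical per-input metadata, as each input would declare it in its
# own spec; values follow the per-input list earlier in this thread.
INPUT_SPECS: Dict[str, List[str]] = {
    "aws-s3-sqs": ["horizontal"],
    "aws-cloudwatch": ["vertical"],
    "gcp-pubsub": ["vertical (num_goroutines)", "horizontal (subscription)"],
}


def data_stream_scaling_model(input_types: List[str]) -> str:
    """Derive a data stream's scaling model from the inputs it uses: it can
    only be advertised as horizontally scalable if every input supports it."""
    models = [INPUT_SPECS[t] for t in input_types]
    if all(any(tag.startswith("horizontal") for tag in tags) for tags in models):
        return "horizontal"
    return "vertical"
```

This keeps the declaration close to the input implementation, while the aggregation step "hides" the implementation detail from the end user, as discussed above.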
FYI @lalit-satapathy @jsoriano @zmoog