elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.
Other
120 stars 129 forks source link

[OTel] Additional components for the Elastic Distro #5092

Closed ChrsMark closed 2 days ago

ChrsMark commented 3 weeks ago

Describe the enhancement:

K8s is one of the most used envs according to https://opentelemetry.io/blog/2024/otel-collector-survey/#deployment-scale-and-environment, hence it makes sense to consult what the community distro for k8s includes. We can evaluate the list of the k8s-distro components and consider including them as well in the Elastic distro: https://github.com/open-telemetry/opentelemetry-collector-releases/blob/main/distributions/otelcol-k8s/manifest.yaml

/cc @elastic/otel-devs

Describe a specific use case for the enhancement or feature:

What is the definition of done?

Taken from https://github.com/elastic/elastic-agent/issues/5092#issuecomment-2225517430:

ycombinator commented 3 weeks ago

@strawgate @AlexanderWert What is the priority for this issue? Which release do we want to target it to?

AlexanderWert commented 3 weeks ago

We can evaluate the list of the k8s-distro components and consider including them as well in the Elastic distro

Looking at the list of components it seems like there are still many components that sound less relevant for our distro (like spanmetricsconnector, servicegraphconnector, etc.). As you wrote, I think we need to evaluate and come up with our own list inspired by that list.

@strawgate @AlexanderWert What is the priority for this issue? Which release do we want to target it to?

I think 8.15.2 would be reasonable for that, since that's when we plan to have a broader k8s integration for OTel.

ycombinator commented 3 weeks ago

Looking at the list of components it seems like there are still many components that sound less relevant for our distro (like spanmetricsconnector, servicegraphconnector, etc.). As you wrote, I think we need to evaluate and come up with our own list inspired by that list.

What would the criteria be for deciding whether to include a component from the list in our distro vs. not include it?

ChrsMark commented 3 weeks ago

What would the criteria be for deciding whether to include a component from the list in our distro vs. not include it?

I would consider as required those that are defined by default in the upstream's Helm Charts: https://github.com/open-telemetry/opentelemetry-helm-charts/blob/main/charts/opentelemetry-collector/values.yaml

With this users that already use the upstream charts could switch to the Elastic distro by just defining the image properly:

helm install my-opentelemetry-collector open-telemetry/opentelemetry-collector --set mode=daemonset --set image.repository="docker.elastic.co/beats/elastic-agent" --set image.tag="8.15.0-SNAPSHOT" --set command.name="/usr/share/elastic-agent/elastic-agent" --set command.extraArgs="{otel, -c, --config=etc/elastic-agent/otel.yaml}"

The above gives me the following in the failed Collector's Pod logs:

Starting in otel mode
1 error occurred:
    * failed to get config: cannot unmarshal the configuration: decoding failed due to the following error(s):

error decoding 'receivers': unknown type: "prometheus" for id: "prometheus" (valid values: [otlp filelog kubeletstats k8s_cluster hostmetrics httpcheck])
error decoding 'processors': unknown type: "memory_limiter" for id: "memory_limiter" (valid values: [transform filter k8sattributes elasticinframetrics resourcedetection batch resource attributes])
error decoding 'extensions': unknown type: "health_check" for id: "health_check" (valid values: [file_storage memory_limiter])

A working PoC: https://github.com/elastic/opentelemetry-dev/pull/331

rogercoll commented 1 week ago

As defined in the collector best practices docs, the memory_limiter processor should also be included to efficiently manage the resource pipelines: https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-around-resource-utilization

ycombinator commented 1 week ago

As defined in the collector best practices docs, the memory_limiter processor should also be included to efficiently manage the resource pipelines: https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-around-resource-utilization

~In https://github.com/elastic/elastic-agent/pull/4689, we explicitly chose the memory_limiter extension over the processor because the extension doesn't drop events but instead puts backpressure on the receivers. Do you think we should revert that decision and go with the processor again? cc: @michalpristas~

[UPDATE] Chatting with @cmacknz off-issue, there's no reason not to include both the extension and the processor, if we need the processor for our collector distro to work on k8s, so will keep the memory_limiter processor in the list of additional components.

rogercoll commented 1 week ago

@ycombinator I was checking the extension and from the docs it seems that in the future it should be used in favor of the processor:

The memory limiter extension is used to prevent out of memory situations on the collector. The extension will potentially replace the Memory Limiter Processor.

But the extension is currently nop:

The extension is under development and does nothing.

I suggested adding the processor because the upstream Helm Chart requires it. Opening an issue to remove the dependency in Helm https://github.com/open-telemetry/opentelemetry-helm-charts/issues/1272

michalpristas commented 1 week ago

let's also monitor size of binary, difference between core collector and full contrib collector is a bit over 300MB @cmacknz do we find this concerning?

cmacknz commented 1 week ago

Including the collector has to make agent larger, and we need more functionality than what the core collector has, so this is really unavoidable. Be mindful of what components we include to make sure they have a purpose (e.g. we probably don't need everything in contrib) but there's not much else we can do in the short term.

rogercoll commented 1 week ago

From the Helm deployment perspective, only the health_check extension is required in the values configuration. Other components (jaeger, zipkin, etc.) can be removed from the sample configuration by setting their value to null: https://github.com/open-telemetry/opentelemetry-helm-charts/tree/main/charts/opentelemetry-collector#basic-top-level-configuration

Meanwhile, the health_check requirement can also be skipped in the Helm deployment by using a custom configmap (demo example)

ChrsMark commented 3 days ago

FYI a working example can be found at https://github.com/elastic/opentelemetry-dev/pull/331, which uses https://github.com/elastic/elastic-agent/pull/5126.