argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0
15.12k stars 3.21k forks source link

Proposal: add the ability of distributed Tracing #12077

Open LinuxSuRen opened 1 year ago

LinuxSuRen commented 1 year ago

Summary

Currently, we could leverage Prometheus to see some important metrics of Argo Workflow. But it would be wonderful if we could see the tracing, so it's helpful for checking potential neckbottle.

Use Cases

Apache Skywalking is a very popular tool which have both front-end and backend for the tracing feature. And the news is that we don't need modify any current code lines to archive that. Apache Skywalking Go Agent will do it during the compile process.

I'd be happy to do it if I could got positive feddback.


Message from the maintainers:

Love this enhancement proposal? Give it a 👍. We prioritise the proposals with the most 👍.

terrytangyuan commented 1 year ago

Can users use it directly if it doesn't require any modifications in Argo Workflows? What's left to be done in Argo?

LinuxSuRen commented 1 year ago

Users need to provide the endpoint of SkyWalking during the deployment process once Argo Workflow adds the related build flags like go build -toolexec="skywalking-go-agent".

It can be traced in a variety of ways. For now, it includes gRPC, HTTP, etc.

See an example https://github.com/LinuxSuRen/api-testing/blob/7228009f26d0ac5fdef19a25600e29e2610a74d6/Dockerfile#L40

So, below is the list of people need to do:

terrytangyuan commented 1 year ago

I feel like that type of instructions does not have to be in this repo.

LinuxSuRen commented 1 year ago

I didn't get your point. Could you be more specific?

terrytangyuan commented 1 year ago

So, below is the list of people need to do

That list includes the things users need to do, right? Then it should be live in SkyWalking's repo instead of argo-workflows repo. I don't think we want to add flags like "skywalking-go-agent" in this repo.

LinuxSuRen commented 1 year ago

skywalking-go-agent is a Golang build tool plugin, it could inject some intercepter code into the target project. So, the target project don't need to change the business code lines. Let me try to make it be more clear.

To end users, the could specific the endpoint of SkyWalking server which will collect the tracing data.

To the developer of Argo workflows, they need to update the compiling process. AKA modify the Makefile and Dockerfile.

agilgur5 commented 1 year ago

Optional distributed tracing support could potentially be good to have. It could be enabled by an environment variable and the collector URL and port can be specified in an environment variable as well. We'd want to add it to both the Controller and Server.

@LinuxSuRen Re: Apache Skywalking, is there a reason you prefer that over the standard OTel Go library?

Can users use it directly if it doesn't require any modifications in Argo Workflows? What's left to be done in Argo?

@terrytangyuan for distributed tracing, some library is needed to instrument the code and make it output traces. We should be vendor neutral, which is why the standard OTel libraries would make sense to me (though I would think Apache Skywalking is also vendor neutral). Similar to how Prometheus requires an application to have an Exporter for metrics.

so it's helpful for checking potential neckbottle.

For single application tracing, Argo does already support pprof via optional env var. Distributed tracing primarily helps with requests that span across multiple micro-services. Although it is also useful just to connect to the OTel tooling ecosystem

LinuxSuRen commented 1 year ago

Apache Skywalking, is there a reason you prefer that over the standard OTel Go library?

OpenTelemetry is great. However, it requires users to write the instrument code in Argo workflows. SkyWalking did that work automatically, and inject the code into Argo workflow. So, for the second one, it could keep Argo workflow's code be clean.

LinuxSuRen commented 1 year ago

I tried it locally. Share some screenshots here.

image image image image

agilgur5 commented 1 year ago

OpenTelemetry is great. However, it requires users to write the instrument code in Argo workflows. SkyWalking did that work automatically, and inject the code into Argo workflow. So, for the second one, it could keep Argo workflow's code be clean.

I took a look at the code for the Go Agent (gRPC for example) and I can see it does an AST walk to add interceptors. Although I see there are still are manual trace APIs, so there can still be code in Argo's codebase.

I also checked to see if it can be configured at run-time via env vars (e.g. for connecting to collectors) and it seems like it can be. Although it's not clear to me if it can be entirely disabled at run-time from the configuration -- there's no enabled: true config (as is common in Helm, for instance). If we were to instrument ourselves, we would not add instrumentation what-so-ever if tracing is not enabled via env var.

I tried it locally. Share some screenshots here.

Thanks for testing, was going to ask for some example usage.

Do you know if the Go agent can directly connect to OTel collectors? I would like to see if OTel tooling will work out-of-the-box.

It seems like Skywalking has its own trace format, as the Skywalking server has OTel and Zipkin converters for it to be able to interpret those formats.

For instance, it seems like Grafana Tempo and Jaeger both don't support Skywalking. Those are two popular OSS tools in the space. I also checked to see if a popular vendor like New Relic supported it, and literally couldn't find anything on Skywalking in New Relic's docs (no hits on a search what-so-ever, whereas there are quite a lot on OTel integration). The OTel Collector doesn't seem to support it in core either, though it looks like there are contrib components for a Skywalking Receiver and a Skywalking Exporter.

Connecting to common OSS tools and APM providers easily would be essential for any distributed tracing tooling, so if Skywalking does not make that easy and requires more server-side components, that might be a no-go. For example, the docs for using distributed tracing should be pretty straight forward, but if we're using an uncommon format, that would add a lot of overhead. If it doesn't add much to Argo's codebase but adds a bunch to users' codebases, that would be a sub-optimal trade-off.

LinuxSuRen commented 1 year ago

Thanks for the detailed reply. Users can choose if send the tracing data to SkyWalking server via env. At the meaning time, I'm still exploring OpenTelementry as well. The SK agent sending the data via SkyWalking gRPC interface, it seems do not support send data to OpenTelementry for now.

agilgur5 commented 1 year ago

Users can choose if send the tracing data to SkyWalking server via env.

Oh cool, I saw that the reporter had a discard config, no idea that you were the one that added it very recently!

I was thinking of going further than that though, don't create spans / run instrumentation at all if it's not enabled. Then there's no CPU or memory overhead if you're not using distributed tracing.

LinuxSuRen commented 1 year ago

don't create spans / run instrumentation at all if it's not enabled

I totally agree with you. Discarding the data is an easy solution. I'll try to find a way to disable it eventually. But no guraranteed time could be given.

wu-sheng commented 1 year ago

don't create spans / run instrumentation at all if it's not enabled

I totally agree with you. Discarding the data is an easy solution. I'll try to find a way to disable it eventually. But no guraranteed time could be given.

@LinuxSuRen Thanks for pinging me on this. I have followed the discussion, and want to add a little more. SkyWalking go agent will swtich to no tracing mode(https://github.com/apache/skywalking-go/blob/main/plugins/core/tracing.go#L229), which only create a noop span without further operations.

This should have reduced the cost as much as possible. Although in theory, it still cost a little more than no agent, due to stack operations(more methods invoked, and a new noop span).

cc SkyWalking go agent lead @mrproliu

agilgur5 commented 1 year ago

Saw this thread in the #argo-contributors Slack channel today and Argo CD apparently implemented OTel already. https://github.com/argoproj/argo-cd/pull/7539 was linked there, but that seems to be for gRPC specifically and it looks like there might be more code around it. We'd definitely want to stay as consistent as possible between Argo projects