DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

Setup agent to support OTLP ingest does not work. #11664

Open samholder opened 2 years ago

samholder commented 2 years ago

Output of the info page (if this is a bug)

Don't think this is a bug, just unsure how to configure correctly

Describe what happened:

This is running on a Windows machine with the agent installed (7.32.4.1), which reports in the logs

Listening for traces at http://localhost:8126

and our application is sending traces via APM currently.

I am trying to add support for OpenTelemetry alongside the current APM setup.

I followed the instructions here: https://docs.datadoghq.com/tracing/setup_overview/open_standards/#otlp-ingest-in-datadog-agent and added this config to datadog.yaml:

experimental:
  otlp:
    receiver:
      protocols:
        grpc:
        http:

When I restarted the agent I saw this in the logs:

2022-04-13 15:10:48 BST | TRACE | WARN | (pkg/util/log/log.go:630 in func1) | Unknown key in config file: experimental.otlp.receiver.protocols.grpc
2022-04-13 15:10:48 BST | TRACE | WARN | (pkg/util/log/log.go:630 in func1) | Unknown key in config file: experimental.otlp.receiver.protocols.http

I searched a little and found this issue: https://github.com/DataDog/helm-charts/issues/529

which seems to imply this feature is no longer experimental; however, I was not able to get it to work. Things I tried:

Using this config in the agent's datadog.yaml:

otlp:
  receiver:
    protocols:
      grpc:
      http:

but I got basically the same error as above.
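
As a later Agent warning in this thread spells out, the stable key ended up being otlp_config rather than otlp or experimental.otlp. A minimal sketch of what that section looks like in datadog.yaml on Agent 7.35+; the 4317/4318 endpoints here are conventional defaults, not something verified in this issue:

# datadog.yaml -- sketch of the stable OTLP ingest section (Agent 7.35+)
otlp_config:
  receiver:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318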

Setting the following environment variables (based on the other issue):

OTEL_EXPORTER_OTLP_ENDPOINT to http://localhost:4317
DD_OTLP_HTTP_PORT to 4317
DD_OTLP_GRPC_PORT to 4318
OTLP_COLLECTOR to http://localhost:4317

but after restarting the agent I see no other messages apart from:

Listening for traces at http://localhost:8126

which implies this did not work. I also tried the app just in case.
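
The DD_OTLP_*_PORT variables were already deprecated at this point; per the deprecation warning quoted later in this thread, the stable replacements set a full endpoint and follow the usual DD_ mapping of the otlp_config keys. A docker-compose style sketch of those, with the variable names being an assumption based on that mapping (on a Windows host they would be set as system environment variables instead):

# environment block for a containerized agent; names assumed from the otlp_config key mapping
environment:
  - DD_OTLP_CONFIG_RECEIVER_PROTOCOLS_GRPC_ENDPOINT=0.0.0.0:4317
  - DD_OTLP_CONFIG_RECEIVER_PROTOCOLS_HTTP_ENDPOINT=0.0.0.0:4318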

The app is a .NET 6 web app with this configuration:

services.AddOpenTelemetryTracing(
    builder =>
    {
        builder
            .SetSampler(new AlwaysOnSampler())
            .AddSource("MySource")
            .SetResourceBuilder(
                ResourceBuilder.CreateDefault()
                    .AddService(serviceName: "MyService", serviceVersion: "1.0.0"))
            .AddOtlpExporter(config =>
            {
                config.Endpoint = new Uri("http://localhost:4317");
            })
            .AddNServiceBusInstrumentation();
    });

Describe what you expected:

That following the instructions for enabling OTLP ingest would work correctly, or that alternative instructions would be available if the feature is no longer experimental.

Steps to reproduce the issue: As above

Additional environment details (Operating System, Cloud provider, etc):

Locally hosted Windows Server 2019 VM
Datadog Agent 7.32.4.1
Datadog .NET tracer 64-bit 1.27.1

samholder commented 2 years ago

After a bit more investigation: if I enable debug logging, then I get this in the logs:

2022-04-14 12:19:37 BST | TRACE | DEBUG | (pkg/trace/api/otlp.go:97 in Start) | OpenTelemetry gRPC receiver running on localhost:5003 (internal use only)

and if I configure things to point at this endpoint then I get some traces being logged, but they appear to get tagged as internal, so I assume this is not the right thing to be doing...

tomerfriedman commented 2 years ago

In the same boat here. I can't get this to work. It seems like I can't access these ports (4318, 4317). However, I can access 8126 just fine.

guizmaii commented 2 years ago

Same here.

I can see that my traces are correctly exported from my app to the agent: I have a "debug trace" sent every 2s. Because the OpenTelemetry OtlpGrpcSpanExporter class I'm using uses OkHttp underneath, and because DD APM automatically instruments OkHttp, I can see the traces of the exports in the Datadog UI. These traces trace the HTTP calls made by the internals of OtlpGrpcSpanExporter, and I can see that the responses to these calls are 200.

[Screenshot: Screen Shot 2022-04-15 at 12 40 28 pm]

But then, I cannot see anything in the Agent logs or in the Datadog UI.

Datadog Agent v7.35.0
Datadog APM v0.99.0
OpenTelemetry Java lib (io.opentelemetry.opentelemetry-exporter-otlp) v1.13.0

Here are the logs of my agent regarding OTLP:

"2022-04-14T11:39:06.000Z","2022-04-14 11:39:06 UTC | CORE | INFO | (pkg/util/log/log.go:572 in func1) | runtime: final GOMAXPROCS value is: 4"
"2022-04-14T11:39:06.000Z","2022-04-14 11:39:06 UTC | CORE | WARN | (pkg/util/log/log.go:587 in func1) | OTLP ingest configuration is now stable and has been moved out of the ""experimental"" section. This section will be removed in the 7.37 Datadog Agent release. Please use the ""otlp_config"" section instead.The DD_OTLP_GRPC_PORT and DD_OTLP_HTTP_PORT environment variables will also be removed in 7.37; set the full endpoint instead."
"2022-04-14T11:39:06.000Z","2022-04-14 11:39:06 UTC | CORE | WARN | (pkg/util/log/log.go:592 in func1) | failed to get configuration value for key ""experimental.otlp"": unable to cast <nil> of type <nil> to map[string]interface{}"
"2022-04-14T11:39:06.000Z","2022-04-14 11:39:06 UTC | CORE | INFO | (pkg/util/log/log.go:572 in func1) | Features detected from environment: containerd,kubernetes,docker,cri"
"2022-04-14T11:39:06.000Z","2022-04-14 11:39:06 UTC | CORE | INFO | (cmd/agent/app/run.go:249 in StartAgent) | Starting Datadog Agent v7.35.0"
...
"2022-04-14T11:39:07.000Z","2022-04-14 11:39:07 UTC | CORE | INFO | ([collector@v0.44.0](mailto:collector@v0.44.0)/service/internal/builder/exporters_builder.go:255 in buildExporter) | kind:exporter,name:otlp | Exporter was built."
...

"2022-04-14T11:39:07.000Z","2022-04-14 11:39:07 UTC | CORE | INFO | ([collector@v0.44.0](mailto:collector@v0.44.0)/service/internal/builder/receivers_builder.go:226 in attachReceiverToPipelines) | kind:receiver,name:otlp,datatype:traces | Receiver was built."
"2022-04-14T11:39:07.000Z","2022-04-14 11:39:07 UTC | CORE | INFO | ([collector@v0.44.0](mailto:collector@v0.44.0)/service/internal/builder/receivers_builder.go:226 in attachReceiverToPipelines) | kind:receiver,name:otlp,datatype:metrics | Receiver was built."
...

"2022-04-14T11:39:07.000Z","2022-04-14 11:39:07 UTC | CORE | INFO | ([collector@v0.44.0](mailto:collector@v0.44.0)/service/internal/builder/exporters_builder.go:48 in Start) | kind:exporter,name:otlp | Exporter started."
"2022-04-14T11:39:07.000Z","2022-04-14 11:39:07 UTC | CORE | INFO | ([collector@v0.44.0](mailto:collector@v0.44.0)/service/internal/builder/exporters_builder.go:40 in Start) | kind:exporter,name:otlp | Exporter is starting..."
...

"2022-04-14T11:39:07.000Z","2022-04-14 11:39:07 UTC | CORE | INFO | ([collector@v0.44.0](mailto:collector@v0.44.0)/service/internal/builder/receivers_builder.go:73 in StartAll) | kind:receiver,name:otlp | Receiver started."
"2022-04-14T11:39:07.000Z","2022-04-14 11:39:07 UTC | CORE | INFO | ([collector@v0.44.0](mailto:collector@v0.44.0)/receiver/otlpreceiver/otlp.go:87 in startHTTPServer) | kind:receiver,name:otlp | Starting HTTP server on endpoint 0.0.0.0:55681"
"2022-04-14T11:39:07.000Z","2022-04-14 11:39:07 UTC | CORE | INFO | ([collector@v0.44.0](mailto:collector@v0.44.0)/receiver/otlpreceiver/otlp.go:147 in startProtocolServers) | kind:receiver,name:otlp | Setting up a second HTTP listener on legacy endpoint 0.0.0.0:55681"
"2022-04-14T11:39:07.000Z","2022-04-14 11:39:07 UTC | CORE | INFO | ([collector@v0.44.0](mailto:collector@v0.44.0)/receiver/otlpreceiver/otlp.go:87 in startHTTPServer) | kind:receiver,name:otlp | Starting HTTP server on endpoint 0.0.0.0:4318"
"2022-04-14T11:39:07.000Z","2022-04-14 11:39:07 UTC | CORE | INFO | ([collector@v0.44.0](mailto:collector@v0.44.0)/receiver/otlpreceiver/otlp.go:69 in startGRPCServer) | kind:receiver,name:otlp | Starting GRPC server on endpoint 0.0.0.0:4317"
"2022-04-14T11:39:07.000Z","2022-04-14 11:39:07 UTC | CORE | WARN | ([zap@v1.20.0](mailto:zap@v1.20.0)/sugar.go:107 in Warn) | grpc_log:true | grpc: addrConn.createTransport failed to connect to {localhost:5003 <nil> 0 <nil>}. Err: connection error: desc = ""transport: Error while dialing dial tcp 127.0.0.1:5003: connect: connection refused"". Reconnecting..."
"2022-04-14T11:39:07.000Z","2022-04-14 11:39:07 UTC | CORE | INFO | ([collector@v0.44.0](mailto:collector@v0.44.0)/service/internal/builder/receivers_builder.go:68 in StartAll) | kind:receiver,name:otlp | Receiver is starting..."
...
pj-datadog commented 2 years ago

Hi, I am a Product Manager at Datadog. We recently declared OTLP Ingest in the Datadog Agent (sending telemetry data from the OTel SDK to the DD Agent) stable/GA; it is available with agent version 7.35. Clear and descriptive documentation for this feature will be available in our public docs by the middle of next week. This should solve the setup problems you are facing, but if further support is needed, I will also be happy to connect you to our engineering team. Thank you for choosing Datadog as your APM vendor of choice!

guizmaii commented 2 years ago

For anyone reading this issue, I figured my issue out.

The traces were correctly sent to Datadog. My configuration (explained here: https://github.com/DataDog/helm-charts/issues/529#issuecomment-1099421478) works perfectly well.

I wasn't finding my traces because I was looking for them under env: staging in the Datadog APM UI (where I can see all the other traces of my app), while they were reported under env: none.

To fix this env issue, I configured the ResourceAttributes.DEPLOYMENT_ENVIRONMENT in my SdkTracerProvider instance:

val serviceName = "my-service"
val env = "staging"

val resource =
  Resource
    .builder()
    .put(ResourceAttributes.SERVICE_NAME, serviceName)
    .put(ResourceAttributes.DEPLOYMENT_ENVIRONMENT, env)
    .build()

...

val tracerProvider =
  SdkTracerProvider
    .builder()
    .addSpanProcessor(spanProcessor)
    .setResource(resource)
    .build()
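
An alternative that avoids touching code is the standard OpenTelemetry resource environment variables; a docker-compose style sketch, where the service name and environment values are placeholders:

# standard OTel SDK resource env vars; values here are placeholders
environment:
  - OTEL_SERVICE_NAME=my-service
  - OTEL_RESOURCE_ATTRIBUTES=deployment.environment=staging,service.version=1.0.0
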
iamharvey commented 2 years ago

> Hi, I am a Product Manager at Datadog. We recently declared OTLP Ingest in the Datadog Agent (sending telemetry data from the OTel SDK to the DD Agent) stable/GA; it is available with agent version 7.35. Clear and descriptive documentation for this feature will be available in our public docs by the middle of next week. This should solve the setup problems you are facing, but if further support is needed, I will also be happy to connect you to our engineering team. Thank you for choosing Datadog as your APM vendor of choice!

What is the difference between Datadog's OTel agent and the standard OTel Collector? Currently we are using the otel-collector with the exporters set to datadog. Will the release influence our current use?

duxing commented 2 years ago

Hi folks, I've spent a good amount of time working on this topic recently and I'd like to share what I've found out here.

TL;DR version of the issue: the agent listens on 4317/4318 (OTLP receiver) correctly, but the traces were not sent to the trace-agent (on internal port 5003).

my setup:

Before attempting to use OTLP ingestion on datadog-agent, my application was auto-instrumented with opentelemetry-exporter-datadog (being deprecated): I was able to see APM traces in my Datadog account.

When I replaced opentelemetry-exporter-datadog with opentelemetry-exporter-otlp (pointing to port 4317), I lost APM traces in my Datadog account. The same API key was used as before, and the rest of the application code remained the same.

Some details for the opentelemetry-exporter-otlp implementation:

What I observed:

Since the agent ingests OTLP and exports to the trace-agent in OTLP, I tried pointing my application's OTLP exporter at port 5003 instead, and this failed too.

I've set DD_LOG_LEVEL to trace and inspected /var/log/datadog/agent.log and /var/log/datadog/trace-agent.log, but nothing looks suspicious. The only related thing is a "connection refused" error message when the agent is trying to connect to the trace-agent before the trace-agent starts listening on the internal gRPC port.

It would be great to have a working example published from DataDog to help us understand what's missing for traces.
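
For what it's worth, a minimal containerized sketch of what the Agent side of such an example might look like; the image tag, ports, and env var names here are assumptions rather than an official example:

# docker-compose sketch; DD_API_KEY is a placeholder
services:
  datadog-agent:
    image: gcr.io/datadoghq/agent:7
    environment:
      - DD_API_KEY=${DD_API_KEY}
      - DD_APM_ENABLED=true
      - DD_OTLP_CONFIG_RECEIVER_PROTOCOLS_GRPC_ENDPOINT=0.0.0.0:4317
    ports:
      - "4317:4317"   # OTLP gRPC in
      - "8126:8126"   # classic APM intake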

pj-datadog commented 2 years ago

@iamharvey Today there are two methods a customer can use to send their telemetry data to Datadog.

Method 1: OTLP Ingest in Datadog Agent - a way to send telemetry data from OTel SDKs directly to the Datadog Agent.

Method 2: OTel Collector Datadog Exporter - a way to send telemetry data from OTel SDKs to the OTel Collector, which exports the data to the Datadog backend via the Datadog Exporter.

If you are using the OTel Collector Datadog Exporter method, the release (GAing OTLP Ingest) will not influence your use.

NOTE: I am happy to announce that OTLP Ingest in Datadog Agent is now GA/Stable with Datadog Agent version 7.35
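
For reference, a minimal sketch of the method 2 pipeline (OTel Collector receiving OTLP and exporting to Datadog); the API key handling is an assumption, not taken from this issue:

# Collector config sketch; DD_API_KEY is a placeholder
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  datadog:
    api:
      key: ${DD_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [datadog]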

guizmaii commented 2 years ago

> I am happy to announce that OTLP Ingest in Datadog Agent is now GA/Stable with Datadog Agent version 7.35

Not officially available yet in the Helm Chart ;)

duxing commented 2 years ago

@pj-datadog I love the idea of method 1 making adoption easier. I've tested this setup but ran into an issue with traces (reported in #11737), and it seems like I'm not alone.

I have a gut feeling that this is caused by misconfiguration rather than a bug in datadog-agent. It would be great to have more examples / documentation to refer to.

pj-datadog commented 2 years ago

@duxing Can you please reach out to Datadog Support and open a Zendesk ticket there? I asked my engineering team to look at your use case; we feel we will need a debug flare.

> would be great to have more examples / documentation to refer to

What kind of documentation do you have in mind? I can work with you to have that in our public documentation if you think that will be helpful for the larger community. Thank You!

duxing commented 2 years ago

@pj-datadog A support ticket has been open for a few days now: https://help.datadoghq.com/hc/en-us/requests/789265?page=1

details are cross-referenced

> we feel we will need a debug flare

I don't think you need that from me. See the README in the repo I linked in the referenced issue #11737; once you git clone the repo, you should be able to get what you need.

> What kind of documentation do you have in mind?

Mainly examples, I guess. The only example I found was the gist that @gbbr provides (for Golang). More examples supporting more languages would be really nice to have.

mackjmr commented 2 years ago

@guizmaii Currently you can set this manually via environment variables in Helm (https://docs.datadoghq.com/tracing/setup_overview/open_standards/otlp_ingest_in_the_agent/?tab=kuberneteshelm), but we are actively working on a dedicated configuration section to make this easier.
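
A sketch of what that looks like in the chart values at the time; the env var names are assumed from the DD_ mapping of the otlp_config keys, and the port still has to be exposed on the agent container (e.g. via a host port):

# values.yaml sketch; env var names are an assumption
datadog:
  env:
    - name: DD_OTLP_CONFIG_RECEIVER_PROTOCOLS_GRPC_ENDPOINT
      value: "0.0.0.0:4317"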

mx-psi commented 2 years ago

As I said over at #11737, thank you @duxing for the detailed comments and repro, it was really helpful. @duxing's example with the patch from duxing/datadog-otlp#1 is a working example of how to use OTLP ingest with traces in a containerized setting.

tohadar commented 2 years ago

I tried to use method 2 with the 'Alongside Datadog Agent' section and it didn't work. I tried to deploy OpenTelemetry as a DaemonSet alongside the Datadog Agent DaemonSet, but wasn't able to succeed since both DaemonSets try to listen on the same host ports (4317, 4318). I also tried to deploy OpenTelemetry as a Deployment and to use the otlp exporter:

exporters:
  otlp:
    endpoint: "${HOST_IP}:4317"

But I got TLS errors. Was anyone able to configure it and send traces?

mx-psi commented 2 years ago

Note that you don't need to deploy the OpenTelemetry Collector to use OpenTelemetry. You can just use the Datadog Agent if you want and point your application to send telemetry data to the Agent.

> Also tried to deploy OpenTelemetry as a Deployment and to use the otlp exporter: But got TLS errors.

If you still want to use the Collector and the Agent, you can disable TLS by doing this:

exporters:
  otlp:
    endpoint: "${HOST_IP}:4317"
    tls:
      insecure: true

This should be safe to do if communication happens locally.

tohadar commented 2 years ago

> Note that you don't need to deploy the OpenTelemetry Collector to use OpenTelemetry. You can just use the Datadog Agent if you want and point your application to send telemetry data to the Agent.
>
> > Also tried to deploy OpenTelemetry as a Deployment and to use the otlp exporter: But got TLS errors.
>
> If you still want to use the Collector and the Agent, you can disable TLS by doing this:
>
> exporters:
>   otlp:
>     endpoint: "${HOST_IP}:4317"
>     tls:
>       insecure: true
>
> This should be safe to do if communication happens locally.

Do I set tls.insecure: true in the OpenTelemetry Helm chart? Also, if I don't want to disable TLS, what should I do?

Adding the error we get:

}. Err: connection error: desc = "transport: authentication handshake failed: tls: first record does not look like a TLS handshake" {"grpc_log": true}
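
If you want to keep TLS rather than set insecure: true, the Collector's OTLP gRPC exporter accepts the standard TLS client settings; a sketch, assuming the receiving end actually serves a certificate that the given CA can verify (whether the Agent's OTLP receiver can be configured to serve TLS is not covered in this thread):

exporters:
  otlp:
    endpoint: "${HOST_IP}:4317"
    tls:
      ca_file: /etc/otel/certs/ca.pem   # assumption: CA bundle that signed the server certificate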

vivere-dally commented 1 year ago

Can the agent also ingest simple metrics (i.e., counters)?

TimoSchmechel commented 1 year ago

@duxing did you deploy the DD agent via helm?

I am currently facing the same issue, but only after changing over to the Datadog Operator. My previous setup was the DD agent via Helm, which received OTLP from the OTel Collector. All I did was change the DD agent to be deployed via the DatadogAgent resource from the Operator, and now the agent no longer forwards traces that it receives from the collector.

I've "fixed" the issue by just exporting to DD directly, using the DD exporter inside the OTel Collector.

duxing commented 1 year ago

@TimoSchmechel Not for my setup. I'm running my setup locally with docker-compose; deploying the same change through Helm to the staging/production environments would have been the next step.

I think the issue is in the agent image/binary itself, not the chart. However, there's not a good way to reproduce this issue. The test project (set up with Docker) I put together consistently surfaces this issue on my end, but did not yield any error on Datadog's side.

dagnabbitall commented 1 year ago

In my case the issue was that the trace-agent (the component that listens on 5003/tcp) is disabled by default when using the Datadog Operator. Traces were making it to the external OTLP endpoints (4317/4318) on the agent, then getting blackholed at the trace-agent port (5003).

Setting features.apm.enabled to true in my DatadogAgent manifest turned on the trace-agent container and fixed my issue.
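
A sketch of the relevant part of such a DatadogAgent manifest; the otlp feature block is assumed from the v2alpha1 spec and may differ by Operator version:

apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  features:
    apm:
      enabled: true   # turns on the trace-agent, which the OTLP trace path needs
    otlp:
      receiver:
        protocols:
          grpc:
            enabled: true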

@pj-datadog this may be helpful to document here for those of us that didn't already have APM enabled: https://docs.datadoghq.com/opentelemetry/otlp_ingest_in_the_agent/?tab=host

IMO it doesn't really make sense that you can turn on the OTLP receiver without also turning on the trace-agent.

changhyuni commented 3 months ago

Has this issue been resolved? I followed your example for the Helm chart, but it doesn't work.

datadog:
  otlp:
    receiver:
      protocols:
        http:
          enabled: true
          endpoint: "0.0.0.0:4318"
          useHostPort: true

I can only see this log:

UTC | TRACE | DEBUG | (pkg/trace/api/otlp.go:91 in Start) | Listening to core Agent for OTLP traces on internal gRPC port (http://0.0.0.0:5003, internal use only). Check core Agent logs for information on the OTLP ingest status.

Looking at the pod manifest, the port specification and variables look like they are set up correctly.
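
For comparison, a sketch of the same values with the gRPC protocol enabled as well; the field names under grpc are assumed to mirror the http block and may differ by chart version:

datadog:
  otlp:
    receiver:
      protocols:
        grpc:
          enabled: true
          endpoint: "0.0.0.0:4317"
        http:
          enabled: true
          endpoint: "0.0.0.0:4318"
          useHostPort: true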