DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

Agent manually installed on Docker (Debian) but traces not handled #2182

Closed mblasi closed 6 years ago

mblasi commented 6 years ago

Describe what happened:

I'm trying to get agent traces working on the App Engine flexible environment. My instances run Debian 9. I installed the agent using the manual steps:

RUN apt-get update
RUN apt-get install -y gnupg apt-transport-https
RUN sh -c "echo 'deb https://apt.datadoghq.com/ stable 6' > /etc/apt/sources.list.d/datadog.list"
RUN apt-key adv --recv-keys --keyserver hkp://keyserver.ubuntu.com:80 382E94DE
ADD datadog.yaml /etc/datadog-agent/datadog.yaml
RUN apt-get update
RUN apt-get install datadog-agent

The agent starts, but it looks like no traces are handled. I see this in my application log:

2018/08/19 00:01:25 errors.go:72: Datadog Exporter error: Post http://localhost:8126/v0.3/traces: dial tcp [::1]:8126: connect: connection refused (x2)

I tried to diagnose the status by adding a last line in the installation process (Dockerfile):

RUN datadog-agent status

But it fails:

Error: Get https://localhost:5001/agent/status: dial tcp 127.0.0.1:5001: connect: connection refused

Describe what you expected:

I expect to have the agent and traces running on my app instances.

Steps to reproduce the issue:

Additional environment details (Operating System, Cloud provider, etc):

OS: Debian 9
Cloud: Google Cloud App Engine flexible
Runtime: custom
Language: golang

My datadog.yaml enables the traces:

apm_config:
  enabled: true
  receiver_port: 8126

My Dockerfile exposes every port I think it should:

EXPOSE 8125/udp
EXPOSE 8126/tcp
EXPOSE 5001/tcp

mblasi commented 6 years ago

Here is the agent installation log:

Step 18/28 : RUN sh -c "echo 'deb https://apt.datadoghq.com/ stable 6' > /etc/apt/sources.list.d/datadog.list"
 ---> Running in 939e75ec4a5f
Removing intermediate container 939e75ec4a5f
 ---> c671baafc875
Step 19/28 : RUN apt-key adv --recv-keys --keyserver hkp://keyserver.ubuntu.com:80 382E94DE
 ---> Running in 3723eaf21433
Warning: apt-key output should not be parsed (stdout is not a terminal)
Executing: /tmp/apt-key-gpghome.uFtL7Vggj3/gpg.1.sh --recv-keys --keyserver hkp://keyserver.ubuntu.com:80 382E94DE
gpg: key D3A80E30382E94DE: public key "Datadog, Inc <package@datadoghq.com>" imported
gpg: Total number processed: 1
gpg:               imported: 1
Removing intermediate container 3723eaf21433
 ---> 151fd230c251
Step 20/28 : ADD datadog.yaml /etc/datadog-agent/datadog.yaml
 ---> 5a932e2f4ace
Step 21/28 : RUN apt-get update
 ---> Running in ed731c33645f
Hit:1 http://security.debian.org stretch/updates InRelease
Ign:2 http://cdn-fastly.deb.debian.org/debian stretch InRelease
Ign:3 https://apt.datadoghq.com stable InRelease
Get:4 https://apt.datadoghq.com stable Release [4525 B]
Get:5 https://apt.datadoghq.com stable Release.gpg [819 B]
Hit:6 http://cdn-fastly.deb.debian.org/debian stretch-updates InRelease
Get:7 https://apt.datadoghq.com stable/6 amd64 Packages [4015 B]
Hit:8 http://cdn-fastly.deb.debian.org/debian stretch Release
Fetched 9359 B in 0s (16.0 kB/s)
Reading package lists...
Removing intermediate container ed731c33645f
 ---> cb99c33b7e51
Step 22/28 : RUN apt-get install datadog-agent
 ---> Running in 57bc77d041f5
Reading package lists...
Building dependency tree...
Reading state information...
The following NEW packages will be installed:
  datadog-agent
0 upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
Need to get 99.7 MB of archives.
After this operation, 341 MB of additional disk space will be used.
Get:1 https://apt.datadoghq.com stable/6 amd64 datadog-agent amd64 1:6.4.2-1 [99.7 MB]
debconf: delaying package configuration, since apt-utils is not installed
Fetched 99.7 MB in 1s (63.0 MB/s)
Selecting previously unselected package datadog-agent.
(Reading database ... 7128 files and directories currently installed.)
Preparing to unpack .../datadog-agent_1%3a6.4.2-1_amd64.deb ...
Unpacking datadog-agent (1:6.4.2-1) ...
Setting up datadog-agent (1:6.4.2-1) ...
Creating dd-agent group
Creating dd-agent user
Enabling service datadog-agent
(Re)starting datadog-agent now...
Removing intermediate container 57bc77d041f5
 ---> 2ae3a59e8736

hkaj commented 6 years ago

Hi @mblasi, do you start the agent when your container starts? Try running the same status command after datadog-agent run; it should work.

mblasi commented 6 years ago

Hi @hkaj ,

I tried lots of ways... every one with a different issue.

Now I'm trying to build the client from source. I get system metrics reported (mem, cpu, net), but traces are still not handled (same error as reported here). It looks as if apm.enabled were not true, but it is!

I think the most helpful thing you could tell me is which approach is recommended for my scenario, the GAE Flex environment. (Datadog support told me it is not supported so far, but I think I should be able to start the agent with a custom runtime.)

The 4 ways I'm trying are:

1 - The official one-step installation (https://app.datadoghq.com/account/settings#agent/debian). It looks like it needs systemd, which the GAE Debian Docker image doesn't have.
2 - The official manual installation (https://app.datadoghq.com/account/settings#agent/debian). The service starts and service metrics are reported, but no traces are handled.
3 - Building the agent from source (following https://github.com/DataDog/datadog-agent). The client looks like it started, but nothing is reported.
4 - Running the agent in a different Docker container (https://docs.datadoghq.com/tracing/setup/docker/). I think I would have to create a new service in my GAE project, configure the Docker network in both services, and set the traces hostname in my app service. Not tried yet.

The second approach gets the furthest, but the traces are not handled:

2018/08/19 00:01:25 errors.go:72: Datadog Exporter error: Post http://localhost:8126/v0.3/traces: dial tcp [::1]:8126: connect: connection refused (x2)

My datadog.yaml has apm.enabled=true, so I don't know why the agent is not handling the /traces resource. Could it be related to Docker's internal networking?
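One way to separate "nothing is listening on 8126" from a networking problem is to probe the port from inside the application container. The snippet below is only a sketch: the function names `port_open` and `describe_port` are made up for illustration, and it assumes a netcat (`nc`) that supports `-z` and `-w` is installed in the image.

```shell
# port_open HOST PORT: succeed if a TCP connection can be opened.
# Assumption: a netcat with -z (connect without sending data) and
# -w (timeout in seconds) is available in the container.
port_open() {
    nc -z -w 2 "$1" "$2" >/dev/null 2>&1
}

# describe_port HOST PORT: print a one-line verdict for the endpoint.
describe_port() {
    if port_open "$1" "$2"; then
        echo "$1:$2 is accepting connections"
    else
        echo "$1:$2 refused or timed out: is the trace agent running?"
    fi
}
```

Running `describe_port localhost 8126` in the same container as the application shows whether anything is bound to the trace port. A refusal on localhost points at a process that is not running rather than at Docker networking.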

Before I lose more time switching between these approaches, please let me know which one is right for my scenario.

Regards, Matías.

hkaj commented 6 years ago

@mblasi support is right, GAE Flex is not supported as of today.

If you want to give it a shot, the solution you started with in this issue (# 2 I believe?) is probably your best option. Solution 4 would work as well if you set up networking between app containers and the agent container, but traces, logs, and metrics will have the wrong hostname attached (probably the datadog-agent container ID? I'm not familiar enough with GAE to tell for sure).

I think what was missing in your first message was that the agent was not running. Try running the command I sent earlier, that should start the infra agent. And /opt/datadog-agent/embedded/bin/trace-agent -config /etc/datadog-agent/datadog.yaml should run the trace agent.

mblasi commented 6 years ago

Great. I'll try.

Just to clarify: isn't the trace agent embedded in the datadog-agent? I read that in the datadog-trace-agent project. As far as I understand, if my datadog.yaml has apm.enabled = true, it should start the trace agent, shouldn't it?

mblasi commented 6 years ago

Hi, here is an update:

After running /opt/datadog-agent/embedded/bin/trace-agent -config /etc/datadog-agent/datadog.yaml the traces are now handled! :)

Just two questions:

1 - Should I start it by hand? Isn't it embedded in datadog-agent? (https://github.com/DataDog/datadog-trace-agent says it is included in datadog-agent.)
2 - These two views are still missing data: https://app.datadoghq.com/process?columns=host,process,user,cpu,memory,start&options=normalizeCPU,showArguments&sort=memory,DESC# and https://app.datadoghq.com/containers?columns=container_name,container_cpu,container_memory,container_net_sent_bps,container_net_rcvd_bps,container_status,container_started&options=normalizeCPU&sort=container_memory,DESC# What am I missing?

Here is the datadog-agent status output:

matias@aef-default-20180820t161712-jcgz:~$ docker exec gaeapp datadog-agent status
Getting the status from the agent.

==============
Agent (v6.4.2)
==============

  Status date: 2018-08-20 19:53:21.197366 UTC
  Pid: 7
  Python Version: 
  Logs: 
  Check Runners: 1
  Log Level: info

  Paths
  =====
    Config File: /etc/datadog-agent/datadog.yaml
    conf.d: /etc/datadog-agent/conf.d
    checks.d: /etc/datadog-agent/checks.d

  Clocks
  ======
    NTP offset: -5.1117e-05 s
    System UTC time: 2018-08-20 19:53:21.197366 UTC

  Host Info
  =========
    bootTime: 2018-08-20 19:25:24.000000 UTC
    kernelVersion: 4.9.0-7-amd64
    os: linux
    platform: debian
    platformFamily: debian
    platformVersion: 9.5
    procs: 60
    uptime: 48
    virtualizationRole: guest
    virtualizationSystem: docker

  Hostnames
  =========
    host_aliases: [aef-default-20180820t161712-jcgz.weshipit-today]
    hostname: aef-default-20180820t161712-jcgz.c.weshipit-today.internal
    socket-fqdn: a7f5c51f9c88
    socket-hostname: a7f5c51f9c88

=========
Collector
=========

  Running Checks
  ==============
    cpu
    ---
      Total Runs: 108
      Metric Samples: 6, Total: 642
      Events: 0, Total: 0
      Service Checks: 0, Total: 0
      Average Execution Time : 0ms

    disk (1.2.0)
    ------------
      Total Runs: 108
      Metric Samples: 98, Total: 10584
      Events: 0, Total: 0
      Service Checks: 0, Total: 0
      Average Execution Time : 31ms

    file_handle
    -----------
      Total Runs: 108
      Metric Samples: 1, Total: 108
      Events: 0, Total: 0
      Service Checks: 0, Total: 0
      Average Execution Time : 0ms

    io
    --
      Total Runs: 108
      Metric Samples: 26, Total: 2790
      Events: 0, Total: 0
      Service Checks: 0, Total: 0
      Average Execution Time : 0ms

    load
    ----
      Total Runs: 108
      Metric Samples: 6, Total: 648
      Events: 0, Total: 0
      Service Checks: 0, Total: 0
      Average Execution Time : 0ms

    memory
    ------
      Total Runs: 108
      Metric Samples: 17, Total: 1836
      Events: 0, Total: 0
      Service Checks: 0, Total: 0
      Average Execution Time : 0ms

    network (1.6.0)
    ---------------
      Total Runs: 108
      Metric Samples: 20, Total: 2160
      Events: 0, Total: 0
      Service Checks: 0, Total: 0
      Average Execution Time : 0ms

    ntp
    ---
      Total Runs: 108
      Metric Samples: 1, Total: 108
      Events: 0, Total: 0
      Service Checks: 1, Total: 108
      Average Execution Time : 0ms

    uptime
    ------
      Total Runs: 108
      Metric Samples: 1, Total: 108
      Events: 0, Total: 0
      Service Checks: 0, Total: 0
      Average Execution Time : 0ms

========
JMXFetch
========

  Initialized checks
  ==================
    no checks

  Failed checks
  =============
    no checks

=========
Forwarder
=========

  CheckRunsV1: 108
  Dropped: 0
  DroppedOnInput: 0
  Errors: 0
  Events: 0
  HostMetadata: 0
  IntakeV1: 9
  Metadata: 0
  Requeued: 0
  Retried: 0
  RetryQueueSize: 0
  Series: 0
  ServiceChecks: 0
  SketchSeries: 0
  Success: 225
  TimeseriesV1: 108

  API Keys status
  ===============
    https://6-4-2-app.agent.datadoghq.com,*************************4b2f4: API Key valid

==========
Logs Agent
==========

  Logs Agent is not running

=========
DogStatsD
=========

  Checks Metric Sample: 20715
  Event: 1
  Events Flushed: 1
  Number Of Flushes: 108
  Series Flushed: 12665
  Service Check: 974
  Dogstatsd Metric Sample: 793

mblasi commented 6 years ago

OK, these views are working now. It was also necessary to start /opt/datadog-agent/embedded/bin/process-agent by hand.

Is there any other process to start? datadog-agent, trace-agent, process-agent, is there any list of 'agents' documented? I didn't find it.

Regards, Matías.

hkaj commented 6 years ago

1 - Should I start it by hand? Isn't it embedded in datadog-agent? (https://github.com/DataDog/datadog-trace-agent here says it is included in datadog-agent)

Documentation is not explicit on this, but it's included in the standard packaging of the Datadog Agent. The image we provide takes care of running it if you enable it in the config file, as well as the process agent. Since you're installing the agent manually in your container things are a bit more manual in this case. Note that the logs agent is included in the datadog-agent binary itself so it is started by default. There is no other agent you need to start.
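For a manual install like the one in this thread, the three processes mentioned so far (datadog-agent, trace-agent, process-agent) could be launched from a container entrypoint script. A `RUN` step only executes at image build time, so anything it starts is gone by the time the container runs. The following is a minimal sketch, not an official recipe: the function name `start_agents` and the overridable variables are hypothetical, the commands and paths are the ones quoted above, and process-agent is started without flags because none were given in the thread.

```shell
#!/bin/sh
# Hypothetical entrypoint helper: start the agent processes named in this
# thread in the background. Nothing here restarts a process that exits.

# Commands and paths as quoted in the thread; the variables exist only so
# that they can be overridden.
AGENT_BIN="${AGENT_BIN:-datadog-agent}"
TRACE_AGENT_BIN="${TRACE_AGENT_BIN:-/opt/datadog-agent/embedded/bin/trace-agent}"
PROCESS_AGENT_BIN="${PROCESS_AGENT_BIN:-/opt/datadog-agent/embedded/bin/process-agent}"
DD_CONFIG="${DD_CONFIG:-/etc/datadog-agent/datadog.yaml}"

start_agents() {
    # Infra agent (metrics, checks, DogStatsD).
    "$AGENT_BIN" run >/dev/null 2>&1 &
    # Trace agent (APM receiver on port 8126).
    "$TRACE_AGENT_BIN" -config "$DD_CONFIG" >/dev/null 2>&1 &
    # Process agent (process and container views); no flags, as in the thread.
    "$PROCESS_AGENT_BIN" >/dev/null 2>&1 &
}
```

An entrypoint script would call `start_agents` and then `exec "$@"` so that the application stays the container's main process. Because the agents are plain background jobs, a crash in one of them goes unnoticed; a process supervisor would be the sturdier choice if that matters.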

Seems like your issue is solved, closing it but feel free to comment or reach out to support if you need more help.

Thanks

mblasi commented 6 years ago

Thank you @hkaj

macmule commented 5 years ago

I've hit this myself, just now..

So the "fix" is to manually start parts of the agent?

XinCai commented 3 years ago

I've hit the same issue. I install datadog-agent manually in my Dockerfile. When I SSH into the running container and check the datadog-agent status, it gives me this error message:

Here is the datadog-agent status output:

==========
Logs Agent
==========

  Logs Agent is not running

=========
APM Agent
=========
  Status: Not running or unreachable on localhost:8126.
  Error: Get "http://localhost:8126/debug/vars": dial tcp 127.0.0.1:8126: connect: connection refused

After running this command, apm agent is started and running:

/opt/datadog-agent/embedded/bin/trace-agent -config /etc/datadog-agent/datadog.yaml >/dev/null 2>&1 &

==========
Logs Agent
==========

  Logs Agent is not running

=========
APM Agent
=========
  Status: Running
  Pid: 1748
  Uptime: 3 seconds
  Mem alloc: 7,667,200 bytes
  Hostname: 8667609f1847
  Receiver: localhost:8126
  Endpoints:
    https://trace.agent.datadoghq.com

  Receiver (previous minute)
  ===================================

==========
Aggregator
==========
  Checks Metric Sample: 12,570
  Dogstatsd Metric Sample: 839
  Event: 1
  Events Flushed: 1
  Number Of Flushes: 32
  Series Flushed: 10,110
  Service Check: 325
  Service Checks Flushed: 348
=========
DogStatsD
=========
  Event Packets: 0
  Event Parse Errors: 0
  Metric Packets: 838
  Metric Parse Errors: 0
  Service Check Packets: 62
  Service Check Parse Errors: 0
  Udp Bytes: 157,432
  Udp Packet Reading Errors: 0
  Udp Packets: 338
  Uds Bytes: 0
  Uds Origin Detection Errors: 0
  Uds Packet Reading Errors: 0
  Uds Packets: 0
  Unterminated Metric Errors: 0

You need to start these services manually.
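When the trace agent is launched in the background like this, the application may start sending traces before the receiver is listening. A small wait loop can cover that gap. This is a sketch under assumptions: the function name `wait_for_trace_agent` and the `TRACE_AGENT_URL` variable are invented for illustration, the URL is the `/debug/vars` endpoint that appears in the status error above, and `curl` is assumed to be installed in the image.

```shell
# Endpoint taken from the status output above; overridable for other setups.
TRACE_AGENT_URL="${TRACE_AGENT_URL:-http://localhost:8126/debug/vars}"

# wait_for_trace_agent [TRIES]: poll the endpoint once per second until it
# answers successfully or TRIES attempts (default 10) have failed.
# Assumption: curl is available in the container.
wait_for_trace_agent() {
    tries="${1:-10}"
    while [ "$tries" -gt 0 ]; do
        if curl -fsS -o /dev/null "$TRACE_AGENT_URL" 2>/dev/null; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}
```

For example, `wait_for_trace_agent 30 || echo "trace agent did not come up" >&2` could run right after the background start and before the application is launched.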