elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

Linux docker agent gets Unhealthy on adding linux integration. #2377

Open amolnater-qasource opened 1 year ago

amolnater-qasource commented 1 year ago

Kibana version: 8.7 BC6 Kibana cloud environment

Host OS: Ubuntu 22 ARM64

Build details: VERSION: 8.7 BC6 BUILD: 61051 COMMIT: 04ef24287f26854ad99a46ae983854c6184717cb

Preconditions:

  1. 8.7 BC6 Kibana cloud environment should be available.
  2. Docker setup should be done.

Steps to reproduce:

  1. Install a docker agent using below command:
    sudo docker run \
    --env FLEET_ENROLL=1 \
    --env FLEET_URL=<Fleet Server host URL> \
    --env FLEET_ENROLLMENT_TOKEN=<enrollment token> \
    --rm docker.elastic.co/staging/elastic-agent:8.7.0-a7fb3750
  2. Add the linux integration to this policy and observe that the agent goes Unhealthy.

Expected Result: Docker agent should remain healthy on adding linux integration.

Screen Recording:

https://user-images.githubusercontent.com/77374876/226260250-08be140b-97e5-4f95-a4ff-65581dbeeede.mp4

Logs: elastic-agent-diagnostics-2023-03-16T17-37-58Z-00.zip elastic-agent-diagnostics-2023-03-16T17-43-24Z-00.zip

amolnater-qasource commented 1 year ago

@manishgupta-qasource Please review.

manishgupta-qasource commented 1 year ago

Secondary review for this ticket is Done

jlind23 commented 1 year ago

@amolnater-qasource Looks like we faced some permission issues: {"log.level":"error","@timestamp":"2023-03-16T17:33:32.349Z","message":"Error fetching data for metricset linux.pageinfo: error opening file: open /proc/pagetypeinfo: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"linux/metrics-default","type":"linux/metrics"},"log":{"source":"linux/metrics-default"},"log.origin":{"file.line":256,"file.name":"module/wrapper.go"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}

@fearful-symmetry does it ring a bell or should I ask the obs-service team to look at this specific integration first?

fearful-symmetry commented 1 year ago

@jlind23 There could be a few issues here. The original issue mentions docker, so it's possible that we need to set hostfs correctly and ensure that /proc/pagetypeinfo is mounted into the container as /hostfs/proc/pagetypeinfo. It's also possible that /proc/pagetypeinfo does not exist on this particular OS at all.

jlind23 commented 1 year ago

@amolnater-qasource could you please check what @fearful-symmetry said? On a side note, was this particular docker distribution working before now?

amolnater-qasource commented 1 year ago

Hi @fearful-symmetry @jlind23

Thank you for looking into this issue.

We observed that /proc/pagetypeinfo is present on the VM in use. Could you please confirm how we can check if it is mounted into the container?

Further, this issue was earlier observed during 8.5.0 SNAPSHOT testing, reported under https://github.com/elastic/elastic-agent/issues/1454. However, it was later working fine on 8.6 BC10.

Please let us know if we are missing anything here. Thanks!

jlind23 commented 1 year ago

@amolnater-qasource can't you ssh into this container and see if it is mounted? Are you relying on a different base docker image?
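
As a rough sketch of how the question above could be answered, the check below can be run from a shell inside the container (for example one opened with `docker exec` against a container ID taken from `docker ps`). The helper name `check_path` is hypothetical and not part of Elastic Agent; it only distinguishes a file that is missing from one that exists but cannot be read by the current user.

```shell
# Hypothetical helper (not part of Elastic Agent): report whether a path is
# missing, present but unreadable by the current user, or readable.
check_path() {
  if [ ! -e "$1" ]; then
    echo "missing"
  elif [ ! -r "$1" ]; then
    echo "unreadable"
  else
    echo "readable"
  fi
}

# Example calls, comparing the container's own procfs with the hostfs mount:
#   check_path /proc/pagetypeinfo
#   check_path /hostfs/proc/pagetypeinfo
```

A result of "unreadable" would point to a permissions problem rather than a missing mount, while "missing" on the /hostfs path would suggest the bind mount was never set up.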

fearful-symmetry commented 1 year ago

Ah, brain skipped a beat, just noticed that it's actually a permissions error: /proc/pagetypeinfo: permission denied

I'm fairly certain that pagetypeinfo is one of those procfs files that looks the same inside the container as on the host, so it isn't strictly necessary to mount it: you can read /proc/pagetypeinfo from within the container to monitor the host. The permission error is a bit odd, though. Since the original issue mentions docker, my assumption is that there's a docker setup issue, and the metricbeat instance running in docker somehow doesn't have the proper permissions or isn't running as root.

amolnater-qasource commented 1 year ago

Hi @jlind23

For testing the docker agent we followed below steps:

  1. Setup Ubuntu 22.04 ARM architecture VM.
  2. Installed Docker on the VM.
  3. Directly ran below command:
    sudo docker run \
    --env FLEET_ENROLL=1 \
    --env FLEET_URL=<Fleet Server host URL> \
    --env FLEET_ENROLLMENT_TOKEN=<enrollment token> \
    --rm docker.elastic.co/staging/elastic-agent:8.7.0-a7fb3750

So, as per our understanding we aren't creating any new container for this and we are using this docker image for installing an agent.

Please let us know if we are missing anything here. Thanks

jlind23 commented 1 year ago

@fearful-symmetry would be great to have your eyes on this as soon as you have time to make sure this is not a regression we introduced in metricbeat.

amolnater-qasource commented 1 year ago

Hi Team,

We have revalidated this issue on latest 8.8 BC6 Kibana cloud environment and found it still reproducible.

Observations:

Screenshot: image

Logs: elastic-agent-diagnostics-2023-05-19T08-18-10Z-00.zip

Build details:

VERSION: 8.8.0 BC6 Kibana cloud environment
BUILD: 63115
COMMIT: a4c256b39f7d1ee34abe61109a817ec7f5329009
Docker artifact: docker.elastic.co/staging/elastic-agent:8.8.0-375abdf7

Please let us know if anything else is required from our end.

Thanks!

cmacknz commented 1 year ago

This is a new error in the system metrics input:

- id: system/metrics-default
  state:
    state: 2
    message: 'Healthy: communicating with pid ''32'''
    units:
      ? unittype: 0
        unitid: system/metrics-default-system/metrics-system-aa6c87f0-f61c-11ed-b6d2-0b368c0c212a
      : state: 4
        message: '[failed to reloading inputs: 2 errors: Error creating runner from
          config: 1 error: error connecting to dbus: dial unix /var/run/dbus/system_bus_socket:
          connect: no such file or directory; Error creating runner from config: 1
          error: error connecting to dbus: error getting connection to system bus:
          dial unix /var/run/dbus/system_bus_socket: connect: no such file or directory]'

cmacknz commented 1 year ago

@amolnater-qasource can you try to reproduce? I want to see if this error happens every time or is intermittent to assess the severity of the problem.

fearful-symmetry commented 1 year ago

@cmacknz normally that error would be thrown by the linux/users or linux/services metricsets on systems that don't support dbus. Do we know if this is running on a supported OS?

amolnater-qasource commented 1 year ago

Hi @cmacknz

Thank you for looking into this.

The issue is reproducible every time the linux integration with all datasets enabled is added to the agent policy.

Agents: Docker Agent

Host OS's:

Build details:

VERSION: 8.8 BC8 Kibana cloud environment
BUILD: 63142
COMMIT: 2973fcc10d985e4ab94e5eeef976aad0046c6cce

Logs: elastic-agent-diagnostics-2023-05-24T06-05-09Z-00.zip

Please let us know if anything else is required from our end. cc: @fearful-symmetry

Thanks!

cmacknz commented 1 year ago

@fearful-symmetry yes this is supported, we support both Ubuntu 22 and Google container optimized OS on ARM64 per https://www.elastic.co/support/matrix

As of 7.16+ releases, we support aarch64 on Linux with the same set of distributions as x86_64

Raising priority, adding to the next sprint since this happens every time.

fearful-symmetry commented 1 year ago

Going to look into this more tomorrow, but what I think is happening is that because we're running in a container, the dbus socket for the host isn't reachable inside the container. Pretty sure there's an environment variable we can set that's used by the coreos libraries. I don't think this is documented anywhere, which is a bit of a problem.

jlind23 commented 1 year ago

Thanks @fearful-symmetry for looking into this. If your assumption is right, putting up a doc PR would definitely be enough for this.

fearful-symmetry commented 1 year ago

@amolnater-qasource Can you try:

amolnater-qasource commented 1 year ago

Hi @fearful-symmetry

Thank you for sharing the details over Slack and helping us revalidate this.

Please find below the details of the attempted test. On running the command below:

sudo docker run \
--env FLEET_ENROLL=1 \
--env FLEET_URL=************************** \
--env FLEET_ENROLLMENT_TOKEN=************************** \
--env DBUS_SYSTEM_BUS_ADDRESS='unix:path=/hostfs/var/run/dbus/system_bus_socket' \
--mount type=bind,source=/sys/fs/cgroup,target=/hostfs/sys/fs/cgroup,readonly \
--mount type=bind,source=/proc,target=/hostfs/proc,readonly \
--volume /var/run/dbus/system_bus_socket:/hostfs/var/run/dbus/system_bus_socket \
--rm docker.elastic.co/beats/elastic-agent:8.9.0-3cc641a9-SNAPSHOT

We observed that the installed agent is Unhealthy and had the errors below: (four screenshots attached)

Agent Logs: elastic-agent-diagnostics-2023-05-30T17-36-46Z-00.zip

Please let us know if anything else is required from our end. Thanks!

fearful-symmetry commented 1 year ago

Update while I look into this: I think there's some kind of formatting issue with the env var happening between the --env command in docker, or I'm just confused by how the dbus library works. Will investigate further.

fearful-symmetry commented 1 year ago

Alright, found the issue, extremely dumb bug. There are two different versions of the godbus/dbus library at work, one we're using directly and another that was imported by another library we're using. They use two different formats for the DBUS_SYSTEM_BUS_ADDRESS, so either format would just break at different points.
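
To illustrate the mismatch: judging by the values that appear in this thread, one form of the variable is a D-Bus address (`unix:path=...`) and the other is a bare socket path. The sketch below reduces either form to a bare path, assuming only those two forms are in play. `dbus_socket_path` is a hypothetical helper written for this illustration; it is not taken from the linked PR and may not reflect how the fix works.

```shell
# Hypothetical helper: accept either "unix:path=/some/socket" or a bare
# "/some/socket" and print the bare socket path.
dbus_socket_path() {
  case "$1" in
    unix:path=*) printf '%s\n' "${1#unix:path=}" ;;  # strip the address prefix
    *)           printf '%s\n' "$1" ;;               # already a bare path
  esac
}

# Example:
#   dbus_socket_path "$DBUS_SYSTEM_BUS_ADDRESS"
```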

Fix is here: https://github.com/elastic/beats/pull/35618

amolnater-qasource commented 1 year ago

Hi Team,

We have revalidated this issue on latest 8.9.0 BC3 Kibana cloud environment and found it still reproducible.

Observations:

Build details: VERSION: 8.9.0 BC3 BUILD: 64584 COMMIT: fc463b96275c55dc44524f79f617b0026b7f8667

docker run \
--env FLEET_ENROLL=1 \
--env FLEET_URL=***********************3 \
--env FLEET_ENROLLMENT_TOKEN=************************== \
--env ELASTIC_AGENT_TAGS=docker,qa \
--rm docker.elastic.co/staging/elastic-agent:8.9.0-0d830bd0

Screen Recording:

https://github.com/elastic/elastic-agent/assets/77374876/7385aed6-61e1-4a48-b6ec-5db30062104a

https://github.com/elastic/elastic-agent/assets/77374876/41f687dc-ad3f-49e5-bc57-ed7bc9939cd3

Logs: elastic-agent-diagnostics-2023-07-11T05-37-28Z-00.zip

Hence, we are reopening this issue. Thanks!

pierrehilbert commented 1 year ago

@fearful-symmetry could you please have a look?

cmacknz commented 1 year ago

Seems like this is dbus again:

- id: system/metrics-default
  state:
    state: 2
    message: 'Healthy: communicating with pid ''31'''
    units:
      ? unittype: 0
        unitid: system/metrics-default-system/metrics-system-331804e9-c84e-40e0-beae-805672378572
      : state: 4
        message: '[failed to reload inputs: 2 errors: Error creating runner from config:
          1 error: error connecting to dbus: dial unix /var/run/dbus/system_bus_socket:
          connect: no such file or directory; Error creating runner from config: 1
          error: error connecting to dbus: error getting connection to system bus:
          dial unix /var/run/dbus/system_bus_socket: connect: no such file or directory]'
      ? unittype: 0

https://github.com/elastic/beats/pull/35618 was supposed to fix this I believe.

fearful-symmetry commented 1 year ago

@amolnater-qasource is that the exact docker command? If you're using the dbus-related metricsets you need to add --volume /var/run/dbus/system_bus_socket:/hostfs/var/run/dbus/system_bus_socket as well as set the DBUS_SYSTEM_BUS_ADDRESS env var to /hostfs/var/run/dbus/system_bus_socket.

I suspect this isn't well documented; going to hunt around the system docs and see if I can find where we should put this.
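
Until that is documented, here is a minimal sketch of the two dbus-related flags as a reusable snippet. The function name `dbus_docker_flags` is hypothetical. The env var value uses the `unix:path=` form that appears in the docker commands elsewhere in this thread; whether the bare path also works may depend on the Beats version. The function only prints the flags and does not run docker.

```shell
# Hypothetical helper: print the docker flags that expose the host's dbus
# socket under /hostfs. Assumes the socket path contains no spaces.
dbus_docker_flags() {
  sock="${1:-/var/run/dbus/system_bus_socket}"
  printf '%s\n' \
    "--volume ${sock}:/hostfs${sock}" \
    "--env DBUS_SYSTEM_BUS_ADDRESS=unix:path=/hostfs${sock}"
}

# Example (unquoted on purpose so the flags split into separate words):
#   docker run $(dbus_docker_flags) <other flags> <image>
```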

fearful-symmetry commented 1 year ago

Alright, tested with

docker run --volume=$(pwd)/metricbeat.reference.yml:/usr/share/metricbeat/metricbeat.yml \
--mount type=bind,source=/proc,target=/hostfs/proc,readonly \
--mount type=bind,source=/sys/fs/cgroup,target=/hostfs/sys/fs/cgroup,readonly \
--mount type=bind,source=/,target=/hostfs,readonly \
--volume /var/run/dbus/system_bus_socket:/hostfs/var/run/dbus/system_bus_socket \
--env DBUS_SYSTEM_BUS_ADDRESS='unix:path=/hostfs/var/run/dbus/system_bus_socket' \
--net=host docker.elastic.co/beats/metricbeat:8.9.0-SNAPSHOT -e --system.hostfs=/hostfs

Seems to work fine.

jlind23 commented 1 year ago

Closing this as fixed then, and I approved your doc PR. @amolnater-qasource can we make sure the test case is updated with this command?

amolnater-qasource commented 1 year ago

Hi @fearful-symmetry @jlind23

Thank you for the confirmation and adding the docs.

We have re-attempted to install the agent on docker with the updated commands below:

First:

docker run \
--env FLEET_ENROLL=1 \
--env FLEET_URL=https://49a4c592f08bxxxxxxxxxxxxxxxxxp.cloud.es.io:443 \
--env FLEET_ENROLLMENT_TOKEN=RUlOcFE0a0JjexxxxxxxxxxxxxxxxxxxxU4Nk82NVZWZw== \
--env ELASTIC_AGENT_TAGS=docker,qa \
--volume /var/run/dbus/system_bus_socket:/hostfs/var/run/dbus/system_bus_socket \
--env DBUS_SYSTEM_BUS_ADDRESS='unix:path=/hostfs/var/run/dbus/system_bus_socket' \
--rm docker.elastic.co/staging/elastic-agent:8.9.0-0d830bd0

Second:

docker run \
--env FLEET_ENROLL=1 \
--env FLEET_URL=https://49axxxxxxxxxxxxxxxxxxxoud.es.io:443 \
--env FLEET_ENROLLMENT_TOKEN=RUlOcFE0a0JjeVBfekw4dEFxxxxxxxxxxxx2NVZWZw== \
--env ELASTIC_AGENT_TAGS=docker,qa \
--mount type=bind,source=/proc,target=/hostfs/proc,readonly \
--mount type=bind,source=/sys/fs/cgroup,target=/hostfs/sys/fs/cgroup,readonly  \
--mount type=bind,source=/,target=/hostfs,readonly \
--volume /var/run/dbus/system_bus_socket:/hostfs/var/run/dbus/system_bus_socket \
--env DBUS_SYSTEM_BUS_ADDRESS='unix:path=/hostfs/var/run/dbus/system_bus_socket' \
--rm docker.elastic.co/staging/elastic-agent:8.9.0-0d830bd0

Screen Recording:

https://github.com/elastic/elastic-agent/assets/77374876/d2777921-11ca-487f-91a6-90bc80db792e

For troubleshooting we also tried adding below config to linux integration. image

However, the agent still remained Unhealthy.

Logs: elastic-agent-diagnostics-2023-07-12T04-46-06Z-00.zip

Please let us know if we are missing anything here.

Thank you

fearful-symmetry commented 1 year ago

A little baffled by this, since I'm seeing tons of errors that seem to suggest that the hostfs flag is set, but the actual directory isn't mounted in:

network io counters: open /hostfs/proc/net/dev: no such file or directory
disk io counters: open /hostfs/proc/diskstats
disk io counters: open /hostfs/proc/diskstats: no such file or directory
error getting entropy: error reading from random: open /hostfs/proc/sys/kernel/random/entropy_avail: no such file or directory

We might want to take care to create the policy with hostfs set first, then run the agent in docker with the proper mounts, and see what happens, or at least collect another diagnostic bundle if it continues to not work.
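
A quick way to confirm whether the mounts are actually in place, sketched under the assumption that the check is run from a shell inside the running container. `missing_hostfs_paths` is a hypothetical helper; the three relative paths are simply the ones named in the errors above.

```shell
# Hypothetical helper: list the procfs files named in the errors above that
# are absent under a given hostfs root (default: /hostfs).
missing_hostfs_paths() {
  root="${1:-/hostfs}"
  for rel in proc/net/dev proc/diskstats proc/sys/kernel/random/entropy_avail; do
    # Print the full path only when the file does not exist.
    [ -e "$root/$rel" ] || echo "$root/$rel"
  done
}

# Example:
#   missing_hostfs_paths /hostfs
```

Empty output would mean all three files are visible under the hostfs root; any printed path suggests the corresponding bind mount is not present in that container.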

amolnater-qasource commented 1 year ago

Hi @fearful-symmetry

Thank you for looking into this again. Yes, we have added hostfs to the policy first and then run the agent in docker.

For getting the logs we have reattempted with two different sets of commands for running the agent:

First Command:

docker run \
--env FLEET_ENROLL=1 \
--env FLEET_URL=https://49axxxxxxxxxxxxxxxxxxxoud.es.io:443 \
--env FLEET_ENROLLMENT_TOKEN=RUlOcFE0a0JjeVBfekw4dEFxxxxxxxxxxxx2NVZWZw== \
--env ELASTIC_AGENT_TAGS=docker,qa \
--mount type=bind,source=/proc,target=/hostfs/proc,readonly \
--mount type=bind,source=/sys/fs/cgroup,target=/hostfs/sys/fs/cgroup,readonly  \
--mount type=bind,source=/,target=/hostfs,readonly \
--volume /var/run/dbus/system_bus_socket:/hostfs/var/run/dbus/system_bus_socket \
--env DBUS_SYSTEM_BUS_ADDRESS='unix:path=/hostfs/var/run/dbus/system_bus_socket' \
--rm docker.elastic.co/staging/elastic-agent:8.9.0-0d830bd0

Debug Logs for this agent are: elastic-agent-diagnostics-2023-07-13T04-28-36Z-00.zip

Second Command:

docker run \
--env FLEET_ENROLL=1 \
--env FLEET_URL=https://49a4c592f08bxxxxxxxxxxxxxxxxxp.cloud.es.io:443 \
--env FLEET_ENROLLMENT_TOKEN=RUlOcFE0a0JjexxxxxxxxxxxxxxxxxxxxU4Nk82NVZWZw== \
--env ELASTIC_AGENT_TAGS=docker,qa \
--volume /var/run/dbus/system_bus_socket:/hostfs/var/run/dbus/system_bus_socket \
--env DBUS_SYSTEM_BUS_ADDRESS='unix:path=/hostfs/var/run/dbus/system_bus_socket' \
--rm docker.elastic.co/staging/elastic-agent:8.9.0-0d830bd0

Agent logs for this agent are: elastic-agent-diagnostics-2023-07-13T06-24-57Z-00.zip

Screenshot: image

Please let us know if we are missing anything here.

Thanks!

fearful-symmetry commented 1 year ago

Ah, there we go:

{"log.level":"error","@timestamp":"2023-07-13T04:19:27.011Z","message":"Error creating runner from config: 1 error: error connecting to dbus: error in Hello: An AppArmor policy prevents this sender from sending this message to this recipient; type=\"method_call\", sender=\"(null)\" (inactive) interface=\"org.freedesktop.DBus\" member=\"Hello\" error name=\"(unset)\" requested_reply=\"0\" destination=\"org.freedesktop.DBus\" (bus)","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"log.origin":{"file.line":138,"file.name":"cfgfile/list.go"},"service.name":"metricbeat","ecs.version":"1.6.0","log.logger":"centralmgmt","ecs.version":"1.6.0"}

It looks like AppArmor is stopping the dbus Hello message, which isn't something I think I've ever seen before. @amolnater-qasource can you tell me precisely what ubuntu release this is so I can try and document some kind of workaround? The output of uname -a should be enough.
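
Since several distinct failures have shown up in this thread, a rough triage sketch may help when reading future diagnostics. `classify_agent_error` and its category labels are hypothetical and were invented for this illustration; the patterns are substrings of the error messages quoted earlier in the thread, so other wordings will fall through to "unknown".

```shell
# Hypothetical helper: map an error message to one of the failure modes seen
# in this thread. The dbus pattern is checked before the hostfs one so that a
# missing socket under /hostfs is still reported as a dbus problem.
classify_agent_error() {
  case "$1" in
    *"AppArmor policy prevents"*)                      echo "apparmor" ;;
    *"system_bus_socket"*"no such file or directory"*) echo "dbus-socket-missing" ;;
    *"/hostfs/"*"no such file or directory"*)          echo "hostfs-not-mounted" ;;
    *"permission denied"*)                             echo "permission-denied" ;;
    *)                                                 echo "unknown" ;;
  esac
}

# Example:
#   classify_agent_error "open /proc/pagetypeinfo: permission denied"
```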

amolnater-qasource commented 1 year ago

Hi @fearful-symmetry

Please find below exact host details: image

Further it is deployed from AWS- Ubuntu 22.04 with ARM64 architecture. image

Please let us know if anything else is required from our end.

Thanks!

fearful-symmetry commented 1 year ago

Huzzah, was able to reproduce this. Interestingly, this only seems to happen with docker, which is probably why we haven't seen this before.

fearful-symmetry commented 1 year ago

So, we can temporarily work around this by adding --security-opt apparmor=unconfined to the beginning of the docker run:

docker run --security-opt apparmor=unconfined \
--volume=$(pwd)/metricbeat.yml:/usr/share/metricbeat/metricbeat.yml \
--mount type=bind,source=/proc,target=/hostfs/proc,readonly \
--mount type=bind,source=/sys/fs/cgroup,target=/hostfs/sys/fs/cgroup,readonly \
--mount type=bind,source=/,target=/hostfs,readonly \
--volume /var/run/dbus/system_bus_socket:/hostfs/var/run/dbus/system_bus_socket \
--env DBUS_SYSTEM_BUS_ADDRESS='unix:path=/hostfs/var/run/dbus/system_bus_socket' \
--net=host docker.elastic.co/beats/metricbeat:8.9.0-SNAPSHOT -e --system.hostfs=/hostfs

This doesn't seem like the best solution, and I'd like to come up with a more targeted AppArmor profile.

jlind23 commented 4 months ago

@amolnater-qasource Is this still an issue you face?

amolnater-qasource commented 4 months ago

Hi @jlind23

We have revalidated this issue on the latest 8.14.0 BC5 Kibana cloud environment and found it still reproducible with the actual command:

docker run \
--env FLEET_ENROLL=1 \
--env FLEET_URL=https://<url>cloud.com:443 \
--env FLEET_ENROLLMENT_TOKEN=Q<token>9DUQ== \
--env ELASTIC_AGENT_TAGS=docker,qa \
--rm docker.elastic.co/staging/elastic-agent:8.14.0-eeda34a5

Observations:

Agent Logs: elastic-agent-diagnostics-2024-05-28T08-53-03Z-00.zip

Screenshot: image

We were expecting this to be fixed as per https://github.com/elastic/elastic-agent/issues/2377#issuecomment-1642673298

Please let us know if anything else is required from our end.

Thanks!

elasticmachine commented 4 months ago

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

cmacknz commented 4 months ago

Yes this is the same error originally detected in https://github.com/elastic/elastic-agent/issues/2377#issuecomment-1559432181.