amolnater-qasource opened this issue 1 year ago (status: Open)
@manishgupta-qasource Please review.
Secondary review for this ticket is done.
@amolnater-qasource Looks like we faced some permission issues:
{"log.level":"error","@timestamp":"2023-03-16T17:33:32.349Z","message":"Error fetching data for metricset linux.pageinfo: error opening file: open /proc/pagetypeinfo: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"linux/metrics-default","type":"linux/metrics"},"log":{"source":"linux/metrics-default"},"log.origin":{"file.line":256,"file.name":"module/wrapper.go"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
@fearful-symmetry does it ring a bell or should I ask the obs-service team to look at this specific integration first?
@jlind23 There could be a few issues here. The original issue mentions docker, so it's possible that we need to set hostfs correctly and ensure that /proc/pagetypeinfo is mounted into the container as /hostfs/proc/pagetypeinfo. It's also possible that /proc/pagetypeinfo does not exist on this particular OS at all.
@amolnater-qasource could you please check what @fearful-symmetry said? On a side note, was this particular docker distribution working before now?
Hi @fearful-symmetry @jlind23
Thank you for looking into this issue.
We observed that /proc/pagetypeinfo is set up on the VM we are using. Could you please tell us how we can check whether it is mounted into the container?
This issue was previously observed during 8.5.0 SNAPSHOT testing and reported under https://github.com/elastic/elastic-agent/issues/1454. However, it later worked fine on 8.6 BC10.
Please let us know if we are missing anything here. Thanks!
@amolnater-qasource can't you ssh into this container and check whether it is mounted? Are you relying on a different base docker image?
Ah, my brain skipped a beat; I just noticed that it's actually a permissions error: /proc/pagetypeinfo: permission denied
I'm fairly certain that pagetypeinfo is one of those procfs files that's the same as the host's from within the container. That means it's not strictly necessary to mount it, and you can read /proc/pagetypeinfo from within the container to monitor the host, but the permission error is a bit odd. Since the original issue mentions docker, my assumption is that there's a docker setup issue, and the metricbeat instance running in docker somehow doesn't have the proper permissions, or isn't running as root.
Hi @jlind23
To test the docker agent, we followed the steps below:
sudo docker run \
--env FLEET_ENROLL=1 \
--env FLEET_URL=<Fleet Server host URL> \
--env FLEET_ENROLLMENT_TOKEN=<enrollment token> \
--rm docker.elastic.co/staging/elastic-agent:8.7.0-a7fb3750
As per our understanding, we aren't creating any new container for this; we are using this docker image to install an agent.
Please let us know if we are missing anything here. Thanks!
@fearful-symmetry would be great to have your eyes on this as soon as you have time to make sure this is not a regression we introduced in metricbeat.
Hi Team,
We have revalidated this issue on the latest 8.8 BC6 Kibana cloud environment and found that it is still reproducible.
Observations:
Logs: elastic-agent-diagnostics-2023-05-19T08-18-10Z-00.zip
Build details:
VERSION: 8.8.0 BC6 Kibana cloud environment
BUILD: 63115
COMMIT: a4c256b39f7d1ee34abe61109a817ec7f5329009
Docker artifact: --rm docker.elastic.co/staging/elastic-agent:8.8.0-375abdf7
Please let us know if anything else is required from our end.
Thanks!
This is a new error in the system metrics input:
- id: system/metrics-default
  state:
    state: 2
    message: 'Healthy: communicating with pid ''32'''
    units:
      ? unittype: 0
        unitid: system/metrics-default-system/metrics-system-aa6c87f0-f61c-11ed-b6d2-0b368c0c212a
      : state: 4
        message: '[failed to reloading inputs: 2 errors: Error creating runner from
          config: 1 error: error connecting to dbus: dial unix /var/run/dbus/system_bus_socket:
          connect: no such file or directory; Error creating runner from config: 1
          error: error connecting to dbus: error getting connection to system bus:
          dial unix /var/run/dbus/system_bus_socket: connect: no such file or directory]'
@amolnater-qasource can you try to reproduce? I want to see if this error happens every time or is intermittent to assess the severity of the problem.
@cmacknz Normally that error would be thrown by the linux/users or linux/services metricsets on systems that don't support dbus. Do we know if this is running on a supported OS?
Hi @cmacknz
Thank you for looking into this.
The issue is reproducible every time the linux integration with all datasets enabled is added to the agent policy.
Agents: Docker Agent
Host OS's:
Build details:
VERSION: 8.8 BC8 Kibana cloud environment
BUILD: 63142
COMMIT: 2973fcc10d985e4ab94e5eeef976aad0046c6cce
Logs: elastic-agent-diagnostics-2023-05-24T06-05-09Z-00.zip
Please let us know if anything else is required from our end. cc: @fearful-symmetry
Thanks!
@fearful-symmetry yes this is supported, we support both Ubuntu 22 and Google container optimized OS on ARM64 per https://www.elastic.co/support/matrix
As of 7.16+ releases, we support aarch64 on Linux with the same set of distributions as x86_64
Raising priority, adding to the next sprint since this happens every time.
Going to look into this more tomorrow, but what I think is happening is that because we're running in a container, the dbus socket for the host isn't reachable inside the container. Pretty sure there's an environment variable we can set that's used by the coreos libraries. I don't think this is documented anywhere, which is a bit of a problem.
Thanks @fearful-symmetry for looking into this. If your assumption is right, a docs PR would definitely be enough for this.
@amolnater-qasource Can you try mounting the /var/run/dbus/system_bus_socket socket into the agent container under test as /hostfs/var/run/dbus/system_bus_socket, and setting the DBUS_SYSTEM_BUS_ADDRESS environment variable to the above mount point?
Hi @fearful-symmetry
Thank you for sharing the details over Slack and helping us revalidate this.
Please find the details of the attempted test below. We ran the following command:
sudo docker run \
--env FLEET_ENROLL=1 \
--env FLEET_URL=************************** \
--env FLEET_ENROLLMENT_TOKEN=************************** \
--env DBUS_SYSTEM_BUS_ADDRESS='unix:path=/hostfs/var/run/dbus/system_bus_socket' \
--mount type=bind,source=/sys/fs/cgroup,target=/hostfs/sys/fs/cgroup,readonly \
--mount type=bind,source=/proc,target=/hostfs/proc,readonly \
--volume /var/run/dbus/system_bus_socket:/hostfs/var/run/dbus/system_bus_socket \
--rm docker.elastic.co/beats/elastic-agent:8.9.0-3cc641a9-SNAPSHOT
We observed that the installed agent was Unhealthy with the errors below:
Agent Logs: elastic-agent-diagnostics-2023-05-30T17-36-46Z-00.zip
Please let us know if anything else is required from our end. Thanks!
Update while I look into this: I think there's some kind of formatting issue with the env var being passed through the --env flag in docker, or I'm just confused by how the dbus library works. Will investigate further.
Alright, found the issue; it's an extremely dumb bug. There are two different versions of the godbus/dbus library at work: one we use directly and another that was imported by another library we depend on. They use two different formats for DBUS_SYSTEM_BUS_ADDRESS, so either format would just break at different points.
Fix is here: https://github.com/elastic/beats/pull/35618
Hi Team,
We have revalidated this issue on the latest 8.9.0 BC3 Kibana cloud environment and found that it is still reproducible.
Observations:
Build details: VERSION: 8.9.0 BC3 BUILD: 64584 COMMIT: fc463b96275c55dc44524f79f617b0026b7f8667
docker run \
--env FLEET_ENROLL=1 \
--env FLEET_URL=***********************3 \
--env FLEET_ENROLLMENT_TOKEN=************************== \
--env ELASTIC_AGENT_TAGS=docker,qa \
--rm docker.elastic.co/staging/elastic-agent:8.9.0-0d830bd0
Screen Recording:
https://github.com/elastic/elastic-agent/assets/77374876/7385aed6-61e1-4a48-b6ec-5db30062104a
https://github.com/elastic/elastic-agent/assets/77374876/41f687dc-ad3f-49e5-bc57-ed7bc9939cd3
Logs: elastic-agent-diagnostics-2023-07-11T05-37-28Z-00.zip
Hence, we are reopening this issue. Thanks!
@fearful-symmetry could you please have a look?
Seems like this is dbus again:
- id: system/metrics-default
  state:
    state: 2
    message: 'Healthy: communicating with pid ''31'''
    units:
      ? unittype: 0
        unitid: system/metrics-default-system/metrics-system-331804e9-c84e-40e0-beae-805672378572
      : state: 4
        message: '[failed to reload inputs: 2 errors: Error creating runner from config:
          1 error: error connecting to dbus: dial unix /var/run/dbus/system_bus_socket:
          connect: no such file or directory; Error creating runner from config: 1
          error: error connecting to dbus: error getting connection to system bus:
          dial unix /var/run/dbus/system_bus_socket: connect: no such file or directory]'
      ? unittype: 0
https://github.com/elastic/beats/pull/35618 was supposed to fix this I believe.
@amolnater-qasource is that the exact docker command? If you're using the dbus-related metricsets, you need to add --volume /var/run/dbus/system_bus_socket:/hostfs/var/run/dbus/system_bus_socket as well as set the DBUS_SYSTEM_BUS_ADDRESS env var to point at that socket (unix:path=/hostfs/var/run/dbus/system_bus_socket).
I suspect this isn't well documented; going to hunt around the system docs and see if I can find where we should put this.
Alright, tested with
docker run --volume=$(pwd)/metricbeat.reference.yml:/usr/share/metricbeat/metricbeat.yml \
--mount type=bind,source=/proc,target=/hostfs/proc,readonly \
--mount type=bind,source=/sys/fs/cgroup,target=/hostfs/sys/fs/cgroup,readonly \
--mount type=bind,source=/,target=/hostfs,readonly \
--volume /var/run/dbus/system_bus_socket:/hostfs/var/run/dbus/system_bus_socket \
--env DBUS_SYSTEM_BUS_ADDRESS='unix:path=/hostfs/var/run/dbus/system_bus_socket' \
--net=host docker.elastic.co/beats/metricbeat:8.9.0-SNAPSHOT -e --system.hostfs=/hostfs
Seems to work fine.
Closing this as fixed then; I approved your doc PR. @amolnater-qasource can we make sure the test case is updated with this command?
Hi @fearful-symmetry @jlind23
Thank you for the confirmation and for adding the docs.
We re-attempted installing the agent on docker with the updated commands below.
First:
docker run \
--env FLEET_ENROLL=1 \
--env FLEET_URL=https://49a4c592f08bxxxxxxxxxxxxxxxxxp.cloud.es.io:443 \
--env FLEET_ENROLLMENT_TOKEN=RUlOcFE0a0JjexxxxxxxxxxxxxxxxxxxxU4Nk82NVZWZw== \
--env ELASTIC_AGENT_TAGS=docker,qa \
--volume /var/run/dbus/system_bus_socket:/hostfs/var/run/dbus/system_bus_socket \
--env DBUS_SYSTEM_BUS_ADDRESS='unix:path=/hostfs/var/run/dbus/system_bus_socket' \
--rm docker.elastic.co/staging/elastic-agent:8.9.0-0d830bd0
Second:
docker run \
--env FLEET_ENROLL=1 \
--env FLEET_URL=https://49axxxxxxxxxxxxxxxxxxxoud.es.io:443 \
--env FLEET_ENROLLMENT_TOKEN=RUlOcFE0a0JjeVBfekw4dEFxxxxxxxxxxxx2NVZWZw== \
--env ELASTIC_AGENT_TAGS=docker,qa \
--mount type=bind,source=/proc,target=/hostfs/proc,readonly \
--mount type=bind,source=/sys/fs/cgroup,target=/hostfs/sys/fs/cgroup,readonly \
--mount type=bind,source=/,target=/hostfs,readonly \
--volume /var/run/dbus/system_bus_socket:/hostfs/var/run/dbus/system_bus_socket \
--env DBUS_SYSTEM_BUS_ADDRESS='unix:path=/hostfs/var/run/dbus/system_bus_socket' \
--rm docker.elastic.co/staging/elastic-agent:8.9.0-0d830bd0
Screen Recording:
https://github.com/elastic/elastic-agent/assets/77374876/d2777921-11ca-487f-91a6-90bc80db792e
For troubleshooting, we also tried adding the config below to the linux integration.
However, the agent still remained Unhealthy.
Logs: elastic-agent-diagnostics-2023-07-12T04-46-06Z-00.zip
Please let us know if we are missing anything here.
Thank you
A little baffled by this, since I'm seeing tons of errors that seem to suggest that the hostfs flag is set, but the actual directory isn't mounted in:
network io counters: open /hostfs/proc/net/dev: no such file or directory
disk io counters: open /hostfs/proc/diskstats: no such file or directory
error getting entropy: error reading from random: open /hostfs/proc/sys/kernel/random/entropy_avail: no such file or directory
We might want to take care to create the policy with hostfs set first, then run the agent in docker with the proper mounts, and see what happens, or at least collect another diagnostic bundle if it continues to not work.
Hi @fearful-symmetry
Thank you for looking into this again. Yes, we added hostfs to the policy first and then ran the agent in docker.
To collect the logs, we re-attempted with two different sets of commands for running the agent.
First Command:
docker run \
--env FLEET_ENROLL=1 \
--env FLEET_URL=https://49axxxxxxxxxxxxxxxxxxxoud.es.io:443 \
--env FLEET_ENROLLMENT_TOKEN=RUlOcFE0a0JjeVBfekw4dEFxxxxxxxxxxxx2NVZWZw== \
--env ELASTIC_AGENT_TAGS=docker,qa \
--mount type=bind,source=/proc,target=/hostfs/proc,readonly \
--mount type=bind,source=/sys/fs/cgroup,target=/hostfs/sys/fs/cgroup,readonly \
--mount type=bind,source=/,target=/hostfs,readonly \
--volume /var/run/dbus/system_bus_socket:/hostfs/var/run/dbus/system_bus_socket \
--env DBUS_SYSTEM_BUS_ADDRESS='unix:path=/hostfs/var/run/dbus/system_bus_socket' \
--rm docker.elastic.co/staging/elastic-agent:8.9.0-0d830bd0
Debug Logs for this agent are: elastic-agent-diagnostics-2023-07-13T04-28-36Z-00.zip
Second Command:
docker run \
--env FLEET_ENROLL=1 \
--env FLEET_URL=https://49a4c592f08bxxxxxxxxxxxxxxxxxp.cloud.es.io:443 \
--env FLEET_ENROLLMENT_TOKEN=RUlOcFE0a0JjexxxxxxxxxxxxxxxxxxxxU4Nk82NVZWZw== \
--env ELASTIC_AGENT_TAGS=docker,qa \
--volume /var/run/dbus/system_bus_socket:/hostfs/var/run/dbus/system_bus_socket \
--env DBUS_SYSTEM_BUS_ADDRESS='unix:path=/hostfs/var/run/dbus/system_bus_socket' \
--rm docker.elastic.co/staging/elastic-agent:8.9.0-0d830bd0
Agent logs for this agent are: elastic-agent-diagnostics-2023-07-13T06-24-57Z-00.zip
Please let us know if we are missing anything here.
Thanks!
Ah, there we go:
{"log.level":"error","@timestamp":"2023-07-13T04:19:27.011Z","message":"Error creating runner from config: 1 error: error connecting to dbus: error in Hello: An AppArmor policy prevents this sender from sending this message to this recipient; type=\"method_call\", sender=\"(null)\" (inactive) interface=\"org.freedesktop.DBus\" member=\"Hello\" error name=\"(unset)\" requested_reply=\"0\" destination=\"org.freedesktop.DBus\" (bus)","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"log.origin":{"file.line":138,"file.name":"cfgfile/list.go"},"service.name":"metricbeat","ecs.version":"1.6.0","log.logger":"centralmgmt","ecs.version":"1.6.0"}
It looks like AppArmor is stopping the dbus Hello message, which isn't something I think I've ever seen before. @amolnater-qasource can you tell me precisely what Ubuntu release this is so I can try to document some kind of workaround? The output of uname -a should be enough.
Hi @fearful-symmetry
Please find the exact host details below:
Further, it is deployed on AWS with Ubuntu 22.04 on ARM64 architecture.
Please let us know if anything else is required from our end.
Thanks!
Huzzah, was able to reproduce this. Interestingly, this only seems to happen with docker, which is probably why we haven't seen this before.
So, we can temporarily work around this by adding --security-opt apparmor=unconfined to the beginning of the docker run command:
docker run --security-opt apparmor=unconfined \
--volume=$(pwd)/metricbeat.yml:/usr/share/metricbeat/metricbeat.yml \
--mount type=bind,source=/proc,target=/hostfs/proc,readonly \
--mount type=bind,source=/sys/fs/cgroup,target=/hostfs/sys/fs/cgroup,readonly \
--mount type=bind,source=/,target=/hostfs,readonly \
--volume /var/run/dbus/system_bus_socket:/hostfs/var/run/dbus/system_bus_socket \
--env DBUS_SYSTEM_BUS_ADDRESS='unix:path=/hostfs/var/run/dbus/system_bus_socket' \
--net=host docker.elastic.co/beats/metricbeat:8.9.0-SNAPSHOT -e --system.hostfs=/hostfs
This doesn't seem like the best solution, and I'd like to come up with a more targeted AppArmor profile.
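To see which AppArmor profile a process inside the container actually runs under, one option on AppArmor-enabled hosts is to read /proc/self/attr/current, which holds a label such as "docker-default (enforce)" (Docker's default profile) or "unconfined". A minimal Go sketch that reads and splits that label, assuming AppArmor is the active security module; parseAppArmorLabel is a hypothetical helper. With the --security-opt apparmor=unconfined workaround, we would expect it to report unconfined.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// parseAppArmorLabel splits a label such as "docker-default (enforce)" into
// the profile name and mode. "unconfined" has no mode suffix.
func parseAppArmorLabel(raw string) (profile, mode string) {
	label := strings.TrimRight(raw, " \t\r\n\x00")
	if i := strings.LastIndex(label, " ("); i >= 0 && strings.HasSuffix(label, ")") {
		return label[:i], label[i+2 : len(label)-1]
	}
	return label, ""
}

func main() {
	// Assumes AppArmor is the active LSM; other LSMs use a different label
	// format in this file.
	data, err := os.ReadFile("/proc/self/attr/current")
	if err != nil {
		fmt.Println("could not read security label:", err)
		os.Exit(1)
	}
	profile, mode := parseAppArmorLabel(string(data))
	fmt.Printf("profile=%q mode=%q\n", profile, mode)
}
```

Comparing the output with and without the workaround would confirm whether the docker-default profile is what blocks the D-Bus Hello call.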
@amolnater-qasource Is this still an issue you face?
Hi @jlind23
We have revalidated this issue on the latest 8.14.0 BC5 Kibana cloud environment and found that it is still reproducible with the actual command:
docker run \
--env FLEET_ENROLL=1 \
--env FLEET_URL=https://<url>cloud.com:443 \
--env FLEET_ENROLLMENT_TOKEN=Q<token>9DUQ== \
--env ELASTIC_AGENT_TAGS=docker,qa \
--rm docker.elastic.co/staging/elastic-agent:8.14.0-eeda34a5
Observations:
Agent Logs: elastic-agent-diagnostics-2024-05-28T08-53-03Z-00.zip
We were expecting this to be fixed as per https://github.com/elastic/elastic-agent/issues/2377#issuecomment-1642673298
Please let us know if anything else is required from our end.
Thanks!
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
Yes this is the same error originally detected in https://github.com/elastic/elastic-agent/issues/2377#issuecomment-1559432181.
Kibana version: 8.7 BC6 Kibana cloud environment
Host OS: Ubuntu 22 ARM64
Build details: VERSION: 8.7 BC6 BUILD: 61051 COMMIT: 04ef24287f26854ad99a46ae983854c6184717cb
Preconditions:
Steps to reproduce:
Note:
Expected Result: Docker agent should remain healthy on adding linux integration.
Screen Recording:
https://user-images.githubusercontent.com/77374876/226260250-08be140b-97e5-4f95-a4ff-65581dbeeede.mp4
Logs: elastic-agent-diagnostics-2023-03-16T17-37-58Z-00.zip elastic-agent-diagnostics-2023-03-16T17-43-24Z-00.zip