Closed: exsver closed this issue 1 year ago.
Sorry, can't easily load all your logs here; can you check your `lxd.log` to see if you're getting errors related to the memory cgroup?
There are no errors matching the keyword "cgroup" in `/var/snap/lxd/common/lxd/logs/lxd.log`.
Part of the container's log:

```
...
lxc ct01 20221002063501.582 TRACE commands - ../src/src/lxc/commands.c:lxc_cmd_accept:2102 - Accepted new client as fd 13 on command server fd 9
lxc ct01 20221002063501.582 TRACE commands - ../src/src/lxc/commands.c:lxc_cmd_handler:2065 - Processed command get_state; cleaning up client fd 13
lxc ct01 20221002063501.582 TRACE mainloop - ../src/src/lxc/mainloop.c:disarm_handler:65 - Disarmed 13 for "lxc_cmd_handler" handler
lxc ct01 20221002063501.582 TRACE commands - ../src/src/lxc/commands.c:lxc_cmd_cleanup_handler:2005 - Closing client fd 13 for "lxc_cmd_cleanup_handler"
lxc ct01 20221002063501.582 TRACE mainloop - ../src/src/lxc/mainloop.c:delete_handler:79 - Deleted 13 for "lxc_cmd_handler" handler
lxc 20221002063501.584 DEBUG commands - ../src/src/lxc/commands.c:lxc_cmd_rsp_recv_fds:148 - Command "get_state" received response
lxc 20221002063501.584 TRACE commands - ../src/src/lxc/commands.c:lxc_cmd:521 - Opened new command socket connection fd 23 for command "get_state"
lxc 20221002063501.584 DEBUG commands - ../src/src/lxc/commands.c:lxc_cmd_get_state:1063 - Container "ct01" is in "RUNNING" state
lxc ct01 20221002063501.585 TRACE commands - ../src/src/lxc/commands.c:lxc_cmd_accept:2102 - Accepted new client as fd 13 on command server fd 9
lxc ct01 20221002063501.585 TRACE commands - ../src/src/lxc/commands.c:lxc_cmd_handler:2065 - Processed command get_state; cleaning up client fd 13
lxc ct01 20221002063501.585 TRACE mainloop - ../src/src/lxc/mainloop.c:disarm_handler:65 - Disarmed 13 for "lxc_cmd_handler" handler
lxc ct01 20221002063501.585 TRACE commands - ../src/src/lxc/commands.c:lxc_cmd_cleanup_handler:2005 - Closing client fd 13 for "lxc_cmd_cleanup_handler"
lxc ct01 20221002063501.585 TRACE mainloop - ../src/src/lxc/mainloop.c:delete_handler:79 - Deleted 13 for "lxc_cmd_handler" handler
lxc 20221002063501.585 DEBUG commands - ../src/src/lxc/commands.c:lxc_cmd_rsp_recv_fds:148 - Command "get_state" received response
lxc 20221002063501.585 TRACE commands - ../src/src/lxc/commands.c:lxc_cmd:521 - Opened new command socket connection fd 32 for command "get_state"
lxc 20221002063501.585 DEBUG commands - ../src/src/lxc/commands.c:lxc_cmd_get_state:1063 - Container "ct01" is in "RUNNING" state
lxc ct01 20221002063501.585 TRACE commands - ../src/src/lxc/commands.c:lxc_cmd_accept:2102 - Accepted new client as fd 13 on command server fd 9
lxc ct01 20221002063501.585 TRACE commands - ../src/src/lxc/commands.c:lxc_cmd_handler:2065 - Processed command get_limit_cgroup_fd; cleaning up client fd 13
lxc ct01 20221002063501.585 TRACE mainloop - ../src/src/lxc/mainloop.c:disarm_handler:65 - Disarmed 13 for "lxc_cmd_handler" handler
lxc ct01 20221002063501.585 TRACE commands - ../src/src/lxc/commands.c:lxc_cmd_cleanup_handler:2005 - Closing client fd 13 for "lxc_cmd_cleanup_handler"
lxc ct01 20221002063501.585 TRACE mainloop - ../src/src/lxc/mainloop.c:delete_handler:79 - Deleted 13 for "lxc_cmd_handler" handler
lxc 20221002063501.585 DEBUG commands - ../src/src/lxc/commands.c:lxc_cmd_rsp_recv_fds:137 - Received exact number of file descriptors 1 == 1 for command "get_limit_cgroup_fd"
lxc 20221002063501.585 TRACE commands - ../src/src/lxc/commands.c:lxc_cmd:521 - Opened new command socket connection fd 32 for command "get_limit_cgroup_fd"
lxc 20221002063501.585 TRACE cgfsng - ../src/src/lxc/cgroups/cgfsng.c:cgroup_get:4035 - Reading memory.max from unified cgroup hierarchy
lxc ct01 20221002063501.585 TRACE commands - ../src/src/lxc/commands.c:lxc_cmd_accept:2102 - Accepted new client as fd 13 on command server fd 9
lxc ct01 20221002063501.585 TRACE commands - ../src/src/lxc/commands.c:lxc_cmd_handler:2065 - Processed command get_state; cleaning up client fd 13
lxc ct01 20221002063501.585 TRACE mainloop - ../src/src/lxc/mainloop.c:disarm_handler:65 - Disarmed 13 for "lxc_cmd_handler" handler
lxc ct01 20221002063501.585 TRACE commands - ../src/src/lxc/commands.c:lxc_cmd_cleanup_handler:2005 - Closing client fd 13 for "lxc_cmd_cleanup_handler"
lxc ct01 20221002063501.585 TRACE mainloop - ../src/src/lxc/mainloop.c:delete_handler:79 - Deleted 13 for "lxc_cmd_handler" handler
lxc 20221002063501.586 DEBUG commands - ../src/src/lxc/commands.c:lxc_cmd_rsp_recv_fds:148 - Command "get_state" received response
lxc 20221002063501.586 TRACE commands - ../src/src/lxc/commands.c:lxc_cmd:521 - Opened new command socket connection fd 32 for command "get_state"
lxc 20221002063501.586 DEBUG commands - ../src/src/lxc/commands.c:lxc_cmd_get_state:1063 - Container "ct01" is in "RUNNING" state
lxc ct01 20221002063501.586 TRACE commands - ../src/src/lxc/commands.c:lxc_cmd_accept:2102 - Accepted new client as fd 13 on command server fd 9
lxc ct01 20221002063501.586 TRACE commands - ../src/src/lxc/commands.c:lxc_cmd_handler:2065 - Processed command get_limit_cgroup_fd; cleaning up client fd 13
lxc ct01 20221002063501.586 TRACE mainloop - ../src/src/lxc/mainloop.c:disarm_handler:65 - Disarmed 13 for "lxc_cmd_handler" handler
lxc ct01 20221002063501.586 TRACE commands - ../src/src/lxc/commands.c:lxc_cmd_cleanup_handler:2005 - Closing client fd 13 for "lxc_cmd_cleanup_handler"
lxc ct01 20221002063501.586 TRACE mainloop - ../src/src/lxc/mainloop.c:delete_handler:79 - Deleted 13 for "lxc_cmd_handler" handler
lxc 20221002063501.586 DEBUG commands - ../src/src/lxc/commands.c:lxc_cmd_rsp_recv_fds:137 - Received exact number of file descriptors 1 == 1 for command "get_limit_cgroup_fd"
lxc 20221002063501.586 TRACE commands - ../src/src/lxc/commands.c:lxc_cmd:521 - Opened new command socket connection fd 37 for command "get_limit_cgroup_fd"
lxc 20221002063501.586 TRACE cgfsng - ../src/src/lxc/cgroups/cgfsng.c:cgroup_get:4035 - Reading memory.max from unified cgroup hierarchy
...
```
When the metrics are absent, `cg.rw.Get(version, "memory", "memory.stat")` returns `("", nil)`:
https://github.com/lxc/lxd/blob/master/lxd/cgroup/abstraction.go#L842
`stats, err = cg.rw.Get(version, "memory", "memory.stat")`
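To illustrate why an empty read produces silently missing metrics rather than an error, here is a minimal Go sketch. `parseMemoryStat` is a hypothetical helper, not LXD code; it mirrors how a collector would turn raw `memory.stat` contents into key/value metrics, so an empty string from `Get` (with a `nil` error) simply yields zero metrics and nothing to log:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMemoryStat converts memory.stat contents ("<key> <value>" per line)
// into a map. An empty input produces an empty map with no error, which is
// indistinguishable from a container that genuinely reported nothing.
func parseMemoryStat(raw string) map[string]uint64 {
	stats := make(map[string]uint64)
	for _, line := range strings.Split(raw, "\n") {
		fields := strings.Fields(line)
		if len(fields) != 2 {
			continue
		}
		v, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			continue
		}
		stats[fields[0]] = v
	}
	return stats
}

func main() {
	fmt.Println(len(parseMemoryStat("anon 4096\nfile 8192"))) // 2
	fmt.Println(len(parseMemoryStat("")))                     // 0: metrics silently dropped
}
```

This is why the drop shows up only as absent samples in the scrape, never as an error in `lxd.log`.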
If you want, I can provide SSH access to a test server.
It would be interesting to track down the actual `memory.stat` file for the container and see if it's the kernel returning bad or empty data here. You can get the container's PID with `lxc info NAME`, then look at `/proc/PID/cgroup` to get the cgroup path, and finally look for that under `/sys/fs/cgroup/memory` to access the actual `memory.stat` file.
If that also shows up empty occasionally, then there's nothing we can do about it.
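The manual lookup described above can be sketched in Go. `memoryStatPath` is a hypothetical helper (not part of LXD) that resolves the `memory.stat` location from the contents of `/proc/PID/cgroup`, handling both a cgroup v1 `memory` controller entry (`N:memory:/path`) and the cgroup v2 unified entry (`0::/path`):

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// memoryStatPath maps /proc/PID/cgroup contents to the memory.stat file.
// On v1 the memory controller is mounted under /sys/fs/cgroup/memory; on a
// pure v2 host the file lives directly under /sys/fs/cgroup/<path>.
func memoryStatPath(procCgroup string) string {
	unified := ""
	for _, line := range strings.Split(procCgroup, "\n") {
		parts := strings.SplitN(line, ":", 3)
		if len(parts) != 3 {
			continue
		}
		if parts[1] == "memory" { // cgroup v1 memory controller entry
			return filepath.Join("/sys/fs/cgroup/memory", parts[2], "memory.stat")
		}
		if parts[0] == "0" && parts[1] == "" { // cgroup v2 unified entry
			unified = filepath.Join("/sys/fs/cgroup", parts[2], "memory.stat")
		}
	}
	return unified
}

func main() {
	// Example v2 entry as seen on an LXD host (path is illustrative).
	fmt.Println(memoryStatPath("0::/lxc.payload.ct01"))
}
```

Reading the resolved file in a loop while the scrape runs would show whether the kernel ever returns empty data.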
I tested this. The kernel always returns the correct memory.stat file.
Truncated logs: 2022-10-02_03-00_03-30_lxd.log 2022-10-02_06-30_06-40_lxd.log
2022-10-02_06-30_06-40_ct01.log 2022-10-02_03-00_03-20_ct05.log 2022-10-02_03-00_03-30_ct10.log 2022-10-02_03-00_03-20_ct19.log
We tested this on 4 different servers running Ubuntu 22.04 with the LXD snap and cgroup v2, and every server had metric drops. There are no messages in the debug logs, so LXD did not see any errors. On some servers there were drops for 1 container every 1-3 hours; on others, every 15-30 minutes with 1-4 drops. We spent 2 weeks trying to solve this, but failed.
Hi @MaximMonin, we've made some improvements to the cgroup handling for metrics in recent releases. Please can you confirm whether you still experience this on LXD >= 5.15?
Thanks
We have since rewritten our metric collector without using the LXD API, so we no longer use it in production. Maybe we will try it again later.
OK, I'll close this for now, but do let us know if you experience the issue again and we can reopen.
Required information
Issue description
Sometimes the API call `/1.0/metrics` (https://linuxcontainers.org/lxd/docs/latest/api/#/metrics/metrics_get) does not return memory.stat metrics for one or more random containers.
See the output in the attachments.
In the file metrics_2022-10-02_03-10, the following metrics are absent:
In the file metrics_2022-10-02_03-22, the following metrics are absent:
In the file metrics_2022-10-02_06:35, the following metrics are absent:
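One way to spot such drops automatically is to scan a `/1.0/metrics` scrape for containers that appear in the output but have no sample in the memory family. This is a hedged sketch: `containersMissingFamily` is not an LXD API, and the sample metric names (`lxd_cpu_seconds_total`, `lxd_memory_Active_bytes`) and the `name` label are assumptions about LXD's Prometheus output:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// containersMissingFamily returns the container names that occur anywhere
// in the Prometheus-format text but have no sample whose metric name
// starts with the given family prefix.
func containersMissingFamily(metricsText, family string) []string {
	all := map[string]bool{}
	have := map[string]bool{}
	for _, line := range strings.Split(metricsText, "\n") {
		if strings.HasPrefix(line, "#") {
			continue // skip HELP/TYPE comments
		}
		name := extractLabel(line, "name")
		if name == "" {
			continue
		}
		all[name] = true
		if strings.HasPrefix(line, family) {
			have[name] = true
		}
	}
	missing := []string{}
	for name := range all {
		if !have[name] {
			missing = append(missing, name)
		}
	}
	sort.Strings(missing)
	return missing
}

// extractLabel pulls a label value (label="...") out of one sample line.
func extractLabel(line, label string) string {
	idx := strings.Index(line, label+`="`)
	if idx < 0 {
		return ""
	}
	rest := line[idx+len(label)+2:]
	end := strings.Index(rest, `"`)
	if end < 0 {
		return ""
	}
	return rest[:end]
}

func main() {
	scrape := `lxd_cpu_seconds_total{name="ct01"} 1.5
lxd_cpu_seconds_total{name="ct02"} 2.5
lxd_memory_Active_bytes{name="ct01"} 4096`
	fmt.Println(containersMissingFamily(scrape, "lxd_memory_")) // [ct02]
}
```

Running such a check against each saved metrics file would enumerate exactly which containers dropped their memory samples in each scrape.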
Steps to reproduce
Information to attach
metrics_2022-10-02_03-10.txt metrics_2022-10-02_03-22.txt metrics_2022-10-02_06-35.txt
lxd_logs.partaa.tar.gz lxd_logs.partab.tar.gz lxd_logs.partac.tar.gz lxd_logs.partad.tar.gz lxd_logs.partae.tar.gz lxd_logs.partaf.tar.gz lxd_logs.partag.tar.gz
To join the tar.gz parts: