canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

memory.stat metrics error #10981

Closed: exsver closed this issue 1 year ago

exsver commented 1 year ago

Required information

Issue description

Sometimes the API call /1.0/metrics (https://linuxcontainers.org/lxd/docs/latest/api/#/metrics/metrics_get) does not return memory.stat metrics for one or more random containers.

See the output in the attachments.

In file metrics_2022-10-02_03-10 the following metrics are absent:

In file metrics_2022-10-02_03-22 the following metrics are absent:

In file metrics_2022-10-02_06-35 the following metrics are absent:

Steps to reproduce

  1. Install a fresh Ubuntu 22.04 on the server.
  2. Install the latest stable LXD (via snap).
  3. Run containers (the more, the better).
  4. Grab metrics from /1.0/metrics every minute (for example, as sketched below).
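
A minimal sketch of step 4, assuming the snap's default unix socket path and curl available on the host (the filename pattern just mirrors the attachments below):

 # Poll the local LXD API once a minute and keep each scrape in a timestamped file.
 while true; do
     curl -s --unix-socket /var/snap/lxd/common/lxd/unix.socket \
         http://localhost/1.0/metrics > "metrics_$(date +%F_%H-%M).txt"
     sleep 60
 done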

Information to attach

metrics_2022-10-02_03-10.txt metrics_2022-10-02_03-22.txt metrics_2022-10-02_06-35.txt
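
A quick way to see which containers are missing memory metrics in one of these files (the lxd_memory_ prefix and the name label are assumptions about LXD's OpenMetrics output):

 # Count memory metric lines per container; containers with a noticeably lower
 # count are the ones whose memory.stat metrics were dropped in that scrape.
 grep '^lxd_memory_' metrics_2022-10-02_03-10.txt \
     | grep -o 'name="[^"]*"' | sort | uniq -c | sort -n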

lxd_logs.partaa.tar.gz lxd_logs.partab.tar.gz lxd_logs.partac.tar.gz lxd_logs.partad.tar.gz lxd_logs.partae.tar.gz lxd_logs.partaf.tar.gz lxd_logs.partag.tar.gz

To join the parts into a single tar.gz:

 cat lxd_logs.parta*.tar.gz >lxd_logs.tar.gz
 
stgraber commented 1 year ago

Sorry, I can't easily load all your logs here. Can you check your lxd.log to see if you're getting errors related to the memory cgroup?

exsver commented 1 year ago

Sorry, I can't easily load all your logs here. Can you check your lxd.log to see if you're getting errors related to the memory cgroup?

There are no errors matching the keyword "cgroup" in /var/snap/lxd/common/lxd/logs/lxd.log.
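
For reference, a check along those lines might look like this (the exact command is an assumption; the path is the snap default):

 # Case-insensitive search for cgroup-related messages in the LXD daemon log.
 grep -i cgroup /var/snap/lxd/common/lxd/logs/lxd.log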

Part of the container's log:

...
lxc ct01 20221002063501.582 TRACE    commands - ../src/src/lxc/commands.c:lxc_cmd_accept:2102 - Accepted new client as fd 13 on command server fd 9
lxc ct01 20221002063501.582 TRACE    commands - ../src/src/lxc/commands.c:lxc_cmd_handler:2065 - Processed command get_state; cleaning up client fd 13
lxc ct01 20221002063501.582 TRACE    mainloop - ../src/src/lxc/mainloop.c:disarm_handler:65 - Disarmed 13 for "lxc_cmd_handler" handler
lxc ct01 20221002063501.582 TRACE    commands - ../src/src/lxc/commands.c:lxc_cmd_cleanup_handler:2005 - Closing client fd 13 for "lxc_cmd_cleanup_handler"
lxc ct01 20221002063501.582 TRACE    mainloop - ../src/src/lxc/mainloop.c:delete_handler:79 - Deleted 13 for "lxc_cmd_handler" handler
lxc 20221002063501.584 DEBUG    commands - ../src/src/lxc/commands.c:lxc_cmd_rsp_recv_fds:148 - Command "get_state" received response
lxc 20221002063501.584 TRACE    commands - ../src/src/lxc/commands.c:lxc_cmd:521 - Opened new command socket connection fd 23 for command "get_state"
lxc 20221002063501.584 DEBUG    commands - ../src/src/lxc/commands.c:lxc_cmd_get_state:1063 - Container "ct01" is in "RUNNING" state
lxc ct01 20221002063501.585 TRACE    commands - ../src/src/lxc/commands.c:lxc_cmd_accept:2102 - Accepted new client as fd 13 on command server fd 9
lxc ct01 20221002063501.585 TRACE    commands - ../src/src/lxc/commands.c:lxc_cmd_handler:2065 - Processed command get_state; cleaning up client fd 13
lxc ct01 20221002063501.585 TRACE    mainloop - ../src/src/lxc/mainloop.c:disarm_handler:65 - Disarmed 13 for "lxc_cmd_handler" handler
lxc ct01 20221002063501.585 TRACE    commands - ../src/src/lxc/commands.c:lxc_cmd_cleanup_handler:2005 - Closing client fd 13 for "lxc_cmd_cleanup_handler"
lxc ct01 20221002063501.585 TRACE    mainloop - ../src/src/lxc/mainloop.c:delete_handler:79 - Deleted 13 for "lxc_cmd_handler" handler
lxc 20221002063501.585 DEBUG    commands - ../src/src/lxc/commands.c:lxc_cmd_rsp_recv_fds:148 - Command "get_state" received response
lxc 20221002063501.585 TRACE    commands - ../src/src/lxc/commands.c:lxc_cmd:521 - Opened new command socket connection fd 32 for command "get_state"
lxc 20221002063501.585 DEBUG    commands - ../src/src/lxc/commands.c:lxc_cmd_get_state:1063 - Container "ct01" is in "RUNNING" state
lxc ct01 20221002063501.585 TRACE    commands - ../src/src/lxc/commands.c:lxc_cmd_accept:2102 - Accepted new client as fd 13 on command server fd 9
lxc ct01 20221002063501.585 TRACE    commands - ../src/src/lxc/commands.c:lxc_cmd_handler:2065 - Processed command get_limit_cgroup_fd; cleaning up client fd 13
lxc ct01 20221002063501.585 TRACE    mainloop - ../src/src/lxc/mainloop.c:disarm_handler:65 - Disarmed 13 for "lxc_cmd_handler" handler
lxc ct01 20221002063501.585 TRACE    commands - ../src/src/lxc/commands.c:lxc_cmd_cleanup_handler:2005 - Closing client fd 13 for "lxc_cmd_cleanup_handler"
lxc ct01 20221002063501.585 TRACE    mainloop - ../src/src/lxc/mainloop.c:delete_handler:79 - Deleted 13 for "lxc_cmd_handler" handler
lxc 20221002063501.585 DEBUG    commands - ../src/src/lxc/commands.c:lxc_cmd_rsp_recv_fds:137 - Received exact number of file descriptors 1 == 1 for command "get_limit_cgroup_fd"
lxc 20221002063501.585 TRACE    commands - ../src/src/lxc/commands.c:lxc_cmd:521 - Opened new command socket connection fd 32 for command "get_limit_cgroup_fd"
lxc 20221002063501.585 TRACE    cgfsng - ../src/src/lxc/cgroups/cgfsng.c:cgroup_get:4035 - Reading memory.max from unified cgroup hierarchy
lxc ct01 20221002063501.585 TRACE    commands - ../src/src/lxc/commands.c:lxc_cmd_accept:2102 - Accepted new client as fd 13 on command server fd 9
lxc ct01 20221002063501.585 TRACE    commands - ../src/src/lxc/commands.c:lxc_cmd_handler:2065 - Processed command get_state; cleaning up client fd 13
lxc ct01 20221002063501.585 TRACE    mainloop - ../src/src/lxc/mainloop.c:disarm_handler:65 - Disarmed 13 for "lxc_cmd_handler" handler
lxc ct01 20221002063501.585 TRACE    commands - ../src/src/lxc/commands.c:lxc_cmd_cleanup_handler:2005 - Closing client fd 13 for "lxc_cmd_cleanup_handler"
lxc ct01 20221002063501.585 TRACE    mainloop - ../src/src/lxc/mainloop.c:delete_handler:79 - Deleted 13 for "lxc_cmd_handler" handler
lxc 20221002063501.586 DEBUG    commands - ../src/src/lxc/commands.c:lxc_cmd_rsp_recv_fds:148 - Command "get_state" received response
lxc 20221002063501.586 TRACE    commands - ../src/src/lxc/commands.c:lxc_cmd:521 - Opened new command socket connection fd 32 for command "get_state"
lxc 20221002063501.586 DEBUG    commands - ../src/src/lxc/commands.c:lxc_cmd_get_state:1063 - Container "ct01" is in "RUNNING" state
lxc ct01 20221002063501.586 TRACE    commands - ../src/src/lxc/commands.c:lxc_cmd_accept:2102 - Accepted new client as fd 13 on command server fd 9
lxc ct01 20221002063501.586 TRACE    commands - ../src/src/lxc/commands.c:lxc_cmd_handler:2065 - Processed command get_limit_cgroup_fd; cleaning up client fd 13
lxc ct01 20221002063501.586 TRACE    mainloop - ../src/src/lxc/mainloop.c:disarm_handler:65 - Disarmed 13 for "lxc_cmd_handler" handler
lxc ct01 20221002063501.586 TRACE    commands - ../src/src/lxc/commands.c:lxc_cmd_cleanup_handler:2005 - Closing client fd 13 for "lxc_cmd_cleanup_handler"
lxc ct01 20221002063501.586 TRACE    mainloop - ../src/src/lxc/mainloop.c:delete_handler:79 - Deleted 13 for "lxc_cmd_handler" handler
lxc 20221002063501.586 DEBUG    commands - ../src/src/lxc/commands.c:lxc_cmd_rsp_recv_fds:137 - Received exact number of file descriptors 1 == 1 for command "get_limit_cgroup_fd"
lxc 20221002063501.586 TRACE    commands - ../src/src/lxc/commands.c:lxc_cmd:521 - Opened new command socket connection fd 37 for command "get_limit_cgroup_fd"
lxc 20221002063501.586 TRACE    cgfsng - ../src/src/lxc/cgroups/cgfsng.c:cgroup_get:4035 - Reading memory.max from unified cgroup hierarchy
...

When the metrics are absent, cg.rw.Get(version, "memory", "memory.stat") returns ("", nil):

https://github.com/lxc/lxd/blob/master/lxd/cgroup/abstraction.go#L842

 stats, err = cg.rw.Get(version, "memory", "memory.stat")

If you want, I can provide SSH access to the test server.

stgraber commented 1 year ago

It would be interesting to track down the actual memory.stat file for the container and see if it's the kernel returning bad or empty data here.

You can get the container's PID with lxc info NAME, then look at /proc/PID/cgroup to get the cgroup path, and finally look for that under /sys/fs/cgroup/memory to access the actual memory.stat file.

If that also shows up empty occasionally, then there's nothing we can do about it.
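
A sketch of that procedure as shell commands, assuming a cgroup v2 host (on cgroup v1 the controller lives under /sys/fs/cgroup/memory as mentioned above) and a container named ct01; the container name and the awk field handling are illustrative:

 # 1. Get the container's init PID.
 PID=$(lxc info ct01 | awk '/^PID:/ {print $2}')
 # 2. Get its cgroup path (the cgroup v2 entry in /proc/PID/cgroup looks like "0::/path").
 CG=$(awk -F: '$1 == "0" {print $3}' /proc/$PID/cgroup)
 # 3. Read the kernel's memory.stat for that cgroup directly.
 cat /sys/fs/cgroup$CG/memory.stat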

exsver commented 1 year ago

It would be interesting to track down the actual memory.stat file for the container and see if it's the kernel returning bad or empty data here.

You can get the container's PID with lxc info NAME, then look at /proc/PID/cgroup to get the cgroup path, and finally look for that under /sys/fs/cgroup/memory to access the actual memory.stat file.

If that also shows up empty occasionally, then there's nothing we can do about it.

I tested this. The kernel always returns the correct memory.stat file.

exsver commented 1 year ago

Sorry, I can't easily load all your logs here. Can you check your lxd.log to see if you're getting errors related to the memory cgroup?

Truncated logs: 2022-10-02_03-00_03-30_lxd.log 2022-10-02_06-30_06-40_lxd.log

2022-10-02_06-30_06-40_ct01.log 2022-10-02_03-00_03-20_ct05.log 2022-10-02_03-00_03-30_ct10.log 2022-10-02_03-00_03-20_ct19.log

MaximMonin commented 1 year ago

We tested it on 4 different servers with Ubuntu 22.04 and the LXD snap with cgroup v2, and every server had metric drops. There are no messages in the debug logs, so LXD didn't see any errors. On some servers there were drops in 1 container every 1-3 hours; on others, every 15-30 minutes with 1-4 drops. We spent 2 weeks trying to solve this, but failed.

tomponline commented 1 year ago

Hi @MaximMonin, we've made some improvements to the cgroup handling for metrics in recent releases. Please can you confirm whether you still experience this on LXD >= 5.15?

Thanks

MaximMonin commented 1 year ago

Right now we have rewritten our metric collector without using the LXD API, so we do not use it in production. Maybe we will try it again later.

tomponline commented 1 year ago

OK, I'll close this for now, but do let us know if you experience the issue again and we can reopen.