canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

After upgrade to 5.4 some ct don't have USER FQDN and DISK USAGE in output of lxc list #10705

Closed · gribchenko closed this issue 2 years ago

gribchenko commented 2 years ago

Required information

Issue description

After upgrading to 5.4, some containers don't show USER FQDN and DISK USAGE in the output of lxc list -c p,user.fqdn,nD:

+---------+---------------------+--------+------------+
|   PID   |      USER FQDN      |  NAME  | DISK USAGE |
+---------+---------------------+--------+------------+
| 1733    | es206..net          | ct3029 | 40.29GiB   |
+---------+---------------------+--------+------------+
| 4248    | es360..net          | ct3093 | 23.83GiB   |
+---------+---------------------+--------+------------+
| 5682    | vs2141..net         | ct3095 | 1.80GiB    |
+---------+---------------------+--------+------------+
| 7414    | vs2142..net         | ct3105 | 13.31GiB   |
+---------+---------------------+--------+------------+
| 12161   | vs282..net          | ct3141 | 7.50GiB    |
+---------+---------------------+--------+------------+
| 13615   | vs2144..net         | ct3160 | 5.45GiB    |
+---------+---------------------+--------+------------+
| 15423   | vs2145.*.net        | ct3163 | 2.63GiB    |
+---------+---------------------+--------+------------+
| 18865   |                     | ct3179 |            |
+---------+---------------------+--------+------------+
| 20697   |                     | ct3271 |            |
+---------+---------------------+--------+------------+
| 22448   |                     | ct3286 |            |
+---------+---------------------+--------+------------+
| 23574   |                     | ct3310 |            |
+---------+---------------------+--------+------------+
| 24547   |                     | ct3360 |            |
+---------+---------------------+--------+------------+
| 27561   |                     | ct3377 |            |
+---------+---------------------+--------+------------+
| 28742   |                     | ct3381 |            |

Steps to reproduce

  1. Run lxc list -c p,user.fqdn,nD.
  2. Observe that ct3179 has no "USER FQDN" or "DISK USAGE".

Information to attach

Resources:
  Processes: 106
  Disk usage:
    root: 8.29GiB
  CPU usage:
    CPU usage (in seconds): 317023
  Memory usage:
    Memory (current): 893.89MiB
    Memory (peak): 2.11GiB
  Network usage:
    eth0:
      Type: broadcast
      State: UP
      Host interface: vlan210
      MAC address: 76:d8:b6:9b:67:c9
      MTU: 1500
      Bytes received: 45.78GB
      Bytes sent: 6.11GB
      Packets received: 38740154
      Packets sent: 39084301
      IP addresses:
        inet: ..*.129/32 (global)
    lo:
      Type: loopback
      State: UP
      MTU: 65536
      Bytes received: 2.20GB
      Bytes sent: 2.20GB
      Packets received: 9749391
      Packets sent: 9749391
      IP addresses:
        inet: 127.0.0.1/8 (local)

Log:

tomponline commented 2 years ago

What is the storage pool type? Please can you show the output of lxc storage show <pool>?

Is it an LVM thinpool?

gribchenko commented 2 years ago

lxc storage show default

config:
  rsync.compression: "false"
  source: /var/lib/lxd/storage-pools/default
description: Default DIR storage backend
name: default
driver: dir
used_by:

gribchenko commented 2 years ago

New update...

The problem exists when the node has more than 256 containers. With fewer, everything is OK.

tomponline commented 2 years ago

Are there any warnings/errors in the logs? /var/snap/lxd/common/lxd/logs/lxd.log?

MaximMonin commented 2 years ago

https://github.com/lxc/lxd/blob/master/lxd/db/instances.go#L335

There aren't any errors in lxd.log.

Output of lxc query -X GET --wait /1.0/containers?recursion=2 with <256 containers:

        {
                "architecture": "x86_64",
                "backups": null,
                "config": {
                        "boot.autostart": "true",
                        "image.architecture": "x86_64",
                        "image.description": "Debian 11 (Bullseye) eVPS containe                                                                                                                                                                             r",
                        "image.os": "debian",
                        "image.release": "bullseye",
                        "security.idmap.base": "64648500",
                        "user.fqdn": "xxx",
                        "volatile.base_image": "845ba18a6d01676d020e5c6c15d52c75                                                                                                                                                                             ab9e0ef87fa84f4d025ad508be6ab32b",
                        "volatile.cloud-init.instance-id": "4cb26631-264b-4151-b                                                                                                                                                                             066-0dc5a03d8e29",
                        "volatile.eth0.host_name": "lxd2fcd84e6",
                        "volatile.eth0.last_state.created": "false",
                        "volatile.idmap.base": "64648500",
                        "volatile.idmap.current": "[{\"Isuid\":true,\"Isgid\":fa                                                                                                                                                                             lse,\"Hostid\":64648500,\"Nsid\":0,\"Maprange\":65536},{\"Isuid\":false,\"Isgid\                                                                                                                                                                             ":true,\"Hostid\":64648500,\"Nsid\":0,\"Maprange\":65536}]",
                        "volatile.idmap.next": "[{\"Isuid\":true,\"Isgid\":false                                                                                                                                                                             ,\"Hostid\":64648500,\"Nsid\":0,\"Maprange\":65536},{\"Isuid\":false,\"Isgid\":t                                                                                                                                                                             rue,\"Hostid\":64648500,\"Nsid\":0,\"Maprange\":65536}]",
                        "volatile.last_state.idmap": "[{\"Isuid\":true,\"Isgid\"                                                                                                                                                                             :false,\"Hostid\":64648500,\"Nsid\":0,\"Maprange\":65536},{\"Isuid\":false,\"Isg                                                                                                                                                                             id\":true,\"Hostid\":64648500,\"Nsid\":0,\"Maprange\":65536}]",
                        "volatile.last_state.power": "RUNNING",
                        "volatile.uuid": "e8cdea3e-4894-4d47-8209-9dead7316aa4"
                },
                "created_at": "2022-05-09T12:31:32.758011665Z",
                "description": "",
                "devices": {
                        "eth0": {
                                "ipv4.address": "xxx",
                                "ipv4.host_table": "4209",
                                "ipv6.host_table": "6209",
                                "mtu": "1500",
                                "name": "eth0",
                                "nictype": "ipvlan",
                                "parent": "vlan209",
                                "type": "nic"
                        }
                },

Output with 256+ containers:

        },
        {
                "architecture": "x86_64",
                "backups": null,
                "config": null,
                "created_at": "2022-05-04T14:56:08.115232574Z",
                "description": "",
                "devices": {},
                "ephemeral": false,
                "last_used_at": "2022-07-28T07:31:22.079843942Z",
                "location": "none",
                "name": "ct9924",
                "profiles": [],
                "project": "default",
                "snapshots": null,

I checked the query optimization, because the 5.3 instance list was very slow; the SQL queries use IN (instanceList)...
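To illustrate why the instance count matters here, a minimal Go sketch of that pattern. This is not the actual LXD code: the buildInstanceFilter helper and the instances table/column names are made up for illustration. The point is that an IN (...) filter built with one placeholder per instance ID carries one bound parameter per container, so any per-statement parameter limit in the SQL layer is hit once the node grows past that size.

```go
package main

import (
	"fmt"
	"strings"
)

// buildInstanceFilter is a hypothetical helper: it builds a parameterised
// "IN (?, ?, ...)" clause with one placeholder per instance ID. With 256+
// instances the statement carries 256+ bound parameters, which is where a
// per-statement parameter limit in the underlying SQL layer would bite.
func buildInstanceFilter(ids []int) (string, []any) {
	placeholders := make([]string, len(ids))
	args := make([]any, len(ids))
	for i, id := range ids {
		placeholders[i] = "?"
		args[i] = id
	}
	stmt := fmt.Sprintf(
		"SELECT id, name FROM instances WHERE id IN (%s)",
		strings.Join(placeholders, ", "),
	)
	return stmt, args
}

func main() {
	ids := make([]int, 300) // e.g. a node with 300 containers
	for i := range ids {
		ids[i] = i + 1
	}
	stmt, args := buildInstanceFilter(ids)
	fmt.Println(stmt)
	fmt.Println("bound parameters:", len(args))
}
```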

The output of lxc query -X GET --wait /1.0/containers/ct9924?recursion=2 is correct.

tomponline commented 2 years ago

Interesting. I looked at the sqlite default limits for query placeholders:

https://www.sqlite.org/limits.html

See point 9:

the maximum value of a host parameter number is SQLITE_MAX_VARIABLE_NUMBER, which defaults to 999 for SQLite versions prior to 3.32.0 (2020-05-22) or 32766 for SQLite versions after 3.32.0.

So we should be OK, unless this is a dqlite limitation. cc @MathieuBordere

tomponline commented 2 years ago

I would have expected to get a query error too if we were exceeding the limit somehow.

MaximMonin commented 2 years ago

We just generated 255 empty containers, then added and removed one container to reproduce the bug.

tomponline commented 2 years ago

Thanks, I'll try to reproduce here, but initially this feels like it could be a dqlite bug. Let's see.

MaximMonin commented 2 years ago

As a side effect, on reboot of a node with 256+ containers "boot.autostart": "true" is invisible and we have to start the containers manually.

tomponline commented 2 years ago

Yeah, it's not good. I think a workaround is to build the IN statement manually without using query placeholders. Working on this now.
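For illustration, a rough sketch of that approach (not the actual fix; the buildInlineInClause helper and the instances table are hypothetical), assuming the IDs are integers that come from the database itself, so rendering them directly into the SQL text instead of binding placeholders is safe:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// buildInlineInClause renders the instance IDs directly into the SQL text
// rather than binding them as placeholders, so the statement carries zero
// bound parameters regardless of how many instances the node has. This is
// only safe because the IDs are integers produced by the database itself,
// not user-supplied strings.
func buildInlineInClause(ids []int64) string {
	parts := make([]string, len(ids))
	for i, id := range ids {
		parts[i] = strconv.FormatInt(id, 10)
	}
	return fmt.Sprintf(
		"SELECT id, name FROM instances WHERE id IN (%s)",
		strings.Join(parts, ", "),
	)
}

func main() {
	ids := []int64{1, 2, 3, 300}
	fmt.Println(buildInlineInClause(ids))
}
```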

tomponline commented 2 years ago

I'm going to test up to 512 instances.

tomponline commented 2 years ago

Got a potential fix ^

tomponline commented 2 years ago

@MaximMonin I've tested my fix up to 512 instances and it seems to work fine.

MaximMonin commented 2 years ago

We are going to rebuild our new Debian package with this hotfix now and test it.

tomponline commented 2 years ago

> We are going to rebuild our new Debian package with this hotfix now and test it.

Thanks!

tomponline commented 2 years ago

@MaximMonin you only need https://github.com/lxc/lxd/pull/10706/commits/a0a49a91a6a8702cf645675daaac9f3175fb9c8b; the others are additional cleanup/improvements I noticed.

MaximMonin commented 2 years ago

We rebuilt the Debian package with 5.4.tar.gz as the source and patched lxd/db/instances.go. It seems lxc list and lxc query are working OK now.

Thanks!

tomponline commented 2 years ago

Excellent, thanks for testing.