canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

Running lxc profile list on a system with lots of profiles results in table: context deadline exceeded #13401

Open webdock-io opened 5 months ago

webdock-io commented 5 months ago

Ubuntu Jammy LXD v5.0.3 and LXD 5.21.1

Running lxc profile list on a system with lots of profiles results in the following:

root@lxdremote:~# lxc profile list
Error: Failed to fetch from "profile_device_config" table: Failed to fetch from "profile_device_config" table: context deadline exceeded

Running lxd sql global .dump returns almost immediately and lists all data in the database

We have a real use case for supporting a lot of profiles in a remote (we allow our customers to build their own)

Adding and deleting individual profiles seems to work, although it's hard to confirm deletion when we can't list them with lxd.

Is there any way to increase the timeout in lxd to allow listing our profiles (a large list, and one that will only grow larger)? We could start hacking away at SQL queries, but I'd much rather be able to just do an lxc profile list

(this use case came up as we actually wanted to make sure the list was cleaned up so any unused profiles were removed)

tomponline commented 5 months ago

This is probably due to inefficient queries. Let's look into making fewer queries that return the info needed.

tomponline commented 5 months ago

How many profiles is "lots" in this case?

webdock-io commented 5 months ago

Thanks for the quick reply.

I actually don't know as I can't list them - maybe I could get a count from the dump - but I'd say we are in the hundreds if not 1K+

tomponline commented 5 months ago

Try doing lxd sql global 'select * from profiles' or lxd sql global 'select count(*) from profiles'

webdock-io commented 5 months ago

Count gives me 616

just doing the select and dumping the table is pretty quick:

time lxd sql global 'select * from profiles'
... stuff
real    0m0.166s
user    0m0.044s
sys     0m0.064s
tomponline commented 5 months ago

Count gives me 616

just doing the select and dumping the table is pretty quick:

time lxd sql global 'select * from profiles'
... stuff
real    0m0.166s
user    0m0.044s
sys     0m0.064s

Cool thanks.

Suspect it's doing a separate query to get each profile's config, rather than a single query with multiple profile IDs and then separating the results in LXD.
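
For illustration, here is a rough Go sketch of that difference (the column names are guesses and this is not LXD's actual query code): the first pattern pays one database round trip per profile, the second pays one in total.

package main

import (
	"context"
	"database/sql"
	"strings"
)

// Slow pattern (sketch only): one query per profile, so latency multiplies with the profile count.
func configsOneByOne(ctx context.Context, db *sql.DB, profileIDs []int) (map[int][]string, error) {
	out := make(map[int][]string)
	for _, id := range profileIDs {
		rows, err := db.QueryContext(ctx, "SELECT key FROM profile_device_config WHERE profile_id = ?", id)
		if err != nil {
			return nil, err
		}
		for rows.Next() {
			var key string
			if err := rows.Scan(&key); err != nil {
				rows.Close()
				return nil, err
			}
			out[id] = append(out[id], key)
		}
		rows.Close()
		if err := rows.Err(); err != nil {
			return nil, err
		}
	}
	return out, nil
}

// Batched pattern (sketch only): fetch everything in one query and split the rows by profile ID afterwards.
func configsBatched(ctx context.Context, db *sql.DB, profileIDs []int) (map[int][]string, error) {
	out := make(map[int][]string)
	if len(profileIDs) == 0 {
		return out, nil
	}
	placeholders := strings.TrimSuffix(strings.Repeat("?,", len(profileIDs)), ",")
	args := make([]any, len(profileIDs))
	for i, id := range profileIDs {
		args[i] = id
	}
	rows, err := db.QueryContext(ctx, "SELECT profile_id, key FROM profile_device_config WHERE profile_id IN ("+placeholders+")", args...)
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	for rows.Next() {
		var id int
		var key string
		if err := rows.Scan(&id, &key); err != nil {
			return nil, err
		}
		out[id] = append(out[id], key)
	}
	return out, rows.Err()
}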

MggMuggins commented 5 months ago

@tomponline This was fixed with a fairly significant db refactor (see #10463 and #10183) that landed in LXD 5.5. This fun one-liner works great with LXD 5.21.1:

for p in p{1..1000}; do lxc profile create $p & done && lxc profile ls

It doesn't look to me like it's feasible to backport that set of fixes to 5.0.3; I'm guessing it won't be straightforward to come up with a separate patch for 5.0.3 either, although I haven't done much spelunking to confirm that. Let me know what you think the most reasonable course of action is here.

webdock-io commented 4 months ago

We upgraded our system to 5.21.1 and get this:

root@lxdremote:~# snap refresh lxd --channel=latest/stable
2024-06-04T07:46:48Z INFO Waiting for "snap.lxd.daemon.service" to stop.
lxd 5.21.1-2d13beb from Canonical✓ refreshed
root@lxdremote:~# nano /etc/hosts
root@lxdremote:~# lxc profile list
Error: Failed to fetch from "profile_device_config" table: Failed to fetch from "profile_device_config" table: context deadline exceeded
root@lxdremote:~# lxc --version
5.21.1 LTS

Sooo... Not fixed @MggMuggins or am I missing something?

hamistao commented 2 days ago

@webdock-io Hi! Some news on this :D

I managed to reproduce the issue by simulating network latency between two local VMs; I suspect this is why @MggMuggins's reproducer did not quite catch the problem.

tc qdisc replace dev enp5s0 root netem delay 100ms # simulate latency of 100ms on enp5s0 interface
for p in p{1..200}; do lxc profile create $p & done && lxc profile ls # this now results in a timeout

Mind that, in my reproduction, we only get a timeout when querying from a non-leader LXD cluster member. That makes sense, since all queries on the leader happen locally, so no latency. Could you confirm this also applies to your case?

I suspect this is happening because we make a separate database query for each profile to populate the usedBy field. I am working on a fix for this now and, if my theory is correct, we should have it merged and working soon. The fix will then be backported to 5.21 in a few more days. Cheers :)
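
As a back-of-the-envelope illustration (assuming one extra round trip per profile, which is the suspected pattern above): 200 profiles × 100 ms of simulated latency ≈ 20 s of sequential query time before any real work, which is more than enough to exhaust the transaction deadline behind the "context deadline exceeded" error.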

webdock-io commented 2 days ago

Thanks for your efforts. However, we've essentially switched almost 100% of our infrastructure to Incus by now, where this issue has been solved for ages (or rather, about a day after we reported it there).

This huge wait for bug fixes in LXD was a primary reason we switched, as it's untenable for production workloads like ours.

Anyway, I believe the issue did not stem from network latency, as this was all happening on a single instance and not a cluster. I believe it was solved in Incus by simply refactoring database code to reduce lookups, doing some caching, things of that nature. But I really don't know the details, you'd have to check the Incus source for that :)

hamistao commented 2 days ago

Will do! In any case, thanks for your report and for your availability, we will proceed with the fix all the same.

hamistao commented 1 day ago

@tomponline This problem actually relates to a timeout when listing profiles in a standalone environment.

To fix this, Incus just increased the timeout for transactions, as can be seen here. The other improvements for listing profiles in the same PR have already been in LXD for quite some time. If we don't want to go down that road, I think we can just close this.
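
For reference, the "context deadline exceeded" error comes from a Go context timeout that all queries in the transaction share. A minimal sketch of where such a deadline sits (not LXD's or Incus's actual code, and the 10 s value is purely illustrative):

package main

import (
	"context"
	"database/sql"
	"time"
)

// Sketch only: every query made inside the transaction shares one deadline, so many
// sequential per-profile queries can exhaust it even if each individual query is fast.
func listProfiles(db *sql.DB) error {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second) // illustrative value
	defer cancel()

	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback()

	rows, err := tx.QueryContext(ctx, "SELECT id, name FROM profiles")
	if err != nil {
		return err // surfaces as "context deadline exceeded" once the deadline passes
	}
	defer rows.Close()

	// ... scan rows and run any follow-up queries with the same ctx ...

	return tx.Commit()
}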

I plan on following up on the discussed fix to efficiently populate the usedBy field, but keep in mind this is a separate problem that was uncovered while investigating this.

tomponline commented 1 day ago

To fix this, Incus just increased the timeout for transactions, as can be seen here. The other improvements for listing profiles in the same PR have already been in LXD for quite some time. If we don't want to go down that road, I think we can just close this.

I'd like to avoid increasing the timeout to 30s as that feels like just papering over the issue rather than fixing it to me.

Suggest instead we first try importing these:

https://github.com/lxc/incus/pull/1140 https://github.com/lxc/incus/pull/1314

hamistao commented 1 day ago

I'd like to avoid increasing the timeout to 30s as that feels like just papering over the issue rather than fixing it to me.

Yeah I agree

Suggest instead we first try importing these

Sure, I have seen those and they contain some caching logic that could be nice to have. But mind that caching alone would not fix this issue, so this is probably why they bumped their timeout.
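
To make that trade-off concrete, here is a generic Go sketch of the kind of caching being referred to (purely illustrative, not the logic from the Incus PRs above, and with a hypothetical schema): repeated lookups within one request hit a map instead of the database, which cuts round trips but does not remove the first, potentially slow, pass.

package main

import (
	"context"
	"database/sql"
	"sync"
)

// profileCache memoizes per-profile config lookups for the lifetime of one request.
type profileCache struct {
	mu      sync.Mutex
	entries map[int][]string
}

func newProfileCache() *profileCache {
	return &profileCache{entries: make(map[int][]string)}
}

// Config returns the cached config for a profile, querying the database only on a miss.
func (c *profileCache) Config(ctx context.Context, db *sql.DB, profileID int) ([]string, error) {
	c.mu.Lock()
	if cfg, ok := c.entries[profileID]; ok {
		c.mu.Unlock()
		return cfg, nil
	}
	c.mu.Unlock()

	rows, err := db.QueryContext(ctx, "SELECT key FROM profile_device_config WHERE profile_id = ?", profileID)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var cfg []string
	for rows.Next() {
		var key string
		if err := rows.Scan(&key); err != nil {
			return nil, err
		}
		cfg = append(cfg, key)
	}
	if err := rows.Err(); err != nil {
		return nil, err
	}

	c.mu.Lock()
	c.entries[profileID] = cfg
	c.mu.Unlock()
	return cfg, nil
}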

tomponline commented 1 day ago

But mind that caching alone would not fix this issue, so this is probably why they bumped their timeout.

What is the issue then (I mean the one from the OP that is happening on a single node, not the one you described when accessing from a non-leader over a slow network)?