webdock-io opened 5 months ago
This is probably due to inefficient queries. Let's look into making fewer queries that return the info needed.
How many profiles is "lots" in this case?
Thanks for the quick reply.
I actually don't know, as I can't list them. Maybe I could get a count from the dump, but I'd say we're in the hundreds if not 1K+.
Try doing lxd sql global 'select * from profiles'
or lxd sql global 'select count(*) from profiles'
Count gives me 616
just doing the select and dumping the table is pretty quick:
time lxd sql global 'select * from profiles'
... stuff
real 0m0.166s
user 0m0.044s
sys 0m0.064s
> Count gives me 616
> just doing the select and dumping the table is pretty quick:
> time lxd sql global 'select * from profiles'
> ... stuff
> real 0m0.166s
> user 0m0.044s
> sys 0m0.064s
Cool thanks.
Suspect it's doing a separate query to get each profile's config, rather than a single query with multiple profile IDs whose results are then separated per profile in LXD.
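If that suspicion is right, the fix is the classic N+1-query refactor. Here is a minimal sketch of the two patterns against an in-memory SQLite database; the table and column names are illustrative, not LXD's actual schema:

```python
import sqlite3

# Toy in-memory database; schema is illustrative, NOT LXD's actual schema.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE profiles (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE profiles_config (profile_id INTEGER, key TEXT, value TEXT);
INSERT INTO profiles VALUES (1, 'default'), (2, 'web');
INSERT INTO profiles_config VALUES
    (1, 'limits.cpu', '2'),
    (2, 'limits.cpu', '4');
""")

# N+1 pattern: one config query per profile (the suspected behaviour).
configs_n1 = {}
for (pid,) in db.execute("SELECT id FROM profiles"):
    rows = db.execute(
        "SELECT key, value FROM profiles_config WHERE profile_id = ?", (pid,)
    )
    configs_n1[pid] = dict(rows)

# Batched alternative: one query for all profiles, split up in application code.
configs_batched = {}
for pid, key, value in db.execute(
    "SELECT profile_id, key, value FROM profiles_config"
):
    configs_batched.setdefault(pid, {})[key] = value

print(configs_n1 == configs_batched)  # True: same data, one query instead of N+1
```

The batched form does constant work on the database side regardless of profile count, which matters most when each query carries network latency.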
@tomponline This was fixed with a fairly significant db refactor (see #10463 and #10183) that landed in LXD 5.5. This fun one-liner works great with LXD 5.21.1:
for p in p{1..1000}; do lxc profile create $p & done && lxc profile ls
It doesn't look to me like it's feasible to backport that set of fixes to 5.0.3; I'm guessing it won't be straightforward to come up with a separate patch for 5.0.3 either, although I haven't done much spelunking to confirm that. Let me know what you think the most reasonable course of action is here.
We upgraded our system to 5.21.1 and get this:
root@lxdremote:~# snap refresh lxd --channel=latest/stable
2024-06-04T07:46:48Z INFO Waiting for "snap.lxd.daemon.service" to stop.
lxd 5.21.1-2d13beb from Canonical✓ refreshed
root@lxdremote:~# nano /etc/hosts
root@lxdremote:~# lxc profile list
Error: Failed to fetch from "profile_device_config" table: Failed to fetch from "profile_device_config" table: context deadline exceeded
root@lxdremote:~# lxc --version
5.21.1 LTS
Sooo... Not fixed @MggMuggins or am I missing something?
@webdock-io Hi! Some news on this :D
I managed to reproduce the issue by simulating network latency between two local VMs; I suspect this is why @MggMuggins's reproducer did not quite catch the problem.
tc qdisc replace dev enp5s0 root netem delay 100ms # simulate latency of 100ms on enp5s0 interface
for p in p{1..200}; do lxc profile create $p & done && lxc profile ls # this now results in a timeout
Mind that, in my reproduction, we only get a timeout when querying from a non-leader LXD cluster member. That makes sense, since all queries on the leader happen locally, so no latency. Could you confirm this also applies to your case?
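A back-of-envelope estimate (assuming one database round trip per profile, with the profile count and simulated latency mentioned in this thread) shows why a non-leader member would blow past any reasonable request timeout:

```python
# Rough estimate, not a measurement; assumes one DB round trip per profile.
profiles = 616  # profile count reported earlier in this thread
rtt = 0.100     # 100 ms latency simulated with the tc netem rule above

total = profiles * rtt
print(f"{total:.1f} s")  # 61.6 s of pure latency for a single listing
```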
I suspect this is happening because we make a separate database query for each profile to populate the usedBy field. I am working on a fix for this now and, if my theory is correct, we should have it merged and working soon. The fix will then be backported to 5.21 in a few more days. Cheers :)
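The per-profile usedBy queries can likewise be collapsed into a single grouped query. A sketch with a hypothetical schema (not LXD's actual tables or its eventual fix):

```python
import sqlite3

# Illustrative schema only, NOT LXD's actual tables.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE profiles (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE instances_profiles (instance TEXT, profile_id INTEGER);
INSERT INTO profiles VALUES (1, 'default'), (2, 'web'), (3, 'unused');
INSERT INTO instances_profiles VALUES ('c1', 1), ('c2', 1), ('c3', 2);
""")

# One round trip: aggregate the referencing instances per profile in SQL,
# instead of issuing a separate usedBy query for every profile.
used_by = {
    name: (refs.split(',') if refs else [])
    for name, refs in db.execute("""
        SELECT p.name, IFNULL(GROUP_CONCAT(ip.instance), '')
        FROM profiles p
        LEFT JOIN instances_profiles ip ON ip.profile_id = p.id
        GROUP BY p.id
        ORDER BY p.id
    """)
}
print({name: len(refs) for name, refs in used_by.items()})
# {'default': 2, 'web': 1, 'unused': 0}
```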
Thanks for your efforts. However, we've switched essentially all of our infrastructure to Incus by now, where this issue was solved ages ago (or, about a day after we reported it there).
This huge wait for bug fixes in LXD was a primary reason we switched, as it's untenable for production workloads like ours.
Anyway, I believe the issue did not stem from network latency, as this was all happening on a single instance and not a cluster. I believe it was solved in Incus by simply refactoring the database code to reduce lookups, adding some caching, things of that nature. But I really don't know the details, you'd have to check the Incus source for that :)
Will do! In any case, thanks for your report and for your availability, we will proceed with the fix all the same.
@tomponline This problem actually relates to a timeout when listing profiles in a standalone environment.
To fix this, Incus just increased the timeout for transactions, as can be seen here. The other improvements for listing profiles on the same PR have been in LXD for quite some time. If we don't want to go down that road, I think we can just close this.
I plan on following up on the discussed fix to efficiently populate the usedBy field, but keep in mind this is a separate problem that was uncovered while investigating this one.
> To fix this, Incus just increased the timeout for transactions, as can be seen here. The other improvements for listing profiles on the same PR have been in LXD for quite some time. If we don't want to go down that road, I think we can just close this.
I'd like to avoid increasing the timeout to 30s as that feels like just papering over the issue rather than fixing it to me.
Suggest instead we first try importing these:
https://github.com/lxc/incus/pull/1140 https://github.com/lxc/incus/pull/1314
> I'd like to avoid increasing the timeout to 30s as that feels like just papering over the issue rather than fixing it to me.
Yeah I agree
> Suggest instead we first try importing these
Sure, I have seen those and they contain some caching logic that could be nice to have. But mind that caching alone would not fix this issue, so this is probably why they bumped their timeout.
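To illustrate why caching alone can't cure a cold-path timeout, here is a toy sketch; all names are hypothetical, nothing here is LXD or Incus code:

```python
# Hypothetical cache sketch; not LXD or Incus code.
calls = 0

def slow_list_profiles():
    """Stand-in for the expensive profile listing."""
    global calls
    calls += 1
    return [f"p{i}" for i in range(616)]

cache = {}

def cached_list_profiles():
    if "profiles" not in cache:      # cold path: the full cost is still paid here
        cache["profiles"] = slow_list_profiles()
    return cache["profiles"]         # warm path: no further DB work

cached_list_profiles()  # first request does all the work (and could still time out)
cached_list_profiles()  # second request is served from the cache
print(calls)  # 1
```

The first (or post-expiry) request still pays the full query cost, so if that cost already exceeds the transaction timeout, a cache only hides the problem on warm requests.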
> But mind that caching alone would not fix this issue, so this is probably why they bumped their timeout.
What is the issue then (I mean the one from the OP that is happening on a single node, not the one you described when accessing from a non-leader over a slow network)?
Ubuntu Jammy LXD v5.0.3 and LXD 5.21.1
Running
lxc profile list
on a system with lots of profiles results in the following:

Running
lxd sql global .dump
returns almost immediately and lists all data in the database.

We have a real use case for supporting a lot of profiles in a remote (we allow our customers to build their own).
Adding and deleting individual profiles seems to work, although it's hard to confirm deletion when we can't list them with lxd.
Is there any way to increase the timeout in lxd to allow for listing of our (large, and will only grow larger) profiles list? We could start hacking away at sql queries, but I'd much rather be able to do an lxc profile list
(this use case came up as we actually wanted to make sure the list was cleaned up so any unused profiles were removed)