KLIM8D opened this issue 12 months ago
This comes up from time to time and is usually a process on your system periodically cleaning the tmp directory.
@tomponline Thanks! That seems about right; it could also be the reason why I experience this now after switching to the snap LXD version.

But I'm a bit confused: what should the contents of `/usr/lib/tmpfiles.d/snapd.conf` be? At the moment it contains:

```
D! /tmp/snap-private-tmp 0700 root root -
```

Should I change the `D!` to an `x`, or append the line `x /tmp/snap-private-tmp/snap.lxd` above it? What should be restarted afterwards?
I've added it above and tested with `SYSTEMD_LOG_LEVEL=debug /usr/bin/systemd-tmpfiles --clean`:

```
Ignoring entry D! "/tmp/snap-private-tmp" because --boot is not specified.
Running clean action for entry x /tmp/snap-private-tmp/snap.lxd
Running clean action for entry D /tmp
```
The systemd-tmpfiles man page states:

> If the exclamation mark ("!") is used, this line is only safe to execute during boot, and can break a running system. Lines without the exclamation mark are presumed to be safe to execute at any time, e.g. on package upgrades. systemd-tmpfiles(8) will take lines with an exclamation mark only into consideration, if the --boot option is given.
But does the third line mean that the snap.lxd directory would still be cleaned, even though `D!` was specified for its parent dir? Just trying to understand and make sure it won't happen again.
Looking at `man tmpfiles.d`, it says:

```
D /directory/to/create-and-remove mode user group cleanup-age -
x /path-or-glob/to/ignore/recursively - - - - -
```

> If the exclamation mark ("!") is used, this line is only safe to execute during boot, and can break a running system. Lines without the exclamation mark are presumed to be safe to execute at any time, e.g. on package upgrades. systemd-tmpfiles will take lines with an exclamation mark only into consideration, if the --boot option is given.
So you would want to add a new line to `/usr/lib/tmpfiles.d/snapd.conf`, or add a new file to `/usr/lib/tmpfiles.d/`.

Whether to use `D!` or `x`, I'm not sure, but looking now, LXD creates `/tmp/snap-private-tmp/snap.lxd/tmp/`, so it's possible that using `D! /tmp/snap-private-tmp/snap.lxd/` will still cause `/tmp/snap-private-tmp/snap.lxd/tmp` to be cleaned normally, so using `x` is probably the safer choice.
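Concretely, the entry being suggested might look like the following sketch of `/usr/lib/tmpfiles.d/snapd.conf`, with the `x` exclusion added above the existing `D!` line (paths taken from the snap layout discussed above; `x` excludes the path and everything below it from cleaning):

```
# /usr/lib/tmpfiles.d/snapd.conf (sketch)
x /tmp/snap-private-tmp/snap.lxd
D! /tmp/snap-private-tmp 0700 root root -
```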
Chatting with the snapd team, there's been a suggestion that perhaps LXD should use the XDG_RUNTIME_DIR path for its runtime state files (i.e. /run/user/0/snap.
Unfortunately it didn't solve the error first described; after a few days the "Error: Failed to retrieve PID..." is back.

Contents of `/usr/lib/tmpfiles.d/snapd.conf`:

```
x /tmp/snap-private-tmp/snap.lxd
D! /tmp/snap-private-tmp 0700 root root -
```
Trying console instead:

```
$ lxc console mongo-2
To detach from the console, press: <ctrl>+a q
Error: Error opening config file: "loading config file for the container failed"
Error: write /dev/pts/ptmx: file already closed
```
You reloaded the systemd-tmpfiles service after making the change?
Yes, pretty sure I did. Reloaded systemd with `systemctl daemon-reload`, just to be sure, and ran `systemctl restart systemd-tmpfiles-clean.timer`.
Just did it again on all LXD hosts. I'll restart all the faulty containers and see if it happens again within a day or two.
You could try stopping and disabling the service for a while too and see if that helps.
Instances restarted two days ago are now giving this error again. It's not all of the restarted containers that give the error; for some that were restarted, I'm still able to use `lxc exec`. So it's a bit inconsistent.
I'm beginning to have my doubts that it has anything to do with the systemd-tmpfiles-clean job. Is anything special happening with, or "inside", the snap package? It's an older version, but I never had this issue with non-snap LXD.
What's strange is that I have one LXD cluster host where not a single instance has this issue and containers have been running for many days. I haven't spotted the difference between that cluster host and the others yet, other than that it has the "database standby" role.
Found this thread (https://discuss.linuxcontainers.org/t/forkexec-failed-to-load-config-file/16220/9) and looked at my hosts to see if this was also the case here, i.e. that this `lxc.conf` file was missing. What was interesting is that on all 3 hosts where containers fail, this config file is missing in the `/var/snap/lxd/common/lxd/logs/CONTAINER` directory. And on the host where I don't see this problem, the `lxc.conf` file is there for all the containers.
On one of the failing LXD hosts, this is the content of `forkexec.log` in the container directory:

```
Failed to load config file /var/snap/lxd/common/lxd/logs/db-lb-3/lxc.conf for /var/snap/lxd/common/lxd/containers/db-lb-3
```

There are only `forkexec.log` and `lxc.log` in any of the container directories within `.../lxd/logs`:

```
# ls -l /var/snap/lxd/common/lxd/logs/db-lb-3/
/var/snap/lxd/common/lxd/logs/db-lb-3:
forkexec.log  lxc.log
```
Could this be the reason I'm getting this error?
After restarting the container, all these files are present again:

```
# ls -l /var/snap/lxd/common/lxd/logs/db-lb-3/
total 100
-rw------- 1 root root  6483 Aug 14 10:36 console.log
-rw-r--r-- 1 root root   122 Aug 14 10:10 forkexec.log
-rw-r--r-- 1 root root     0 Aug 14 10:36 forkstart.log
-rw-r----- 1 root root  2508 Aug 14 10:36 lxc.conf
-rw-r----- 1 root root 80587 Aug 14 10:36 lxc.log
-rw-r----- 1 root root  2939 Aug 14 10:36 lxc.log.old
```
We've also been experiencing the same problem in our environments. Basically, we created clusters, and even out of the box, if we leave the cluster alone for a while, we see the same error when trying to exec into the containers. We can reboot the containers to bash into them again, but that is not convenient in a production environment, for obvious reasons. It would be great if we could be given some instructions on how to go about troubleshooting this.
Manually restoring `lxc.conf` in `/var/snap/lxd/common/lxd/logs/<CONTAINER>` makes a faulty container that gives this error accept exec commands again. I just took the NIC values from `lxc config show` and grabbed the PID for the `lxc.hook.pre-start` line.

So the question is: why does this file suddenly disappear...?
Thank you @KLIM8D
We have the same issue and are thinking of a workaround for now: keep a copy of `lxc.conf` when a container starts, have a way of alerting when `lxc.conf` disappears, and put it back.
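A minimal sketch of that workaround, assuming the snap log path discussed in this thread; `BACKUPDIR` and the function name are my own invention, and in production you would run this from cron or a systemd timer rather than by hand:

```shell
#!/bin/sh
# Sketch: back up each container's lxc.conf, and restore it (with an alert
# on stderr) if it has disappeared. LOGDIR/BACKUPDIR are assumptions.

sync_lxc_confs() {
    LOGDIR="${LOGDIR:-/var/snap/lxd/common/lxd/logs}"
    BACKUPDIR="${BACKUPDIR:-/var/backups/lxc-conf}"
    mkdir -p "$BACKUPDIR"
    for dir in "$LOGDIR"/*/; do
        name=$(basename "$dir")
        if [ -f "${dir}lxc.conf" ]; then
            # File is present: refresh the backup copy.
            cp -p "${dir}lxc.conf" "$BACKUPDIR/$name.lxc.conf"
        elif [ -f "$BACKUPDIR/$name.lxc.conf" ]; then
            # File vanished: alert and put the saved copy back.
            echo "lxc.conf missing for $name, restoring" >&2
            cp -p "$BACKUPDIR/$name.lxc.conf" "${dir}lxc.conf"
        fi
    done
}

# e.g. run once a minute from a timer:
# sync_lxc_confs
```

Note that the loop only touches container directories that still exist, which matters given the observation later in this thread that the whole container directory can be recreated with only `forkexec.log` and `lxc.log` in it.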
To have it in a logs directory is certainly odd; a /run subdir would be more appropriate, I would say.

On top of that, we seem to have occasional issues with some log rotation in this directory. We had lxc.log files recreated with 0 bytes while the lxc monitor process kept the old file open, filling up /var, and only a restart of the container released the space of the already-deleted file.

I am not sure whether this misbehaviour of log rotation may also delete the lxc.conf in the same directory. I have not figured out where the log-rotation configuration for these log files is kept.
Apparently it is the entire contents, or at least all the container directories, in `/var/snap/lxd/common/lxd/logs/` that get cleaned up by something. That would be fine if it weren't for the fact that the `lxc.conf` is needed..
Tried setting a rule in auditd to see what deleted the `lxc.conf` file, and to my surprise, no container directory was in there. Running exec for either of the containers creates a directory for the container in `/var/snap/lxd/common/lxd/logs/`, but it only contains `forkexec.log` and `lxc.log`.
@petrosssit So maybe you should check in your workaround script whether the directory is there or not.
```
# ls /var/snap/lxd/common/lxd/logs/
dnsmasq.lxdbr0.log  lxd.log
```

After exec for the test-redis container:

```
# ls /var/snap/lxd/common/lxd/logs/*
/var/snap/lxd/common/lxd/logs/dnsmasq.lxdbr0.log  /var/snap/lxd/common/lxd/logs/lxd.log

/var/snap/lxd/common/lxd/logs/test-redis:
forkexec.log  lxc.log
```
Same issue with LXC/D 5.0.2 from Debian Bookworm ... lxc.conf (written to /var/log/lxd/[container]/lxc.conf) disappears.

Any more answers on this?
To be more precise: on my setup, I work with Ceph as the storage backend.
To go further, I've added a script that periodically checks for the presence of lxc.conf (1 test / 10 s) and keeps all API logs: the file disappears when the cleanup task for expired backups starts.
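The periodic presence check described here could be sketched as follows; the function name and `LOGDIR` default are assumptions (the Debian path from this comment; the snap uses `/var/snap/lxd/common/lxd/logs` instead):

```shell
#!/bin/sh
# Sketch: report whether a container's lxc.conf is currently present,
# suitable for polling every 10 seconds as described above.
LOGDIR="${LOGDIR:-/var/log/lxd}"

check_conf() {
    # Prints "present" or "missing" for the given container name.
    if [ -f "$LOGDIR/$1/lxc.conf" ]; then
        echo "present"
    else
        echo "missing"
    fi
}

# Poll loop (1 test / 10 s), logging a timestamp when the file vanishes:
# while :; do
#     [ "$(check_conf mycontainer)" = missing ] && echo "$(date) gone" >&2
#     sleep 10
# done
```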
I've also seen this commit, which is not present in the refactored code (protection of lxc.conf during delete operations): https://github.com/canonical/lxd/pull/4010/files?diff=split&w=0
Hope we can find out more about this bug...
Nice workaround to regenerate lxc.conf: `lxc config show [container] > /var/log/lxd/[container]/lxc.conf`
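Applied across all containers, this regeneration workaround might look like the sketch below; the function name is mine, and `LOGDIR` is an assumption (Debian path from this comment; adjust for the snap layout):

```shell
#!/bin/sh
# Sketch: regenerate a missing lxc.conf from `lxc config show` for every
# container directory under LOGDIR.

regenerate_missing_confs() {
    for dir in "${LOGDIR:-/var/log/lxd}"/*/; do
        name=$(basename "$dir")
        if [ ! -f "${dir}lxc.conf" ]; then
            echo "regenerating lxc.conf for $name" >&2
            lxc config show "$name" > "${dir}lxc.conf"
        fi
    done
}

# regenerate_missing_confs
```

Note that `lxc config show` output is the instance config in YAML rather than a raw liblxc config, so treat this as a stopgap to get `lxc exec` working again, as described above, not as a faithful reconstruction of the original file.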
@nfournil A simple empty lxc.conf file works too.

```
touch /var/snap/lxd/common/lxd/logs/[container]/lxc.conf
```

¯\_(ツ)_/¯

Edit: `lxc exec` with an empty lxc.conf works, but data is later added to that lxc.conf file (right after I did an lxc exec). So it's not really empty the whole time.
> I've also seen this commit, which is not present in the refactored code (protection of lxc.conf during delete operations): https://github.com/canonical/lxd/pull/4010/files?diff=split&w=0

Hi, please could you clarify what you mean here?
If you go to this file (which seems to be the 5.x location for this function), you find the protection for the lxc.log file, but not for lxc.conf anymore ...
I'd added more aggressive logging (1 check / s whether the lxc.conf file exists) and I can confirm the file disappears when this API operation runs:

```
location: srv05-r2b-fl1
metadata:
  context:
    class: task
    description: Expiring log files
    operation: 42d4066c-3b40-4649-867b-637556984e7a
    project: ""
  level: debug
  message: Started operation
timestamp: "2023-11-08T11:00:47.092443683+01:00"
type: logging
```

Less than half a second later, the file doesn't exist anymore...
@mihalicyn I think if we cherry-pick this https://github.com/lxc/incus/pull/361 it should help.
Looks like `instanceLogDelete` is preventing `lxc.conf` from being deleted
One solution to this might be to stop using tmp and log dirs for .conf files, similar to https://github.com/lxc/incus/pull/426
5.15.0-76-generic #83~20.04.1-Ubuntu
I have a cluster with 4 nodes. There are 44 containers spread across the nodes, and at the moment I'm getting this error message trying to run `lxc exec` on 19 of the containers. Besides 3 containers, the other 41 are using Ceph RBD for storage. I can start the container and exec works for a while, and then all of a sudden I get the error message. Restarting the container works and I can get a shell or whatever again, but then after X amount of hours it happens again.
```
lxc info
lxc config show amqp-1 --expanded
lxc monitor
```