Closed FrostyX closed 7 months ago
At the moment the disk space is fine
But I will try to setup some rotation for /var/log/messages
that nirik had to do manually.
But I will try to setup some rotation for /var/log/messages that nirik had to do manually.
This might solve it https://pagure.io/fedora-infra/ansible/pull-request/1499
Note that it's "fine" now because I keep deleting logs so they stop alerting.
I'll just ack them so you can look more closely.
This is not just that virthost, its all of them... it seems like the backend is connecting a LOT so the messages are mostly ssh connections...
Note that it's "fine" now because I keep deleting logs so they stop alerting.
Thank you @nirik, I didn't know. I thought you only deleted it once.
This is not just that virthost, its all of them... it seems like the backend is connecting a LOT so the messages are mostly ssh connections...
It looks like resalloc went crazy. There are tens of
INFO 2023-07-06 14:39:02,011 PID:3939926:140290543318720(Allocator-10992094) Allocating copr_hv_x86_64_01_prod_10992094_20230706_143901 finished => ENDED
INFO 2023-07-06 14:39:02,015 PID:3939926:140289436006080(Allocator-10992093) Allocating copr_hv_x86_64_01_prod_10992093_20230706_143901 finished => ENDED
INFO 2023-07-06 14:39:02,021 PID:3939926:140291608667840(Allocator-10992095) Allocating copr_hv_x86_64_01_prod_10992095_20230706_143901 finished => ENDED
rows in /var/log/resallocserver/main.log
every second. When stopped resalloc for a few seconds, the SSH connections stopped appearing in the hypervisor log.
I am trying to investigate what is wrong with resalloc.
Ah, that's because all of those allocations fail
INFO:root:Hypervisor ID: 1
INFO:root:Using an IPv6 ending with ':c148'
DEBUG:root:Logging to qemu+ssh://copr@vmhost-x86-copr02.rdu-cc.fedoraproject.org/system
DEBUG:root:cmd: mkisofs -o /tmp/tmpdhja99z3/config.iso -V eimg_config -r -J --quiet /tmp/tmpdhja99z3/config
xorriso 1.5.4 : RockRidge filesystem manipulator, libburnia project.
DEBUG:root: -> exit_status=0, time=0.105s
INFO:root:using config image /tmp/tmpdhja99z3/config.iso
DEBUG:root:cmd: virsh -c qemu+ssh://copr@vmhost-x86-copr02.rdu-cc.fedoraproject.org/system vol-create-as images copr_hv_x86_64_02_prod_10992112_20230706_143902_config 1M
error: failed to connect to the hypervisor
error: End of file while reading data: Warning: Permanently added 'vmhost-x86-copr02.rdu-cc.fedoraproject.org' (ED25519) to the list of known hosts.^M
virt-ssh-helper: cannot connect to '/var/run/libvirt/libvirt-sock': Failed to connect socket to '/var/run/libvirt/libvirt-sock': Connection refused: Input/output error
DEBUG:root: -> exit_status=1, time=0.608s
DEBUG:root:Cleaning up ...
Traceback (most recent call last):
File "/var/lib/resallocserver/provision/libvirt-new", line 461, in <module>
_main()
File "/var/lib/resallocserver/provision/libvirt-new", line 452, in _main
spawner.spawn()
File "/var/lib/resallocserver/provision/libvirt-new", line 342, in spawn
self.create_volume_from_iso(config_vol_name, '1M', config_iso,
File "/var/lib/resallocserver/provision/libvirt-new", line 226, in create_volume_from_iso
self.alloc_disk(name, prealloc_size, pool)
File "/var/lib/resallocserver/provision/libvirt-new", line 152, in alloc_disk
raise Exception("can't create '{0}' disk".format(name))
Exception: can't create 'copr_hv_x86_64_02_prod_10992112_20230706_143902_config' disk
Strange. It looks like they are in a weird state. :(
If you like I can upgrade/reboot them and we can see if that clears things up? Or do you want to investigate more first?
The libvirtd
service died 4 days ago, and therefore all the virsh
commands failed. I don't know the reason why it died though
● libvirtd.service - Virtualization daemon
Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Sun 2023-07-02 06:36:27 GMT; 4 days ago
Docs: man:libvirtd(8)
https://libvirt.org
Process: 3321797 ExecStart=/usr/sbin/libvirtd $LIBVIRTD_ARGS (code=exited, status=6)
Main PID: 3321797 (code=exited, status=6)
Tasks: 2 (limit: 32768)
Memory: 64.3G
CGroup: /system.slice/libvirtd.service
├─3506 /usr/sbin/dnsmasq --conf-file=/var/lib/libvirt/dnsmasq/default.conf --leasefile-ro --dhcp-script=/usr/libexec/libvirt_leaseshelper
└─3509 /usr/sbin/dnsmasq --conf-file=/var/lib/libvirt/dnsmasq/default.conf --leasefile-ro --dhcp-script=/usr/libexec/libvirt_leaseshelper
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
I would say it was OOM kill but since it died on 3/4 hypervisors at the same time, IMHO something else happened. We discussed with @nirik that it would be a good idea to update and reboot those machines, but I would like to do that once @praiskup is back from vacation since the libvirt setup is a bit tricky for me.
Until then, I did systemctl start libvirtd.service
and it seems to work fine. New builders are spawning, no nonsense in /var/log/messages
.
It seems we don't experience /var/log/messages message floods nowadays. Closing as WORKSFORME