fedora-copr / copr

RPM build system - upstream for https://copr.fedorainfracloud.org/
113 stars 62 forks source link

Too big /var/log/messages on vmhost-x86-copr01 #2799

Closed FrostyX closed 7 months ago

FrostyX commented 1 year ago
<zodbot>  PROBLEM -
    vmhost-x86-copr01.rdu-cc.fedoraproject.org/Disk_Space_/ is
    WARNING: DISK WARNING - free space: / 4326 MiB (14.28% inode=59%):
    (noc01)
<nirik>  so thats because /var/log/messages is really big... it seems
    like lots of copr sessions. I'll rotate the file, but FrostyX or
    other copr folks would need to look into the underlying
    problem. I'll compress the file for now.
FrostyX commented 1 year ago

At the moment the disk space is fine

Image

But I will try to setup some rotation for /var/log/messages that nirik had to do manually.

FrostyX commented 1 year ago

But I will try to setup some rotation for /var/log/messages that nirik had to do manually.

This might solve it https://pagure.io/fedora-infra/ansible/pull-request/1499

nirik commented 1 year ago

Note that it's "fine" now because I keep deleting logs so they stop alerting.

I'll just ack them so you can look more closely.

This is not just that virthost, its all of them... it seems like the backend is connecting a LOT so the messages are mostly ssh connections...

FrostyX commented 1 year ago

Note that it's "fine" now because I keep deleting logs so they stop alerting.

Thank you @nirik, I didn't know. I thought you only deleted it once.

This is not just that virthost, its all of them... it seems like the backend is connecting a LOT so the messages are mostly ssh connections...

It looks like resalloc went crazy. There are tens of

 INFO 2023-07-06 14:39:02,011 PID:3939926:140290543318720(Allocator-10992094) Allocating copr_hv_x86_64_01_prod_10992094_20230706_143901 finished => ENDED
 INFO 2023-07-06 14:39:02,015 PID:3939926:140289436006080(Allocator-10992093) Allocating copr_hv_x86_64_01_prod_10992093_20230706_143901 finished => ENDED
 INFO 2023-07-06 14:39:02,021 PID:3939926:140291608667840(Allocator-10992095) Allocating copr_hv_x86_64_01_prod_10992095_20230706_143901 finished => ENDED

rows in /var/log/resallocserver/main.log every second. When stopped resalloc for a few seconds, the SSH connections stopped appearing in the hypervisor log.

I am trying to investigate what is wrong with resalloc.

FrostyX commented 1 year ago

Ah, that's because all of those allocations fail

INFO:root:Hypervisor ID: 1
INFO:root:Using an IPv6 ending with ':c148'
DEBUG:root:Logging to qemu+ssh://copr@vmhost-x86-copr02.rdu-cc.fedoraproject.org/system
DEBUG:root:cmd: mkisofs -o /tmp/tmpdhja99z3/config.iso -V eimg_config -r -J --quiet /tmp/tmpdhja99z3/config
xorriso 1.5.4 : RockRidge filesystem manipulator, libburnia project.

DEBUG:root: -> exit_status=0, time=0.105s
INFO:root:using config image /tmp/tmpdhja99z3/config.iso
DEBUG:root:cmd: virsh -c qemu+ssh://copr@vmhost-x86-copr02.rdu-cc.fedoraproject.org/system vol-create-as images copr_hv_x86_64_02_prod_10992112_20230706_143902_config 1M
error: failed to connect to the hypervisor
error: End of file while reading data: Warning: Permanently added 'vmhost-x86-copr02.rdu-cc.fedoraproject.org' (ED25519) to the list of known hosts.^M
virt-ssh-helper: cannot connect to '/var/run/libvirt/libvirt-sock': Failed to connect socket to '/var/run/libvirt/libvirt-sock': Connection refused: Input/output error
DEBUG:root: -> exit_status=1, time=0.608s
DEBUG:root:Cleaning up ...
Traceback (most recent call last):
  File "/var/lib/resallocserver/provision/libvirt-new", line 461, in <module>
    _main()
  File "/var/lib/resallocserver/provision/libvirt-new", line 452, in _main
    spawner.spawn()
  File "/var/lib/resallocserver/provision/libvirt-new", line 342, in spawn
    self.create_volume_from_iso(config_vol_name, '1M', config_iso,
  File "/var/lib/resallocserver/provision/libvirt-new", line 226, in create_volume_from_iso
    self.alloc_disk(name, prealloc_size, pool)
  File "/var/lib/resallocserver/provision/libvirt-new", line 152, in alloc_disk
    raise Exception("can't create '{0}' disk".format(name))
Exception: can't create 'copr_hv_x86_64_02_prod_10992112_20230706_143902_config' disk
nirik commented 1 year ago

Strange. It looks like they are in a weird state. :(

If you like I can upgrade/reboot them and we can see if that clears things up? Or do you want to investigate more first?

FrostyX commented 1 year ago

The libvirtd service died 4 days ago, and therefore all the virsh commands failed. I don't know the reason why it died though

● libvirtd.service - Virtualization daemon
   Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Sun 2023-07-02 06:36:27 GMT; 4 days ago
     Docs: man:libvirtd(8)
           https://libvirt.org
  Process: 3321797 ExecStart=/usr/sbin/libvirtd $LIBVIRTD_ARGS (code=exited, status=6)
 Main PID: 3321797 (code=exited, status=6)
    Tasks: 2 (limit: 32768)
   Memory: 64.3G
   CGroup: /system.slice/libvirtd.service
           ├─3506 /usr/sbin/dnsmasq --conf-file=/var/lib/libvirt/dnsmasq/default.conf --leasefile-ro --dhcp-script=/usr/libexec/libvirt_leaseshelper
           └─3509 /usr/sbin/dnsmasq --conf-file=/var/lib/libvirt/dnsmasq/default.conf --leasefile-ro --dhcp-script=/usr/libexec/libvirt_leaseshelper

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

I would say it was OOM kill but since it died on 3/4 hypervisors at the same time, IMHO something else happened. We discussed with @nirik that it would be a good idea to update and reboot those machines, but I would like to do that once @praiskup is back from vacation since the libvirt setup is a bit tricky for me.

Until then, I did systemctl start libvirtd.service and it seems to work fine. New builders are spawning, no nonsense in /var/log/messages.

praiskup commented 7 months ago

It seems we don't experience /var/log/messages message floods nowadays. Closing as WORKSFORME