fedora-sysv / initscripts

📜 Scripts to bring up network interfaces and legacy utilities in Fedora.
GNU General Public License v2.0

remove rename_device_lock when process does not exist #361

Closed wangxp006 closed 3 years ago

wangxp006 commented 3 years ago

If rename_device is killed during the fopen call in take_lock, LOCKFILE is not removed by the signal handler. When rename_device is executed again, LOCKFILE exists but the process whose PID is recorded in LOCKFILE is not running, so take_lock never returns. Resolves: #1916545

jamacku commented 3 years ago

Related to #360

wangxp006 commented 3 years ago

Hi @jamacku, is this PR OK? If there is any problem, please let me know. Thank you.

jamacku commented 3 years ago

Hi @wangxp006, may I ask how you discovered this issue? Thank you.

wangxp006 commented 3 years ago

Hi @jamacku,

At first we found error logs from systemd-udevd like these:

systemd-udevd[1259]: seq 2780 'devices/pci0000:00/0000:00:02.5/0000:08:00,0/virtio8/net/eth1' is taking a long time
systemd-udevd[1259]: seq 2780 'devices/pci0000:00/0000:00:02.5/0000:08:00,0/virtio8/net/eth1' killed
systemd-udevd[1259]: worker [20644] terminated by signal 9 (kill)
systemd-udevd[1259]: worker [20644] failed while handling 'devices/pci0000:00/0000:00:02.5/0000:08:00,0/virtio8/net/eth1'

We found that systemd-udevd should have renamed eth1 via rename_device, but it failed.

We then executed /lib/udev/rename_device eth1 manually, and rename_device never exited. We found that rename_device loops forever in the take_lock function. The file /dev/.rename_device.lock exists, but the PID stored in it does not belong to any running process.

wangxp006 commented 3 years ago

We found that if the PID in /dev/.rename_device.lock does not belong to a running process, /lib/udev/rename_device never executes successfully.
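Editor's note: the stale-lock condition described above can be detected by reading the PID from the lock file and probing it with signal 0. The following is a minimal sketch under stated assumptions, not the code from this PR or from rename_device itself; the function name lock_holder_alive and its return convention are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper (not part of initscripts).
 * Returns  1 if the PID recorded in lock_path belongs to an existing process,
 *          0 if the lock looks stale (no parseable PID, or no such process),
 *         -1 if the lock file cannot be opened. */
int lock_holder_alive(const char *lock_path)
{
    FILE *f = fopen(lock_path, "r");
    long pid = 0;
    int parsed;

    if (f == NULL)
        return -1;
    parsed = fscanf(f, "%ld", &pid);
    fclose(f);

    /* An empty or garbled file means the writer died before recording its PID. */
    if (parsed != 1 || pid <= 0)
        return 0;

    /* Signal 0 sends nothing; it only performs existence and permission checks. */
    if (kill((pid_t)pid, 0) == 0)
        return 1;

    /* ESRCH: no such process. EPERM: the process exists but belongs to another user. */
    return errno == ESRCH ? 0 : 1;
}
```

A caller could remove the lock file when this returns 0 and then retry. Note that the check is inherently racy: the PID may be reused by an unrelated process, and two instances may both decide that the lock is stale, which is one reason to treat it as a workaround rather than a complete fix.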

lnykryn commented 3 years ago

I must say I don't like the approach here. It is only a workaround for a different issue.

Based on the logs I can't tell what happened. I see two scenarios. Either this is the first run of rename_device and it got killed by udev because something went wrong during the parsing of files, in which case we need to find the real culprit there. Or this is the second run, after something went south with the first one, which might mean that our timeout inside rename_device is bigger than the udev timeout itself, and we should fix that. Could you post the full log from the machine where you see this issue?

But we should also consider whether we need the lock there at all. In the past, we tried to do some magic around swapping names that required a synchronous approach, but we gave up on that effort.
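Editor's note: the second scenario above, where the wait inside rename_device outlasts the udev timeout, can be avoided by bounding the wait explicitly. The following is a minimal sketch, assuming the lock is taken by exclusive file creation; the function name take_lock_bounded, the 10 ms retry interval, and the max_wait_ms parameter are hypothetical and are not taken from rename_device.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical sketch (not the initscripts implementation).
 * Tries to create lock_path exclusively, retrying every 10 ms, and gives up
 * once max_wait_ms has elapsed. Returns an open file descriptor on success,
 * or -1 on timeout or on any error other than EEXIST. */
int take_lock_bounded(const char *lock_path, long max_wait_ms)
{
    const struct timespec retry_delay = { 0, 10 * 1000 * 1000 }; /* 10 ms */
    struct timespec start, now;
    long elapsed_ms;
    int fd;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (;;) {
        /* O_CREAT | O_EXCL fails with EEXIST if the lock file already exists. */
        fd = open(lock_path, O_WRONLY | O_CREAT | O_EXCL, 0644);
        if (fd >= 0)
            return fd;
        if (errno != EEXIST)
            return -1;

        clock_gettime(CLOCK_MONOTONIC, &now);
        elapsed_ms = (now.tv_sec - start.tv_sec) * 1000L
                   + (now.tv_nsec - start.tv_nsec) / 1000000L;
        if (elapsed_ms >= max_wait_ms)
            return -1; /* give up instead of waiting indefinitely */

        nanosleep(&retry_delay, NULL);
    }
}
```

Choosing max_wait_ms well below the udev event timeout would let the helper fail on its own and report an error before udev kills the worker with signal 9, which no signal handler can catch to clean up the lock file.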

wangxp006 commented 3 years ago

Since you do not like this PR, I'll just close it.