OpenFabrics / fsdp_docs


rdma_setup.sh failing and causing kernel panic #103

Closed JSpewock closed 2 years ago

JSpewock commented 2 years ago

Recent CKI testing has found that the rdma_setup.sh script occasionally fails, producing many errors resembling /etc/sysconfig/network-scripts/*: No such file or directory. After the setup script fails, the system restarts and ends in a kernel panic. An example of a job where the setup script fails can be found here, and the console logs containing the errors can be found here.

dledford commented 2 years ago

There are two separate issues here. I'm going to open a new issue for one, and we can handle the other in this thread. The first issue is that the rdma_setup.sh scripts rely on the system using the old SysV init network files (also known as rhconfig network files), and Fedora Rawhide (and maybe Fedora 36 too) has switched to NetworkManager connections without those files. That is now being handled in issue #104.
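To make the first issue concrete, here is a minimal sketch of how a host could be checked for which network configuration style it carries. It is not taken from rdma_setup.sh; the function name `detect_network_config` and its return values are hypothetical, and it assumes legacy files are named `ifcfg-*` under /etc/sysconfig/network-scripts and that NetworkManager keyfiles use the `.nmconnection` suffix under /etc/NetworkManager/system-connections.

```python
from pathlib import Path


def detect_network_config(root="/"):
    """Report which style of network configuration files exists under root.

    Returns one of "ifcfg", "keyfile", "both", or "none". The name and the
    return values are illustrative only, not part of rdma_setup.sh.
    """
    base = Path(root)
    # Legacy location that the errors in this issue refer to.
    ifcfg_dir = base / "etc/sysconfig/network-scripts"
    # Assumed location of NetworkManager keyfile connection profiles.
    nm_dir = base / "etc/NetworkManager/system-connections"

    has_ifcfg = ifcfg_dir.is_dir() and any(ifcfg_dir.glob("ifcfg-*"))
    has_keyfile = nm_dir.is_dir() and any(nm_dir.glob("*.nmconnection"))

    if has_ifcfg and has_keyfile:
        return "both"
    if has_ifcfg:
        return "ifcfg"
    if has_keyfile:
        return "keyfile"
    return "none"


if __name__ == "__main__":
    print(detect_network_config())
```

A setup script that only understands the legacy files could bail out early with a clear message when this reports "keyfile" or "none", rather than emitting one error per missing file.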

The second problem is the kernel panic. It is not related to the first, and instead appears to be specific to the debug options used in Rawhide kernels. To fix it, I think we need to find out exactly what is causing the panic and then use a command line option to disable it. It appears to be coming from the audit subsystem, which would mean SELinux. However, it could also be any number of other things, especially since there appear to be a number of debugging options turned on in the Rawhide kernel, and those can sometimes cause more problems than they solve. But without getting a system up and running to see what the kernel's debug options are, it can be a guessing game to figure out which options to disable (and whether they can even be disabled from the command line or require a different kernel to be built).
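Once a system is up, one way to see which debug options a kernel was built with is to read its build configuration. The sketch below is an illustration only: the helper name `enabled_debug_options` is hypothetical, matching on the substring DEBUG is a rough heuristic, and it assumes the configuration is installed as /boot/config-<kernel release>, which may not hold on every image.

```python
import platform
from pathlib import Path


def enabled_debug_options(config_text):
    """Return the sorted names of enabled options whose name contains DEBUG.

    Only options set to "y" (built in) or "m" (module) are reported;
    commented-out "is not set" lines are skipped.
    """
    options = []
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if "DEBUG" in name and value in ("y", "m"):
            options.append(name)
    return sorted(options)


if __name__ == "__main__":
    # Assumed location of the running kernel's build configuration.
    config_path = Path("/boot") / f"config-{platform.release()}"
    if config_path.exists():
        for option in enabled_debug_options(config_path.read_text()):
            print(option)
    else:
        print(f"{config_path} not found")
```

The resulting list only narrows the search. Whether a given option can be switched off from the kernel command line still has to be checked per option.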

JSpewock commented 2 years ago

I just updated the Rawhide image we were using and provisioned a system with it, which seems to run fine. I'm unsure whether this would fix the kernel panic issues, but it might be worth a try if it was a common problem people were facing.

JSpewock commented 2 years ago

Adding selinux=0 to the kernel options for the hosts fixed the kernel panic that was occurring on them.
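To confirm that a provisioned host actually booted with the option, the kernel command line can be inspected. This is a minimal sketch and not part of the fix itself: the helper name `has_kernel_arg` is hypothetical, and how the option was added to the hosts' kernel options is not shown in this thread.

```python
from pathlib import Path


def has_kernel_arg(cmdline, arg):
    """Return True if arg appears as a whole token in the kernel command line."""
    return arg in cmdline.split()


if __name__ == "__main__":
    # /proc/cmdline holds the arguments the running kernel was booted with.
    proc_cmdline = Path("/proc/cmdline")
    if proc_cmdline.exists():
        booted_with = has_kernel_arg(proc_cmdline.read_text(), "selinux=0")
        print("selinux=0 present:", booted_with)
```

Matching whole tokens avoids false positives from arguments that merely contain the same text, such as a hypothetical `foo_selinux=0`.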