SUSE-Enceladus / azure-li-services

Azure Large Instance Services
GNU General Public License v3.0

Issue with Production Build SLES12-SP4-SAP-Azure-VLI-BYOS.x86_64-0.0.24-Production-Build1.4 #241

Closed: jaiawasthi closed this issue 4 years ago

jaiawasthi commented 4 years ago

Hi Marcus,

After booting the image, we install Software Foundation (SFS), an HPE-recommended software package, on the host. [We always do that; it's not specific to this image.] During testing, after installing SFS version 2.0 and rebooting, we find that the system enters emergency mode.


[FAILED] Failed to start Load Kernel Modules.
See 'systemctl status systemd-modules-load.service' for details.
[  OK  ] Stopped Entropy Daemon based on the HAVEGE algorithm.
[  OK  ] Started Entropy Daemon based on the HAVEGE algorithm.
         Starting Apply Kernel Variables...
         Starting udev Coldplug all Devices...
         Starting Create Static Device Nodes in /dev...
         Starting Load/Save Random Seed...
         Starting Flush Journal to Persistent Storage...
[  OK  ] Started Create Static Device Nodes in /dev.
[  OK  ] Started Load/Save Random Seed.
[FAILED] Failed to start Apply Kernel Variables.
See 'systemctl status systemd-sysctl.service' for details.
         Starting udev Kernel Device Manager...
[  OK  ] Started udev Kernel Device Manager.
[  OK  ] Started Flush Journal to Persistent Storage.
[  OK  ] Started udev Coldplug all Devices.
         Starting udev Wait for Complete Device Initialization...
[  OK  ] Found device /dev/ttyS0.
[  OK  ] Created slice system-cryptctl\x2dauto\x2dunlock.slice.
         Starting Load Kernel Modules...
         Starting Show Plymouth Boot Screen...
[FAILED] Failed to start Load Kernel Modules.
See 'systemctl status systemd-modules-load.service' for details.
         Starting Apply Kernel Variables...
[FAILED] Failed to start Apply Kernel Variables.
See 'systemctl status systemd-sysctl.service' for details.
[  OK  ] Started Show Plymouth Boot Screen.
[  OK  ] Started Forward Password Requests to Plymouth Directory Watch.
         Starting Load Kernel Modules...
[FAILED] Failed to start Load Kernel Modules.
See 'systemctl status systemd-modules-load.service' for details.
         Starting Load Kernel Modules...
[FAILED] Failed to start Load Kernel Modules.
See 'systemctl status systemd-modules-load.service' for details.
         Starting Apply Kernel Variables...
[  OK  ] Found device /dev/disk/by-uuid/2760-CA11.
         Starting Load Kernel Modules...
[FAILED] Failed to start Apply Kernel Variables.
See 'systemctl status systemd-sysctl.service' for details.
[FAILED] Failed to start Load Kernel Modules.
 See 'systemctl status systemd-modules-load.service' for details.
[  OK  ] Found device /dev/disk/by-label/SWAP.
[FAILED] Failed to start Load Kernel Modules.
See 'systemctl status systemd-modules-load.service' for details.
         Starting Apply Kernel Variables...
         Activating swap /dev/disk/by-label/SWAP...
[FAILED] Failed to start Apply Kernel Variables.
See 'systemctl status systemd-sysctl.service' for details.
[  OK  ] Activated swap /dev/disk/by-label/SWAP.
[FAILED] Failed to start Load Kernel Modules.
See 'systemctl status systemd-modules-load.service' for details.
         Starting Apply Kernel Variables...
[  OK  ] Reached target Swap.
[FAILED] Failed to start Apply Kernel Variables.
See 'systemctl status systemd-sysctl.service' for details.
[  OK  ] Started udev Wait for Complete Device Initialization.
         Starting Device-Mapper Multipath Device Controller...
[  OK  ] Started Device-Mapper Multipath Device Controller.
[  OK  ] Reached target Local File Systems (Pre).
         Mounting /boot/efi...
[FAILED] Failed to mount /boot/efi.
See 'systemctl status boot-efi.mount' for details.
[DEPEND] Dependency failed for Local File Systems.
[DEPEND] Dependency failed for Corrected mac... check interrupt manager daemon.
         Starting Restore /run/initramfs on shutdown...
[  OK  ] Stopped Detect if the system suffers from bsc#1089761.
[  OK  ] Stopped Disk encryption utility (cr...7d and keep the server informed.
[  OK  ] Stopped Serial Getty on ttyS0.
[  OK  ] Closed Open-iSCSI iscsiuio Socket.
[  OK  ] Stopped Daily rotation of log files.
[  OK  ] Stopped Disk encryption utility (cr...d0 and keep the server informed.
[  OK  ] Stopped Disk encryption utility (cr...d2 and keep the server informed.
[  OK  ] Stopped Dispatch Password Requests to Console Directory Watch.
[  OK  ] Closed UUID daemon activation socket.
[  OK  ] Stopped Disk encryption utility (cr...r0 and keep the server informed.
[  OK  ] Stopped Login and scanning of iSCSI devices.
[  OK  ] Closed Open-iSCSI iscsid Socket.
[  OK  ] Stopped Disk encryption utility (cr...95 and keep the server informed.
[  OK  ] Stopped target Multi-User System.
[  OK  ] Stopped Load kdump kernel and initrd.
[  OK  ] Stopped wicked managed network interfaces.
[  OK  ] Stopped wicked network nanny service.
[  OK  ] Stopped Command Scheduler.
[  OK  ] Stopped Purge old kernels.
[  OK  ] Stopped OpenSSH Daemon.
[  OK  ] Stopped Dynamic System Tuning Daemon.
[  OK  ] Stopped gr_systat.service.
[  OK  ] Stopped /etc/init.d/after.local Compatibility.
[  OK  ] Stopped System Logging Service.
[  OK  ] Closed Syslog Socket.
[  OK  ] Stopped Disk encryption utility (cr...11 and keep the server informed.
[  OK  ] Stopped wicked network management service daemon.
[  OK  ] Stopped wicked AutoIPv4 supplicant service.
[  OK  ] Stopped wicked DHCPv4 supplicant service.
[  OK  ] Stopped Getty on tty1.
[  OK  ] Reached target Login Prompts.
[  OK  ] Stopped Terminate Plymouth Boot Screen.
[  OK  ] Stopped Discard unused blocks once a week.
[  OK  ] Stopped Hold until boot process finishes up.
[  OK  ] Stopped /etc/init.d/boot.local Compatibility.
[  OK  ] Stopped Permit User Sessions.
[  OK  ] Stopped Load kdump kernel early on startup.
[  OK  ] Stopped Daily Cleanup of Temporary Directories.
[  OK  ] Reached target Timers.
[  OK  ] Stopped Login Service.
[  OK  ] Stopped Disk encryption utility (cr...37 and keep the server informed.
[  OK  ] Stopped YaST2 Firstboot.
[  OK  ] Stopped YaST2 Second Stage.
[  OK  ] Stopped wicked DHCPv6 supplicant service.
[  OK  ] Stopped D-Bus System Message Bus.
[  OK  ] Closed D-Bus System Message Bus Socket.
[  OK  ] Stopped target Basic System.
[  OK  ] Reached target Sockets.
[  OK  ] Stopped target System Initialization.
[  OK  ] Started Emergency Shell.
[  OK  ] Reached target Emergency Mode.
         Starting Create Volatile Files and Directories...
         Starting Tell Plymouth To Write Out Runtime Data...
[  OK  ] Reached target Network.
[  OK  ] Reached target Network is Online.
         Mounting /hana/log/H35/mnt00001...
         Mounting /usr/sap/H35...
         Mounting /hana/logbackups/H35...
         Mounting /hana/shared/H35...
         Mounting /hana/data/H35/mnt00001...
[  OK  ] Started Restore /run/initramfs on shutdown.
[  OK  ] Started Create Volatile Files and Directories.
         Starting Update UTMP about System Boot/Shutdown...
[  OK  ] Started Update UTMP about System Boot/Shutdown.
         Starting Update UTMP about System Runlevel Changes...
[  OK  ] Started Update UTMP about System Runlevel Changes.
[FAILED] Failed to mount /hana/shared/H35.
See 'systemctl status hana-shared-H35.mount' for details.
[DEPEND] Dependency failed for Remote File Systems.
[FAILED] Failed to mount /hana/log/H35/mnt00001.
See 'systemctl status hana-log-H35-mnt00001.mount' for details.
[FAILED] Failed to mount /usr/sap/H35.
See 'systemctl status usr-sap-H35.mount' for details.
[FAILED] Failed to mount /hana/logbackups/H35.
See 'systemctl status hana-logbackups-H35.mount' for details.
[FAILED] Failed to mount /hana/data/H35/mnt00001.
See 'systemctl status hana-data-H35-mnt00001.mount' for details.
[  OK  ] Started Tell Plymouth To Write Out Runtime Data.
You are in emergency mode. Give root password for maintenance
(or press Control-D to continue):

azsollabdsm35:~ # systemctl status boot-efi.mount -l
● boot-efi.mount - /boot/efi
   Loaded: loaded (/etc/fstab; bad; vendor preset: disabled)
   Active: failed (Result: exit-code) since Fri 2020-09-11 11:07:40 UTC; 30min ago
    Where: /boot/efi
     What: /dev/disk/by-uuid/2760-CA11
     Docs: man:fstab(5)
           man:systemd-fstab-generator(8)
  Process: 7083 ExecMount=/usr/bin/mount /dev/disk/by-uuid/2760-CA11 /boot/efi -t vfat (code=exited, status=32)

Sep 11 11:07:40 azsollabdsm35 systemd[1]: Mounting /boot/efi...
Sep 11 11:07:40 azsollabdsm35 mount[7083]: mount: unknown filesystem type 'vfat'
Sep 11 11:07:40 azsollabdsm35 systemd[1]: boot-efi.mount: Mount process exited, code=exited status=32
Sep 11 11:07:40 azsollabdsm35 systemd[1]: Failed to mount /boot/efi.
Sep 11 11:07:40 azsollabdsm35 systemd[1]: boot-efi.mount: Unit entered failed state.
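The `mount: unknown filesystem type 'vfat'` error means the running kernel could not find the vfat filesystem module. A hedged diagnostic sketch (not from the original thread; paths are the usual Linux locations) that could be run from the emergency shell to check for a kernel/modules mismatch:

```shell
# Hypothetical diagnostics from the emergency shell.
KVER=$(uname -r)
echo "running kernel: $KVER"

# Is the vfat module present for the running kernel at all?
modinfo -k "$KVER" vfat >/dev/null 2>&1 \
    && echo "vfat module found for $KVER" \
    || echo "vfat module MISSING for $KVER"

# Compare against the module trees actually installed; a running kernel
# with no matching /lib/modules/<version> tree would explain the failures.
ls /lib/modules/ 2>/dev/null || echo "no /lib/modules directory"
```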

azsollabdsm35:~ # systemctl status systemd-modules-load.service -l
● systemd-modules-load.service - Load Kernel Modules
   Loaded: loaded (/usr/lib/systemd/system/systemd-modules-load.service; static; vendor preset: disabled)
   Active: failed (Result: start-limit) since Fri 2020-09-11 11:07:40 UTC; 43min ago
     Docs: man:systemd-modules-load.service(8)
           man:modules-load.d(5)
  Process: 7030 ExecStart=/usr/lib/systemd/systemd-modules-load (code=exited, status=1/FAILURE)
 Main PID: 7030 (code=exited, status=1/FAILURE)

Sep 11 11:07:40 azsollabdsm35 systemd[1]: systemd-modules-load.service: Main process exited, code=exited, status=1/FAILURE
Sep 11 11:07:40 azsollabdsm35 systemd[1]: Failed to start Load Kernel Modules.
Sep 11 11:07:40 azsollabdsm35 systemd[1]: systemd-modules-load.service: Unit entered failed state.
Sep 11 11:07:40 azsollabdsm35 systemd[1]: systemd-modules-load.service: Failed with result 'exit-code'.
Sep 11 11:07:40 azsollabdsm35 systemd[1]: systemd-modules-load.service: Start request repeated too quickly.
Sep 11 11:07:40 azsollabdsm35 systemd[1]: Failed to start Load Kernel Modules.
Sep 11 11:07:40 azsollabdsm35 systemd[1]: systemd-modules-load.service: Failed with result 'start-limit'.
Sep 11 11:07:40 azsollabdsm35 systemd[1]: systemd-modules-load.service: Start request repeated too quickly.
Sep 11 11:07:40 azsollabdsm35 systemd[1]: Failed to start Load Kernel Modules.
Sep 11 11:07:40 azsollabdsm35 systemd[1]: systemd-modules-load.service: Failed with result 'start-limit'.
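The `start-limit` result hides the original error. A hedged sketch (standard systemd tooling, not commands from the thread) to find which module actually failed to load:

```shell
# Hypothetical follow-up from the emergency shell; commands are guarded
# so the sketch degrades gracefully on other systems.

# Re-run the loader by hand: it prints the module it cannot insert.
if [ -x /usr/lib/systemd/systemd-modules-load ]; then
    /usr/lib/systemd/systemd-modules-load || echo "loader exited non-zero"
fi

# List the modules-load.d(5) drop-ins that request modules at boot.
ls /etc/modules-load.d/ /usr/lib/modules-load.d/ 2>/dev/null

# The journal for the unit usually names the failing module.
command -v journalctl >/dev/null \
    && journalctl -u systemd-modules-load.service -b --no-pager | tail -n 20 \
    || echo "journalctl not available"
```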
schaefi commented 4 years ago

Hmm, no mount process worked after the changes you made to the system. Messages like

mount: unknown filesystem type 'vfat'

scare me. I'm sorry, but I have no insight into what your installation process did. It looks like it changed something about the kernel, and maybe the dracut initrd was not rebuilt with the new modules, so there is now a mismatch. Just a wild guess, but would it have helped to call dracut prior to the reboot?

After all, at the moment I don't see how we can help here.

Thoughts?
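The dracut suggestion above can be sketched roughly as follows. The flags are standard dracut usage; the initrd path follows the usual SLES naming and is an assumption, not taken from the affected host:

```shell
# Hypothetical sketch: rebuild the initrd for the running kernel after
# installing software that touches kernel modules, then verify that the
# vfat module (needed to mount /boot/efi) was included.
KVER=$(uname -r)

if command -v dracut >/dev/null; then
    # --force overwrites the existing initrd for this kernel.
    dracut --force "/boot/initrd-$KVER" "$KVER"

    # lsinitrd lists the initrd contents; check for the vfat module.
    lsinitrd "/boot/initrd-$KVER" | grep -i vfat \
        || echo "vfat not found in initrd"
else
    echo "dracut not available on this system"
fi
```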

schaefi commented 4 years ago

I'll give the image a smoke test without modifications, just to make sure the version we delivered works.

jaiawasthi commented 4 years ago

@schaefi we have installed an HPE-recommended software package, system-foundation-software. If you want, I can provide it to you and maybe you can work with that?

schaefi commented 4 years ago

OK, the image without modifications just works and has, e.g., EFI mounted as expected:

... on /boot/efi type vfat (rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro)

I guess you have no opportunity to access your system after the reboot, right? So no debugging is possible. If so, I suggest you deploy again, and after installing your stuff, don't reboot; let us first have a look at the situation.

schaefi commented 4 years ago

> @schaefi we have installed a HPE recommended software , system-foundation-software. If you want i can provide that to you and maybe you can work with that ?

I rather think the HPE software should be looked at by the engineers who wrote it; that would lead to a much faster solution.

schaefi commented 4 years ago

Stupid question, btw: is that additional software stack in some way certified/tested for SLES?

jaiawasthi commented 4 years ago

> ok the image without modifications just works and has e.g efi mounted as expected: ... on /boot/efi type vfat (rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro)

Yes, after installing the HPE software, it goes a bit haywire.

> I guess you have no opportunity to access your system after reboot right ? So no debugging possible.

It enters emergency mode, so I do have some access.

> I rather think HPE software should be looked at by the engineers who wrote that.

Sure, but since it's also something at the OS level, I wanted to get some insight from you.

> Stupid question btw. Is that additional software stack in some way certified/tested for SLES ?

Yes, it's certified for SLES 12 SP4. Do you want to test it out?

jaiawasthi commented 4 years ago

@schaefi, can you please help us here?

schaefi commented 4 years ago

> can you please help us here

Can you bring up a machine with the HPE stuff installed but not yet rebooted? If I can ssh to a machine in that state, I can look and see if I find something. You can also run the supportconfig tool and grab the logs it produces. With all that info, you should open a bug that includes our findings and assign it to the people who maintain this.

Would that help?
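Collecting logs with supportconfig, as suggested, might look like this. This is a sketch; exact flags and output paths vary with the installed supportutils version:

```shell
# Hypothetical sketch of gathering logs with SUSE's supportconfig tool
# before opening a bug; run as root on the affected host.
if command -v supportconfig >/dev/null; then
    supportconfig                 # writes an archive under /var/log by default
    ls -lt /var/log/ | head -n 5  # the newest entry should be the archive
else
    echo "supportconfig not installed (provided by the supportutils package)"
fi
```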

jaiawasthi commented 4 years ago

@schaefi sure, thanks. Let me see if the machines are available; I will try to get you the supportconfig output as well as access to the blades.

rjschwei commented 4 years ago

Just to reiterate: with very high likelihood, this is something HPE will have to fix. So the longer we keep HPE out of the loop, the longer it will take to find a fix.

jaiawasthi commented 4 years ago

@rjschwei, I have raised a parallel request with HPE as well and am waiting on their analysis now. But since we have a common interaction, I raised an issue here too, in case we could find the root cause together. Also, I did the complete procedure again and the issue was not reproducible, so I will need to check again.

rjschwei commented 4 years ago

@jaiawasthi thanks for the effort. Maybe it was a fluke; keep us posted.

jaawasth commented 4 years ago

We made a couple of retries but weren't able to reproduce the issue.

schaefi commented 4 years ago

ok, so let's close this and hope it does not come back.