cablelabs / snaps-boot

Linux install and network setup for SNAPS
Apache License 2.0

Ensure still works properly with 16.04.5 #148

Closed spisarski closed 6 years ago

spisarski commented 6 years ago

As 16.04.4 is no longer available from the Ubuntu download site, we need to be sure that 16.04.5 still operates as designed. I have observed issues when using .5 after running iaas_launch.py -s on 3 of 4 servers; however, I did not have time to pin down the cause, so this task should help expose any potential issues.

bo-quan commented 6 years ago

Confirmed that with 16.04.5, the "-s" option does not restart nodes automatically in lab2 (HP DL G9); I had to manually reboot the servers via iLO.

spisarski commented 6 years ago

With the 16.04.5 ISO, servers only sometimes reboot properly and/or take an inordinately long time to restart.

Reverting back to .4 appears to work; however, this bug affects any approach to automation as the .4 image is no longer available from the Ubuntu download site.

bo-quan commented 6 years ago

Aricent could not reproduce the issue with 16.04.5 (on Dell servers), whereas CableLabs has seen it on HP servers. Aricent will try to use the CableLabs lab1 compute nodes to reproduce the issue.

bo-quan commented 6 years ago

CableLabs did not have this reboot issue with 16.04.5 in lab3 (Dell PowerEdge servers) either, so this appears to be isolated to HP servers.

bo-quan commented 6 years ago

Aricent was able to access lab1 and reproduce it. We suspect the cause is either a disconnected power supply or the use of legacy BIOS mode. Bo will try to collect iLO logs by manually rebooting the server.

bo-quan commented 6 years ago

An additional experiment showed that Ubuntu startup took a long time after POST (see attached screenshot, img_20181009_122641).

bo-quan commented 6 years ago

No obvious errors in the POST output or the log files. We need to dig further into this.

raman-mann commented 6 years ago

From the logs, it seems that the boot process gets stuck for a while trying to bring up the eno1 interface; it waits there for more than 5 minutes. The problem seems to be caused by cloud-init: eno1 is not sending a DHCP request to the server, though cloud-init is configured to do so. Issue #155 should fix this.
We will check and confirm.

bo-quan commented 6 years ago

I confirmed that applying the workaround for Issue #155 (i.e., disabling cloud-init) did the trick, reducing the boot-up time for lab1-compute1. However, is there any explanation for why we did not see the issue with 16.04.4?

bo-quan commented 6 years ago

During today's sync-up meeting, it was noted that disabling cloud-init would prevent SSH keys from being set up properly on both VMs and bare metal. We need to reconsider the approach; therefore, the previously closed #155 has been put back under review.

mansi-jain2 commented 6 years ago

We are able to set up SSH keys even after removing 50-cloud-init.cfg, the file that cloud-init generates by default for network configuration. To disable cloud-init's network configuration capabilities, we need to write a file /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg containing: network: {config: disabled}
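
For illustration, a minimal Python sketch of writing that drop-in file. The file path and its one-line body come from the comment above; the function name and the `root` parameter are hypothetical, added only so the sketch can be tried against a scratch directory rather than a live node. This is not necessarily how snaps-boot applies the change.

```python
from pathlib import Path

# Path and body are taken from the comment above; all other names are illustrative.
DISABLE_NETWORK_CFG = "etc/cloud/cloud.cfg.d/99-disable-network-config.cfg"
DISABLE_NETWORK_BODY = "network: {config: disabled}\n"


def disable_cloud_init_network(root="/"):
    """Write the drop-in that tells cloud-init not to manage networking.

    `root` is a hypothetical parameter: point it at a scratch directory to
    try the function without touching the real /etc/cloud tree.
    """
    target = Path(root) / DISABLE_NETWORK_CFG
    # Create /etc/cloud/cloud.cfg.d under `root` if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(DISABLE_NETWORK_BODY)
    return target
```

Called with the default `root="/"` (as root on the target node), this writes the file at the path named above. The rest of cloud-init, including SSH key setup, is left untouched, since only the network section is disabled.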

bo-quan commented 6 years ago

Bo will validate this in lab1.

bo-quan commented 6 years ago

With the fix for #155 in place, server reboot is now normal for 16.04.5.