att-comdev / ucp-integration

This project has moved to OpenStack.
https://www.airshipit.org/
Apache License 2.0

Bubbling up relevant error information during site deploy #17

Open darrendejaeger opened 6 years ago

darrendejaeger commented 6 years ago

I've run into a couple of scenarios where the site deploy toolset at my disposal did not make it clear what had gone wrong during deployment. For example, I specified a particular package to be installed on the hosts via the site manifests. The Genesis host had no issue at all. During the actual site deploy, however, I hit a problem that was difficult to track down: the MaaS GUI showed the nodes as "Deployed", but they weren't joining my k8s cluster. Digging deeper showed the following:

Queried node's BMC - Power state queried: on | Fri, 19 Jan. 2018 17:13:37
Node post-installation failure - 'cloudinit' running modules for config | Fri, 19 Jan. 2018 17:13:32
Node post-installation failure - 'cloudinit' running config-apt-configure with frequency once-per-instance | Fri, 19 Jan. 2018 17:13:24
Node changed status - From 'Deploying' to 'Deployed' | Fri, 19 Jan. 2018 17:13:17

I then searched the cloud-init logs on that host (/var/log/cloud-init-output.log, /var/log/cloud-init.log) but came up empty-handed. It wasn't until I examined /var/log/syslog that I found the problem:

17:23:59 promjoin.sh[2230]: + apt-get install -y --no-install-recommends ceph-common=10.2.7-0ubuntu0.16.04.1 curl jq docker-engine=1.13.1-0~ubuntu-xenial socat=1.7.3.1-1
17:23:59 promjoin.sh[2230]: Reading package lists...
17:23:59 promjoin.sh[2230]: Building dependency tree...
17:23:59 promjoin.sh[2230]: Reading state information...
17:23:59 promjoin.sh[2230]: E: Version '10.2.7-0ubuntu0.16.04.1' for 'ceph-common' was not found
17:23:59 promjoin.sh[2230]: ++ date +%s
17:23:59 promjoin.sh[2230]: + now=1516382639
17:23:59 promjoin.sh[2230]: + [[ 1516382639 -gt 1516382635 ]]
17:23:59 promjoin.sh[2230]: + log Failed to install apt packages.
17:23:59 promjoin.sh[2230]: ++ date
17:23:59 promjoin.sh[2230]: + echo Fri Jan 19 17:23:59 UTC 2018 Failed to install apt packages.
17:23:59 promjoin.sh[2230]: Fri Jan 19 17:23:59 UTC 2018 Failed to install apt packages.
17:23:59 promjoin.sh[2230]: + exit 1
17:23:59 systemd[1]: promjoin.service: Main process exited, code=exited, status=1/FAILURE
17:23:59 systemd[1]: promjoin.service: Unit entered failed state.
17:23:59 systemd[1]: promjoin.service: Failed with result 'exit-code'.

Would it be possible, in some fashion, to make it easier to determine the root cause of these kinds of failures?
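
As an illustration of the kind of check that could help, here is a minimal sketch that scans syslog lines for the two signals seen in the excerpt above: apt `E:` errors and systemd units whose main process exited with a non-zero status. It is not part of any existing site deploy tool. The function name `summarize_failures` and both regular expressions are assumptions derived only from the log lines in this issue, so other failure modes would need their own patterns.

```python
import re

# Both patterns are assumptions based on the syslog excerpt in this issue.
# Matches apt error lines such as "promjoin.sh[2230]: E: Version ... was not found".
APT_ERROR = re.compile(r"\]: E: (?P<msg>.+)$")
# Matches systemd lines such as
# "systemd[1]: promjoin.service: Main process exited, code=exited, status=1/FAILURE".
UNIT_FAILED = re.compile(
    r"systemd\[\d+\]: (?P<unit>\S+): Main process exited, "
    r"code=exited, status=(?P<status>\d+)"
)


def summarize_failures(lines):
    """Return (apt_errors, failed_units) found in an iterable of syslog lines.

    apt_errors is a list of error messages; failed_units is a list of
    (unit_name, exit_status) tuples.
    """
    apt_errors = []
    failed_units = []
    for line in lines:
        line = line.rstrip("\n")
        match = APT_ERROR.search(line)
        if match:
            apt_errors.append(match.group("msg"))
            continue
        match = UNIT_FAILED.search(line)
        if match:
            failed_units.append((match.group("unit"), int(match.group("status"))))
    return apt_errors, failed_units


if __name__ == "__main__":
    # Reading /var/log/syslog usually requires elevated privileges on the node.
    with open("/var/log/syslog", errors="replace") as handle:
        errors, units = summarize_failures(handle)
    for message in errors:
        print("apt error:", message)
    for unit, status in units:
        print("failed unit:", unit, "exit status", status)
```

Run on the affected node, a check like this would have pointed straight at the missing `ceph-common` version and the failed `promjoin.service` unit, without a manual search through three log files.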