Closed craddm closed 6 months ago
This is only one failed step, I think: the line `[2024-03-05 17:12:18: FAIL ] Finished build: 2h6m41s` is marked as a failure because of one or more failures earlier in the build.
Fair enough - but the failed step hasn't failed as far as I can tell, and the step that is "running" isn't running.
Can you post the contents of `/var/log/cloud-init-output.log` here? Presumably it's the line `sh onboard_agent.sh --purge || die "Could not remove Linux omsagent!";` that has caused this failure. Are we sure that:
- this job succeeded
- this script returns 0 on success
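On the second point: the `cmd || die "message"` pattern only fires when the command exits non-zero, so it's worth confirming what `onboard_agent.sh` actually returns. A minimal, self-contained sanity check of the pattern itself (the `die` helper here is a stand-in for the build script's real one, which presumably logs and aborts):

```shell
# Stand-in for the build script's die helper (assumption: the real one
# logs the message and aborts; this one just records the failure).
die() { echo "FAIL: $1"; failed=1; }
failed=0

true  || die "a zero exit status never triggers die"
false || die "a non-zero exit status does"

echo "failed=$failed"
```

Running `sh onboard_agent.sh --purge; echo $?` by hand on the VM would show the script's real exit status.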
I've added the cloud-init log to the issue.
OMSagent is not installed, which I'm taking to mean the script ran successfully. The alternative would be that it was never installed in the first place, but that would mean something changed in the last few weeks. Correction: apparently it has failed, as now it seems the OMS agent extension is installed. (Weirdly, this is my second attempt at building a VM image where I've hit this issue, and the first time it wasn't installed.)
I wasn't getting this problem when I built a VM image a few weeks ago after implementing this fix, FWIW.
Can we run `onboard_agent.sh` in some kind of debug/verbose mode?
Ah, of course the cloud-init log is too long to post. Basically there is nothing in the log to indicate a failure, but also nothing to indicate a success.
I just ran `onboard_agent.sh` manually, and it does seem to be verbose and returns 0 on success. None of that shows up in the cloud-init log, however; it's like the process stops at a certain point. Here's the end of `cloud-init-output.log`:
```none
178300K .......... .......... .......... .......... .......... 74% 187M 0s
178350K .......... .......... .......... .......... .......... 74% 279M 0s
178400K .......... .......... .......... .......... .......... 74% 305M 0s
178450K .......... .......... .......... ........Cloud-init v. 23.4.4-0ubuntu0~20.04.1 running 'init-local' at Tue, 05 Mar 2024 23:04:58 +0000. Up 7.24 seconds.
Cloud-init v. 23.4.4-0ubuntu0~20.04.1 running 'init' at Tue, 05 Mar 2024 23:05:02 +0000. Up 10.60 seconds.
ci-info: +++++++++++++++++++++++++++++++++++++++Net device info+++++++++++++++++++++++++++++++++++++++
ci-info: +--------+------+------------------------------+---------------+--------+-------------------+
ci-info: | Device |  Up  |           Address            |      Mask     | Scope  |     Hw-Address    |
ci-info: +--------+------+------------------------------+---------------+--------+-------------------+
ci-info: |  eth0  | True |          10.48.0.10          | 255.255.255.0 | global | 60:45:bd:d0:87:7e |
ci-info: |  eth0  | True | fe80::6245:bdff:fed0:877e/64 |       .       |  link  | 60:45:bd:d0:87:7e |
ci-info: |   lo   | True |          127.0.0.1           |   255.0.0.0   |  host  |         .         |
ci-info: |   lo   | True |           ::1/128            |       .       |  host  |         .         |
ci-info: +--------+------+------------------------------+---------------+--------+-------------------+
ci-info: +++++++++++++++++++++++++++++++Route IPv4 info+++++++++++++++++++++++++++++++
ci-info: +-------+-----------------+-----------+-----------------+-----------+-------+
ci-info: | Route |   Destination   |  Gateway  |     Genmask     | Interface | Flags |
ci-info: +-------+-----------------+-----------+-----------------+-----------+-------+
ci-info: |   0   |     0.0.0.0     | 10.48.0.1 |     0.0.0.0     |    eth0   |   UG  |
ci-info: |   1   |    10.48.0.0    |  0.0.0.0  |  255.255.255.0  |    eth0   |   U   |
ci-info: |   2   |  168.63.129.16  | 10.48.0.1 | 255.255.255.255 |    eth0   |  UGH  |
ci-info: |   3   | 169.254.169.254 | 10.48.0.1 | 255.255.255.255 |    eth0   |  UGH  |
ci-info: +-------+-----------------+-----------+-----------------+-----------+-------+
ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: | Route | Destination | Gateway | Interface | Flags |
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: |   1   |  fe80::/64  |    ::   |    eth0   |   U   |
ci-info: |   3   |    local    |    ::   |    eth0   |   U   |
ci-info: |   4   |  multicast  |    ::   |    eth0   |   U   |
ci-info: +-------+-------------+---------+-----------+-------+
Cloud-init v. 23.4.4-0ubuntu0~20.04.1 running 'modules:config' at Tue, 05 Mar 2024 23:05:08 +0000. Up 17.25 seconds.
Cloud-init v. 23.4.4-0ubuntu0~20.04.1 running 'modules:final' at Tue, 05 Mar 2024 23:05:09 +0000. Up 18.06 seconds.
System setup through cloud-init is finished. Configuration took 19.03 seconds
```
You can see that it seems to cut off in the middle of a download. So perhaps for some reason the VM is shutting down before completing this step.
Digging through `cloud-init-output.log` finds a couple of other oddities:
```none
Cloud-init v. 23.4.4-0ubuntu0~20.04.1 running 'modules:final' at Tue, 05 Mar 2024 15:06:36 +0000. Up 121.87 seconds.
2024-03-05 15:35:43,830 - util.py[WARNING]: Failed to install packages: ['adcli', 'apt-transport-https', 'aspell', 'auditd', 'automake', 'autotools-dev', 'bison', 'blobfuse', 'build-essential', 'cargo', 'clamav', 'clamav-base', 'clamav-daemon', 'clamav-freshclam', 'clamav-unofficial-sigs', 'cmake', 'code', 'cuda', 'curl', 'dbeaver-ce', 'default-jdk', 'emacs', 'firefox', 'flex', 'g++', 'gcc', 'gdebi-core', 'gfortran', 'git', 'git-core', 'grass', 'imagemagick', 'jags', 'jupyter', 'jupyter-nbconvert', 'jupyter-notebook', 'krb5-user', 'ldap-utils', 'libboost-all-dev', 'libbz2-dev', 'libcudnn8', 'libcudnn8-dev', 'libcurl4-openssl-dev', 'libeccodes-dev', 'libffi-dev', 'libfftw3-dev', 'libfribidi-dev', 'libgdal-dev', 'libglpk-dev', 'libgmp-dev', 'libgsl-dev', 'libgtk2.0-dev', 'libharfbuzz-dev', 'liblapack-dev', 'liblzma-dev', 'libmagick++-dev', 'libmpfr-dev', 'libncurses5-dev', 'libncursesw5-dev', 'libopenblas-dev', 'libpoppler-cpp-dev', 'libpq-dev', 'libreadline-dev', 'libreoffice', 'libsasl2-dev', 'libsodium-dev', 'libsqlite3-dev', 'libssl-dev', 'libtool', 'libudunits2-dev', 'libv8-dev', 'libxml2-dev', 'libzmq3-dev', 'llvm', 'lyx', 'm4', 'nfs-common', 'octave', 'onboard', 'p7zip', 'parallel', 'postgresql-client', 'python3', 'python3-openssl', 'python3-pip', 'qgis', 'r-base', 'r-base-core', 'r-base-dev', 'realmd', 'ristretto', 'rustc', 'scala', 'sssd', 'sssd-tools', 'sysstat', 'texlive-full', 'texstudio', 'tk-dev', 'tmpreaper', 'unixodbc-dev', 'unzip', 'weka', 'weka-doc', 'wget', 'xfce4', 'xfce4-terminal', 'xrdp', 'xz-utils', 'zlib1g-dev', 'zsh']
2024-03-05 15:35:43,948 - cc_package_update_upgrade_install.py[WARNING]: 1 failed with exceptions, re-raising the last one
2024-03-05 15:35:43,950 - util.py[WARNING]: Running module package_update_upgrade_install (<module 'cloudinit.config.cc_package_update_upgrade_install' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_package_update_upgrade_install.py'>) failed
```
Note that many of the packages listed do in fact get successfully installed at some point, as they're there at the command line.
```none
ERROR: shui 0.7.0 has requirement packaging<22.0,>=21.0, but you'll have packaging 23.2 which is incompatible.
Installing collected packages: packaging, ujson, psutil, glances, semantic-version, charset-normalizer, requests, python-gnupg, termcolor, fire, attrs, rpds-py, referencing, zipp, importlib-resources, jsonschema-specifications, pkgutil-resolve-name, jsonschema, wget, requests-futures, jill, numpy, contourpy, fonttools, matplotlib, PyYAML, tqdm, pastel, pylev, clikit, cleo, lib-detect-testenv, cli-exit-tools, pathlib3x, url-normalize, shui
Attempting uninstall: psutil
Found existing installation: psutil 5.5.1
Not uninstalling psutil at /usr/lib/python3/dist-packages, outside environment /usr
Can't uninstall 'psutil'. No files were found to uninstall.
<continues with various other packages>
```
The shui error doesn't cause a problem, as Spark/Hadoop gets successfully installed later on.
`shui` is here. I can release an updated version if that helps. Which Python version are you running with?
System Python is 3.8.10
It's annoying that a failure in the packages step doesn't cause the job to fail. Our workaround is to check for as many of the executables that should have been installed during this step as possible. If any of these are not found, we mark the build as failed.
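That workaround can be sketched as a small shell loop (the command list below is illustrative, not the real list used in the build; one entry is deliberately missing to show the failure path):

```shell
# Mark the build as failed if any expected executable is missing from PATH.
missing=0
for cmd in bash ls this-command-should-not-exist; do  # hypothetical list
    if ! command -v "$cmd" >/dev/null 2>&1; then
        echo "MISSING: $cmd"
        missing=1
    fi
done

if [ "$missing" -eq 0 ]; then
    echo "all expected executables present"
else
    echo "build marked as failed"
fi
```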
I also ran the `check_installation.py` script manually, and it finds two errors:
```none
srdbuildadmin@CandidateSecureResearchDesktop-Ubuntu2004-202403051504:~$ sudo python3 /opt/build/check_installation.py
Programming languages:
... cmake [/usr/bin/cmake] 3.16.3
... g++ [/usr/bin/g++] 9.4.0
... gcc [/usr/bin/gcc] 9.4.0
... gfortran [/usr/bin/gfortran] 9.4.0
... java [/usr/bin/java] 11.0.22
... julia [/usr/local/bin/julia] 1.10.2
... Python 3.10 [/opt/pyenv/versions/3.10.13/bin/python] 3.10.13
... Python 3.9 [/opt/pyenv/versions/3.9.18/bin/python] 3.9.18
... Python 3.11 [/opt/pyenv/versions/3.11.8/bin/python] 3.11.8
... R [/usr/bin/R] 4.3.3
... rustc [/usr/bin/rustc] 1.75.0
... scala [/usr/bin/scala] 2.11.12
... ERROR spark-shell not found!
Package managers:
... cargo [/usr/bin/cargo] cargo 1.75.0
... pip (Python 3.10) [/opt/pyenv/versions/3.10.13/bin/pip] 24.0
... pip (Python 3.9) [/opt/pyenv/versions/3.9.18/bin/pip] 24.0
... pip (Python 3.11) [/opt/pyenv/versions/3.11.8/bin/pip] 24.0
Editors/IDEs:
... code [/usr/bin/code] 1.87.0
... emacs [/usr/bin/emacs] 26.3
... nano [/usr/bin/nano] 4.8
... pycharm-community [/snap/bin/pycharm-community] 2023.3.4
... rstudio [/usr/bin/rstudio] 2023.12.1+402
... vim [/usr/bin/vim] 8.1.2269-1ubuntu5.21
Presentation tools:
... latex [/usr/bin/latex] 3.14159265-2.6-1.40.20
... libreoffice [/usr/bin/libreoffice] 6.4.7.2
... xelatex [/usr/bin/xelatex] 3.14159265-2.6-0.999991
Development tools:
... azuredatastudio [/usr/bin/azuredatastudio] 1.48.0-1708981902
... bash [/usr/bin/bash] 5.0.17(1)-release
... dbeaver-ce [/usr/bin/dbeaver-ce] 24.0.0
... firefox [/usr/bin/firefox] 123.0.1
... git [/usr/bin/git] 2.25.1
... htop [/usr/bin/htop] 2.2.0
... nvidia-smi [/usr/bin/nvidia-smi] 550.54.14
... psql [/usr/bin/psql] 16.2-1.pgdg20.04+1
... ERROR sqlcmd not found!
... weka [/usr/bin/weka] 3.6.14
36 test(s), 2 failure(s)
```
These seem to be false alarms, as both of the erroring commands work from the CLI.

EDIT: Never mind, that's because I'm running it with `sudo` and they're not on the path there. Weird though, as we run `cloud-init` as root and copy the PATH to `/etc/skel`, so they ought to be locatable in both cases.
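One possible explanation (an assumption, not confirmed from the logs): `sudo` resets `PATH` to its configured `secure_path`, which typically drops entries such as `/snap/bin` and `/opt/...`, so `command -v` can miss binaries under `sudo` that a normal shell finds. A small illustration with hard-coded example paths:

```shell
# in_path DIR PATH -> prints yes/no depending on whether DIR is in PATH
in_path() { case ":$2:" in *":$1:"*) echo yes ;; *) echo no ;; esac; }

# Example login-shell PATH vs a typical sudoers secure_path (both hypothetical).
user_path="/snap/bin:/opt/spark/bin:/usr/local/bin:/usr/bin:/bin"
sudo_path="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

echo "user shell sees /snap/bin: $(in_path /snap/bin "$user_path")"
echo "sudo sees /snap/bin:       $(in_path /snap/bin "$sudo_path")"
```

Comparing `echo $PATH` with `sudo sh -c 'echo $PATH'` on the VM would confirm whether this is what's happening.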
I suspect the fail is actually the failure in the packages step, and that the logs are being parsed incorrectly, misattributing the failure to the omsagent removal step.
From the part of the cloud-init dump below, it seems clear that this step should be a fail, but it's marked as still running by `analyse_build.py`.
```json
{
    "description": "running config-package_update_upgrade_install with frequency once-per-instance",
    "event_type": "finish",
    "name": "modules-final/config-package_update_upgrade_install",
    "origin": "cloudinit",
    "result": "FAIL",
    "timestamp": 1709731334.0
}
```
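The JSON above looks like output from `cloud-init analyze dump` (an assumption); a quick way to see only the failed events is a grep over the dump. Sketch, with two sample events standing in for the real file:

```shell
# Sample events standing in for the real `cloud-init analyze dump` output.
cat > /tmp/dump.json <<'EOF'
{ "name": "modules-final/config-package_update_upgrade_install", "result": "FAIL" }
{ "name": "modules-final/config-scripts-user", "result": "SUCCESS" }
EOF

# Keep only the FAIL events.
fails=$(grep '"result": "FAIL"' /tmp/dump.json)
echo "$fails"
```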
It's totally possible that the `omsagent` step actually is failing, if it tries to e.g. use an executable that isn't there, or similar. I don't think we'll solve the issue of cloud-init failures for 4.2.0, but it's something to think about for the Pulumi release (e.g. does Bureau have a more robust system for checking that packages are installed @JimMadge?)
ok, something very very odd is going on. yesterday evening (~5pm), the omsagent was definitely not there on the latest build candidate. The folders were not there on the machine, and omsagent was not showing up on the portal as an extension/application for the machine. This morning it is installed again.
Could we make removing `omsagent` part of the deploy-time cloud-init instead?
Hmm, the instructions here for unregistering a workspace might do the job without having to actually remove the agent: https://learn.microsoft.com/en-us/azure/azure-monitor/agents/agent-manage?tabs=PowerShellLinux#enable-auto-update-for-the-linux-agent

Will try that first as part of the image-building process, but it should also work as part of the deploy-time cloud-init.
> (e.g. does Bureau have a more robust system for checking that packages are installed @JimMadge ?)
This, and a lot of other things, will be easier in Ansible compared to bash/cloud-init. On this specifically, you could make assertions or (hopefully) use roles to make sure any dependencies are installed with idempotent tasks.
> ok, something very very odd is going on. yesterday evening (~5pm), the omsagent was definitely not there on the latest build candidate. The folders were not there on the machine, and omsagent was not showing up on the portal as an extension/application for the machine. This morning it is installed again.

This makes me wonder whether this is an organisational level Azure policy. Like the NetworkWatcher RGs that keep appearing.
> ok, something very very odd is going on. yesterday evening (~5pm), the omsagent was definitely not there on the latest build candidate. The folders were not there on the machine, and omsagent was not showing up on the portal as an extension/application for the machine. This morning it is installed again.
>
> This makes me wonder whether this is an organisational level Azure policy. Like the NetworkWatcher RGs that keep appearing.
Exactly what I was thinking. I had a quick look at policies on the VM images subscription and hadn't spotted anything obvious. Found it: there is indeed a policy that checks whether machines have the monitoring agent installed, and installs it if not. It was applied this morning at ~4am.
Thus I suspect the omsagent removal step has been working: as I say, immediately after creating the VM, omsagent is absent. It just got reinstalled later. Normally we'd have moved on to generalising the VM image before that happens.
In that case, does that mean everything should work so long as you complete the image build quickly and avoid 4am?
Possibly, but it's still not totally clear to me what is and isn't failing. Something is failing, even if it isn't the `omsagent` step, and I'm leaning towards that being incorrect parsing of the logs. It appears to be something to do with some apt packages failing to install, which can be seen in the cloud-init log excerpts above. But as far as I can tell those packages are actually installed, so I'm not sure why they come up as failed.
Right, OK, scratch that. Omsagent was not present on the newest VM image I created, even though, this time, the removal step definitely failed: I had instead used the command to disconnect an existing omsagent from the default workspace, and it failed because there was no omsagent present to disconnect.
What I'm thinking now:

1. For some SRD candidate images, the image was left for long enough after the initial build that the omsagent was installed automatically as per policy.
2. SRDs created from that image would have omsagent registered with the default workspace, which would then create issues when trying to create new SRDs within an SRE.
3. SRDs that go from the build candidate step to the gallery step before the omsagent is installed automatically would never hit this problem.

Thus, the multihoming problem might appear with some images but not others.
This also means that removing anything to do with omsagent as part of the cloud-init for building the SRD candidate image won't necessarily fix the problem, as it may be automatically re-installed anyway as per policy.
Adding it to the deploy-time cloud-init might be more robust. We should check whether omsagent is present and, if so, remove its association with the default workspace. That way it should work fine.
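That deploy-time check could look roughly like this (a sketch: the `omsadmin.sh` path is the agent's usual install location, and the `-X` flag, which removes all workspace associations, is taken from Microsoft's agent-management docs and should be verified before use):

```shell
# If the OMS agent is present, detach it from all workspaces;
# otherwise do nothing.
OMSADMIN=/opt/microsoft/omsagent/bin/omsadmin.sh

if [ -x "$OMSADMIN" ]; then
    action="disconnect"
    sudo sh "$OMSADMIN" -X   # remove all workspace associations (verify flag)
else
    action="skip"
    echo "omsagent not installed; nothing to do"
fi
```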
I would note that I am sure that the omsagent removal step did not produce a fail message when I was testing it at the time, so I'm still quite confused about that.
How about removing omsagent during image creation then?
Something I've just tried that seems to work: I modified the `deprovision_vm.sh` script. It now checks whether `omsagent` is present and, if it is, disconnects it from any connected workspaces before deprovisioning the VM. The resulting built VM is not linked to any workspaces, even if omsagent is still installed, and can then be hooked up to the correct workspace using `Setup_SRE_Monitoring.ps1` as normal.
Note that some packages in the Ubuntu repository had changed name, and this was causing an additional failure in the updating/installing Ubuntu packages step. It was not easy to figure out which packages were failing, as lots of packages purportedly failed to install but were subsequently found to have installed correctly.

To debug it, we uploaded the `packages-apt.list` file and passed it to `apt-get install` using `xargs`, which highlighted that two packages were not found/renamed.
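That debugging step looks roughly like this, shown with `echo` standing in for the real `apt-get` call so it's safe to run anywhere (the list contents are illustrative, and `-s` would make `apt-get` simulate the install, so missing or renamed packages get reported without changing the system):

```shell
# A cut-down package list; the real packages-apt.list is much longer.
printf '%s\n' git-core libv8-dev > /tmp/packages-apt.list

# The real invocation would be:
#   xargs sudo apt-get install -s < /tmp/packages-apt.list
# apt-get then names any package it cannot locate.
cmdline=$(xargs echo sudo apt-get install -s < /tmp/packages-apt.list)
echo "$cmdline"
```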
Reopening until this is merged/confirmed.
@craddm Did you look at what those packages were? Might removing them break other things? We should probably have different packages files for each Ubuntu release, as the package names and availability will change between releases.
I removed `git-core`, which was just a dummy package that installs the `git` package, and I renamed `libv8-dev` to `libnode-dev`. These were the two packages reported as problematic when trying to install from `package-apt.list`, and after removing/renaming them, the new SRD image built without errors.
`git-core` may not be a problem. If the libnode package has changed name, this will break builds for older (presumably focal) releases where the old name is correct.
It was on focal that we were having the problem. We haven't tried building with jammy; we weren't using it because it wasn't supported by the Log Analytics Agent, although it actually should be now.
Oh I see, surprised we didn't notice before. What about bionic? Perhaps the sensible thing is to only support focal this release?
We can only use 20.04 at the moment as this is the latest version supported by Azure Update Management.
For the next release, we should look into this as 24.04 is coming soon. Being two LTSs behind doesn't feel great.
> Note that some packages in the Ubuntu repository had changed name, and this was causing an additional failure of the updating/installing Ubuntu packages step.
To be clear on this part: I would be quite surprised if the names of packages in the repos changed partway through a release. Between releases, package names do change and packages get added or removed (and sometimes it takes a while for third-party packages to get added, e.g. Docker). I think the usual way to handle this is to have a way of handling the differences between releases, like here.
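In a plain shell build script, one hedged way to express such per-release differences is to branch on the release codename (the bionic/focal name mapping below reflects the `libv8-dev` to `libnode-dev` rename discussed above, and should be double-checked against each release's archive):

```shell
# Pick the V8/Node dev package name based on the Ubuntu release.
# /etc/os-release defines VERSION_CODENAME on Ubuntu systems.
if [ -r /etc/os-release ]; then
    . /etc/os-release
fi

case "${VERSION_CODENAME:-}" in
    bionic|focal) v8_pkg=libv8-dev ;;   # older releases (assumption)
    *)            v8_pkg=libnode-dev ;; # jammy and later
esac

echo "would install: $v8_pkg"
```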
If we have to make a choice for this release, I think I would lean towards supporting one release only.
:white_check_mark: Checklist
:computer: System information
:cactus: Powershell module versions
Not relevant
:no_entry_sign: Describe the problem
After building a new VM image, the `analyse_build.py` script is reporting two failed steps and that the build process is still running. However, the steps have not failed, and the build process appears to have completed successfully.

:deciduous_tree: Log messages
Relevant log messages
```none
srdbuildadmin@CandidateSecureResearchDesktop-Ubuntu2004-202403051504:~$ /opt/monitoring/analyse_build.py
[2024-03-05 15:05:37: SUCCESS] Build started:
[2024-03-05 15:06:36: SUCCESS] Running cloud-init modules: 59s
[2024-03-05 15:36:16: SUCCESS] Preparing system for installation: 29m39s
[2024-03-05 15:38:23: SUCCESS] Installing additional deb/source packages: 2m7s
[2024-03-05 15:41:21: SUCCESS] Updating ClamAV database: 2m58s
[2024-03-05 15:54:25: SUCCESS] Installing Spark/Hadoop: 13m4s
[2024-03-05 16:07:02: SUCCESS] Installing Julia and packages: 12m37s
[2024-03-05 16:22:31: SUCCESS] Installing Python (3.9.18) and packages: 15m29s
[2024-03-05 16:35:24: SUCCESS] Installing Python (3.10.13) and packages: 12m53s
[2024-03-05 16:47:17: SUCCESS] Installing Python (3.11.8) and packages: 11m53s
[2024-03-05 16:47:29: SUCCESS] Configuring pyenv: 12s
[2024-03-05 17:11:57: SUCCESS] Installing R packages: 24m28s
[2024-03-05 17:12:09: SUCCESS] Cleaning up the build environment: 12s
[2024-03-05 17:12:15: SUCCESS] Checking environment configuration: 6s
[2024-03-05 17:12:17: FAIL   ] Removing Linux omsagent: 2s
[2024-03-05 17:12:18: FAIL   ] Finished build: 2h6m41s
[2024-03-06 09:54:02: RUNNING] Installing/updating Ubuntu packages: 16h41m43s
[2024-03-05 17:12:17: INFO   ] Memory available: 17 GB
[2024-03-05 17:12:17: INFO   ] ..... mean usage: 11.37% => 1.9 GB
[2024-03-05 17:12:17: INFO   ] ...... min usage: 5.34% => 0.9 GB
[2024-03-05 17:12:17: INFO   ] ...... max usage: 33.13% => 5.6 GB
[2024-03-05 17:12:17: INFO   ] CPU available: 8 cores
[2024-03-05 17:12:17: INFO   ] ..... mean usage: 36.19% => 2.9 cores
[2024-03-05 17:12:17: INFO   ] ...... min usage: 0.10% => 0.0 cores
[2024-03-05 17:12:17: INFO   ] ...... max usage: 100.00% => 8.0 cores
```

:recycle: To reproduce