alan-turing-institute / data-safe-haven

https://data-safe-haven.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Build analysis script failing on new VM image #1755

Closed: craddm closed this issue 6 months ago

craddm commented 7 months ago

:white_check_mark: Checklist

:computer: System information

:cactus: Powershell module versions

Not relevant

:no_entry_sign: Describe the problem

After building a new VM image, the analyse_build.py script is reporting two failed steps and that the build process is still running. However, the steps have not failed and the build process appears to have successfully completed.

:deciduous_tree: Log messages

Relevant log messages:

```none
srdbuildadmin@CandidateSecureResearchDesktop-Ubuntu2004-202403051504:~$ /opt/monitoring/analyse_build.py
[2024-03-05 15:05:37: SUCCESS] Build started:
[2024-03-05 15:06:36: SUCCESS] Running cloud-init modules: 59s
[2024-03-05 15:36:16: SUCCESS] Preparing system for installation: 29m39s
[2024-03-05 15:38:23: SUCCESS] Installing additional deb/source packages: 2m7s
[2024-03-05 15:41:21: SUCCESS] Updating ClamAV database: 2m58s
[2024-03-05 15:54:25: SUCCESS] Installing Spark/Hadoop: 13m4s
[2024-03-05 16:07:02: SUCCESS] Installing Julia and packages: 12m37s
[2024-03-05 16:22:31: SUCCESS] Installing Python (3.9.18) and packages: 15m29s
[2024-03-05 16:35:24: SUCCESS] Installing Python (3.10.13) and packages: 12m53s
[2024-03-05 16:47:17: SUCCESS] Installing Python (3.11.8) and packages: 11m53s
[2024-03-05 16:47:29: SUCCESS] Configuring pyenv: 12s
[2024-03-05 17:11:57: SUCCESS] Installing R packages: 24m28s
[2024-03-05 17:12:09: SUCCESS] Cleaning up the build environment: 12s
[2024-03-05 17:12:15: SUCCESS] Checking environment configuration: 6s
[2024-03-05 17:12:17: FAIL   ] Removing Linux omsagent: 2s
[2024-03-05 17:12:18: FAIL   ] Finished build: 2h6m41s
[2024-03-06 09:54:02: RUNNING] Installing/updating Ubuntu packages: 16h41m43s
[2024-03-05 17:12:17: INFO   ] Memory available: 17 GB
[2024-03-05 17:12:17: INFO   ] ..... mean usage: 11.37% => 1.9 GB
[2024-03-05 17:12:17: INFO   ] ...... min usage: 5.34% => 0.9 GB
[2024-03-05 17:12:17: INFO   ] ...... max usage: 33.13% => 5.6 GB
[2024-03-05 17:12:17: INFO   ] CPU available: 8 cores
[2024-03-05 17:12:17: INFO   ] ..... mean usage: 36.19% => 2.9 cores
[2024-03-05 17:12:17: INFO   ] ...... min usage: 0.10% => 0.0 cores
[2024-03-05 17:12:17: INFO   ] ...... max usage: 100.00% => 8.0 cores
```

:recycle: To reproduce

jemrobinson commented 7 months ago

I think this is really only one failed step: the line `[2024-03-05 17:12:18: FAIL ] Finished build: 2h6m41s` is marked as a failure because one or more earlier steps failed during the build.

craddm commented 7 months ago

Fair enough - but the failed step hasn't failed as far as I can tell, and the step that is "running" isn't running.

jemrobinson commented 7 months ago

Can you post the contents of `/var/log/cloud-init-output.log` here?

Presumably it's the line `sh onboard_agent.sh --purge || die "Could not remove Linux omsagent!";` that has caused this failure. Are we sure that:

  1. this job succeeded
  2. this script returns 0 on success (a quick way to check this is sketched below)
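
For example, a minimal manual check of point 2 might be (a sketch only; it re-runs the same command the build uses, and the path to `onboard_agent.sh` is an assumption):

```bash
# Sketch: re-run the removal command by hand and inspect its exit status.
# The path to onboard_agent.sh is an assumption; adjust as needed.
sh ./onboard_agent.sh --purge
echo "onboard_agent.sh exited with status $?"   # 0 should indicate a successful purge
```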
craddm commented 7 months ago

> Can you post the contents of `/var/log/cloud-init-output.log` here?
>
> Presumably it's the line `sh onboard_agent.sh --purge || die "Could not remove Linux omsagent!";` that has caused this failure. Are we sure that:
>
> 1. this job succeeded
> 2. this script returns 0 on success

I've added the cloud-init log to the issue.

OMSagent is not installed, which I'm taking to mean the removal script ran successfully. The alternative would be that it was never installed in the first place, but that would mean something changed in the last few weeks. Correction: apparently the step has failed after all, as the OMS agent extension now shows up as installed. (Weirdly, this is my second attempt at building a VM image where I've hit this issue, and the first time the agent wasn't installed.)

I wasn't getting this problem when I built a VM image a few weeks ago after implementing this fix, FWIW.

jemrobinson commented 7 months ago

Can we run `onboard_agent.sh` in some kind of debug/verbose mode?

craddm commented 7 months ago

Ah, of course the cloud-init log is too long to post. Basically there is nothing in the log to indicate a failure, but also nothing to indicate a success.

I just ran `onboard_agent.sh` manually and it does seem to be verbose and return a 0 on success.

None of that shows up in the cloud-init log, however - it's like the process stops at a certain point.

Here's the end of `cloud-init-output.log`:

```none
178300K .......... .......... .......... .......... .......... 74%  187M 0s
178350K .......... .......... .......... .......... .......... 74%  279M 0s
178400K .......... .......... .......... .......... .......... 74%  305M 0s
178450K .......... .......... .......... ........Cloud-init v. 23.4.4-0ubuntu0~20.04.1 running 'init-local' at Tue, 05 Mar 2024 23:04:58 +0000. Up 7.24 seconds.
Cloud-init v. 23.4.4-0ubuntu0~20.04.1 running 'init' at Tue, 05 Mar 2024 23:05:02 +0000. Up 10.60 seconds.
ci-info: +++++++++++++++++++++++++++++++++++++++Net device info+++++++++++++++++++++++++++++++++++++++
ci-info: +--------+------+------------------------------+---------------+--------+-------------------+
ci-info: | Device |  Up  |           Address            |      Mask     | Scope  |     Hw-Address    |
ci-info: +--------+------+------------------------------+---------------+--------+-------------------+
ci-info: |  eth0  | True |          10.48.0.10          | 255.255.255.0 | global | 60:45:bd:d0:87:7e |
ci-info: |  eth0  | True | fe80::6245:bdff:fed0:877e/64 |       .       |  link  | 60:45:bd:d0:87:7e |
ci-info: |   lo   | True |          127.0.0.1           |   255.0.0.0   |  host  |         .         |
ci-info: |   lo   | True |           ::1/128            |       .       |  host  |         .         |
ci-info: +--------+------+------------------------------+---------------+--------+-------------------+
ci-info: +++++++++++++++++++++++++++++++Route IPv4 info+++++++++++++++++++++++++++++++
ci-info: +-------+-----------------+-----------+-----------------+-----------+-------+
ci-info: | Route |   Destination   |  Gateway  |     Genmask     | Interface | Flags |
ci-info: +-------+-----------------+-----------+-----------------+-----------+-------+
ci-info: |   0   |     0.0.0.0     | 10.48.0.1 |     0.0.0.0     |    eth0   |   UG  |
ci-info: |   1   |    10.48.0.0    |  0.0.0.0  |  255.255.255.0  |    eth0   |   U   |
ci-info: |   2   |  168.63.129.16  | 10.48.0.1 | 255.255.255.255 |    eth0   |  UGH  |
ci-info: |   3   | 169.254.169.254 | 10.48.0.1 | 255.255.255.255 |    eth0   |  UGH  |
ci-info: +-------+-----------------+-----------+-----------------+-----------+-------+
ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: | Route | Destination | Gateway | Interface | Flags |
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: |   1   |  fe80::/64  |    ::   |    eth0   |   U   |
ci-info: |   3   |    local    |    ::   |    eth0   |   U   |
ci-info: |   4   |  multicast  |    ::   |    eth0   |   U   |
ci-info: +-------+-------------+---------+-----------+-------+
Cloud-init v. 23.4.4-0ubuntu0~20.04.1 running 'modules:config' at Tue, 05 Mar 2024 23:05:08 +0000. Up 17.25 seconds.
Cloud-init v. 23.4.4-0ubuntu0~20.04.1 running 'modules:final' at Tue, 05 Mar 2024 23:05:09 +0000. Up 18.06 seconds.
System setup through cloud-init is finished. Configuration took 19.03 seconds
```

You can see that it seems to cut off in the middle of a download. So perhaps for some reason the VM is shutting down before completing this step.

craddm commented 7 months ago

Digging through `cloud-init-output.log` finds a couple of other oddities:

```none
Cloud-init v. 23.4.4-0ubuntu0~20.04.1 running 'modules:final' at Tue, 05 Mar 2024 15:06:36 +0000. Up 121.87 seconds.
2024-03-05 15:35:43,830 - util.py[WARNING]: Failed to install packages: ['adcli', 'apt-transport-https', 'aspell', 'auditd', 'automake', 'autotools-dev', 'bison', 'blobfuse', 'build-essential', 'cargo', 'clamav', 'clamav-base', 'clamav-daemon', 'clamav-freshclam', 'clamav-unofficial-sigs', 'cmake', 'code', 'cuda', 'curl', 'dbeaver-ce', 'default-jdk', 'emacs', 'firefox', 'flex', 'g++', 'gcc', 'gdebi-core', 'gfortran', 'git', 'git-core', 'grass', 'imagemagick', 'jags', 'jupyter', 'jupyter-nbconvert', 'jupyter-notebook', 'krb5-user', 'ldap-utils', 'libboost-all-dev', 'libbz2-dev', 'libcudnn8', 'libcudnn8-dev', 'libcurl4-openssl-dev', 'libeccodes-dev', 'libffi-dev', 'libfftw3-dev', 'libfribidi-dev', 'libgdal-dev', 'libglpk-dev', 'libgmp-dev', 'libgsl-dev', 'libgtk2.0-dev', 'libharfbuzz-dev', 'liblapack-dev', 'liblzma-dev', 'libmagick++-dev', 'libmpfr-dev', 'libncurses5-dev', 'libncursesw5-dev', 'libopenblas-dev', 'libpoppler-cpp-dev', 'libpq-dev', 'libreadline-dev', 'libreoffice', 'libsasl2-dev', 'libsodium-dev', 'libsqlite3-dev', 'libssl-dev', 'libtool', 'libudunits2-dev', 'libv8-dev', 'libxml2-dev', 'libzmq3-dev', 'llvm', 'lyx', 'm4', 'nfs-common', 'octave', 'onboard', 'p7zip', 'parallel', 'postgresql-client', 'python3', 'python3-openssl', 'python3-pip', 'qgis', 'r-base', 'r-base-core', 'r-base-dev', 'realmd', 'ristretto', 'rustc', 'scala', 'sssd', 'sssd-tools', 'sysstat', 'texlive-full', 'texstudio', 'tk-dev', 'tmpreaper', 'unixodbc-dev', 'unzip', 'weka', 'weka-doc', 'wget', 'xfce4', 'xfce4-terminal', 'xrdp', 'xz-utils', 'zlib1g-dev', 'zsh']
2024-03-05 15:35:43,948 - cc_package_update_upgrade_install.py[WARNING]: 1 failed with exceptions, re-raising the last one
2024-03-05 15:35:43,950 - util.py[WARNING]: Running module package_update_upgrade_install (<module 'cloudinit.config.cc_package_update_upgrade_install' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_package_update_upgrade_install.py'>) failed
```

Note that many of the packages listed do in fact get installed successfully at some point, as they're available at the command line.
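
One quick way to see which of the reportedly failed packages actually made it onto the machine is to query dpkg for each one (a sketch; `failed-packages.txt`, containing one package name per line, is a hypothetical filename):

```bash
# Sketch: check which of the reportedly "failed" apt packages are actually installed.
# failed-packages.txt (one package name per line) is a hypothetical input file.
while read -r pkg; do
    if dpkg -s "$pkg" >/dev/null 2>&1; then
        echo "installed: $pkg"
    else
        echo "MISSING:   $pkg"
    fi
done < failed-packages.txt
```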

```none
ERROR: shui 0.7.0 has requirement packaging<22.0,>=21.0, but you'll have packaging 23.2 which is incompatible.
Installing collected packages: packaging, ujson, psutil, glances, semantic-version, charset-normalizer, requests, python-gnupg, termcolor, fire, attrs, rpds-py, referencing, zipp, importlib-resources, jsonschema-specifications, pkgutil-resolve-name, jsonschema, wget, requests-futures, jill, numpy, contourpy, fonttools, matplotlib, PyYAML, tqdm, pastel, pylev, clikit, cleo, lib-detect-testenv, cli-exit-tools, pathlib3x, url-normalize, shui
  Attempting uninstall: psutil
    Found existing installation: psutil 5.5.1
    Not uninstalling psutil at /usr/lib/python3/dist-packages, outside environment /usr
    Can't uninstall 'psutil'. No files were found to uninstall.
```

*(continues with various other packages)*

The shui error doesn't cause a problem, as Spark/Hadoop gets successfully installed later on.

jemrobinson commented 7 months ago

`shui` is here. I can release an updated version if that helps. Which Python version are you running with?

craddm commented 7 months ago

System Python is 3.8.10

jemrobinson commented 7 months ago

It's annoying that a failure in the packages step doesn't cause the job to fail. Our workaround is to check for as many as possible of the executables that should have been installed during this step; if any of them are not found, we mark the build as failed.
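
As a rough illustration of that workaround (a sketch only, not the actual `check_installation.py` logic; the list of executables here is illustrative):

```bash
# Sketch of the executable-presence workaround: if any expected binary is missing,
# treat the package-installation step as failed. The list below is illustrative.
expected="gcc gfortran R julia spark-shell psql code"
missing=0
for exe in $expected; do
    if ! command -v "$exe" >/dev/null 2>&1; then
        echo "ERROR: $exe not found!"
        missing=$((missing + 1))
    fi
done
[ "$missing" -eq 0 ] || exit 1   # a non-zero exit marks the build as failed
```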

craddm commented 7 months ago

I also just ran the `check_installation.py` script manually, and that also finds two errors:

```none
srdbuildadmin@CandidateSecureResearchDesktop-Ubuntu2004-202403051504:~$ sudo python3 /opt/build/check_installation.py
Programming languages:
... cmake [/usr/bin/cmake] 3.16.3
... g++ [/usr/bin/g++] 9.4.0
... gcc [/usr/bin/gcc] 9.4.0
... gfortran [/usr/bin/gfortran] 9.4.0
... java [/usr/bin/java] 11.0.22
... julia [/usr/local/bin/julia] 1.10.2
... Python 3.10 [/opt/pyenv/versions/3.10.13/bin/python] 3.10.13
... Python 3.9 [/opt/pyenv/versions/3.9.18/bin/python] 3.9.18
... Python 3.11 [/opt/pyenv/versions/3.11.8/bin/python] 3.11.8
... R [/usr/bin/R] 4.3.3
... rustc [/usr/bin/rustc] 1.75.0
... scala [/usr/bin/scala] 2.11.12
... ERROR spark-shell not found!
Package managers:
... cargo [/usr/bin/cargo] cargo 1.75.0
... pip (Python 3.10) [/opt/pyenv/versions/3.10.13/bin/pip] 24.0
... pip (Python 3.9) [/opt/pyenv/versions/3.9.18/bin/pip] 24.0
... pip (Python 3.11) [/opt/pyenv/versions/3.11.8/bin/pip] 24.0
Editors/IDEs:
... code [/usr/bin/code] 1.87.0
... emacs [/usr/bin/emacs] 26.3
... nano [/usr/bin/nano] 4.8
... pycharm-community [/snap/bin/pycharm-community] 2023.3.4
... rstudio [/usr/bin/rstudio] 2023.12.1+402
... vim [/usr/bin/vim] 8.1.2269-1ubuntu5.21
Presentation tools:
... latex [/usr/bin/latex] 3.14159265-2.6-1.40.20
... libreoffice [/usr/bin/libreoffice] 6.4.7.2
... xelatex [/usr/bin/xelatex] 3.14159265-2.6-0.999991
Development tools:
... azuredatastudio [/usr/bin/azuredatastudio] 1.48.0-1708981902
... bash [/usr/bin/bash] 5.0.17(1)-release
... dbeaver-ce [/usr/bin/dbeaver-ce] 24.0.0
... firefox [/usr/bin/firefox] 123.0.1
... git [/usr/bin/git] 2.25.1
... htop [/usr/bin/htop] 2.2.0
... nvidia-smi [/usr/bin/nvidia-smi] 550.54.14
... psql [/usr/bin/psql] 16.2-1.pgdg20.04+1
... ERROR sqlcmd not found!
... weka [/usr/bin/weka] 3.6.14
36 test(s), 2 failure(s)
```

These seem to be false alarms, as both of the erroring commands work from the CLI.

EDIT: Never mind, that's because I'm running it with sudo and they're not on the PATH there.

jemrobinson commented 7 months ago

Weird though, as we run cloud-init as root and copy the PATH into /etc/skel, so they ought to be locatable in both cases.
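
A quick way to see the discrepancy is to compare the two environments directly (a sketch; which directories are missing will vary per image):

```bash
# Sketch: compare the PATH a normal user shell sees with the PATH used under sudo.
echo "user PATH: $PATH"
sudo sh -c 'echo "sudo PATH: $PATH"'

# sudo normally resets PATH to the secure_path value from /etc/sudoers, so tools
# whose directories were appended to the user's PATH (e.g. under /opt) may not be
# resolvable when the check script is run with sudo.
which spark-shell sqlcmd
sudo which spark-shell sqlcmd
```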

craddm commented 7 months ago

I suspect that the failure is actually the failure in the packages step, and that the logs are being parsed incorrectly, misattributing the failure to the omsagent removal step.

From the part of the cloud-init dump below it seems clear that this step should be a fail, but it's marked as still running by `analyse_build.py`.

```json
{
  "description": "running config-package_update_upgrade_install with frequency once-per-instance",
  "event_type": "finish",
  "name": "modules-final/config-package_update_upgrade_install",
  "origin": "cloudinit",
  "result": "FAIL",
  "timestamp": 1709731334.0
}
```
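
If it helps with checking the parsing, the failed stages can be pulled straight out of the event dump with jq (a sketch; it assumes the JSON above came from `cloud-init analyze dump` and that jq is available on the build VM):

```bash
# Sketch: list every cloud-init event that finished with a FAIL result.
sudo cloud-init analyze dump | jq '.[] | select(.event_type == "finish" and .result == "FAIL")'
```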
jemrobinson commented 7 months ago

It's entirely possible that the omsagent step really is failing, e.g. if it tries to use an executable that isn't there or similar. I don't think we'll solve the issue of cloud-init failures for 4.2.0, but it's something to think about for the Pulumi release (e.g. does Bureau have a more robust system for checking that packages are installed @JimMadge ?)

craddm commented 7 months ago

ok, something very very odd is going on. yesterday evening (~5pm), the omsagent was definitely not there on the latest build candidate. The folders were not there on the machine, and omsagent was not showing up on the portal as an extension/application for the machine. This morning it is installed again.

jemrobinson commented 7 months ago

Could we make removing omsagent part of the deploy-time cloud-init instead?

craddm commented 7 months ago

Hmm, the instructions here for unregistering a workspace might do the job without having to actually remove the agent: https://learn.microsoft.com/en-us/azure/azure-monitor/agents/agent-manage?tabs=PowerShellLinux#enable-auto-update-for-the-linux-agent

I'll try that first as part of the image-building process, but it should also work as part of the deploy-time cloud-init.

JimMadge commented 7 months ago

> (e.g. does Bureau have a more robust system for checking that packages are installed @JimMadge ?)

This, and a lot of other things, will be easier in Ansible compared to bash/cloud-init. On this specifically, you could make assertions or (hopefully) use roles to make sure any dependencies are installed with idempotent tasks.

JimMadge commented 7 months ago

> ok, something very very odd is going on. yesterday evening (~5pm), the omsagent was definitely not there on the latest build candidate. The folders were not there on the machine, and omsagent was not showing up on the portal as an extension/application for the machine. This morning it is installed again.

This makes me wonder whether this is an organisational-level Azure policy, like the NetworkWatcher RGs that keep appearing.

craddm commented 7 months ago

> > ok, something very very odd is going on. yesterday evening (~5pm), the omsagent was definitely not there on the latest build candidate. The folders were not there on the machine, and omsagent was not showing up on the portal as an extension/application for the machine. This morning it is installed again.
>
> This makes me wonder whether this is an organisational-level Azure policy, like the NetworkWatcher RGs that keep appearing.

Exactly what I was thinking. I had a quick look at the policies on the VM images subscription and haven't spotted anything obvious. Found it: there is indeed a policy that checks whether machines have the monitoring agent installed and installs it if not. It was applied this morning at ~4am.

So I suspect the omsagent removal step has been working: as I say, immediately after creating the VM, omsagent is absent; it just gets reinstalled later. Normally we'd have moved on to generalising the VM image before that happens.
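
For confirming this on future builds, both the policy assignments and the VM's extensions can be inspected from the CLI (a sketch using the Azure CLI; the resource group and VM names are placeholders):

```bash
# Sketch: list policy assignments in the current subscription and check which
# extensions are present on the build candidate VM.
# <resource-group> and <vm-name> are placeholders.
az policy assignment list --query "[].{name:name, displayName:displayName}" --output table
az vm extension list --resource-group <resource-group> --vm-name <vm-name> --output table
```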

JimMadge commented 7 months ago

In that case, does it mean everything should work so long as you complete the image build quickly and avoid 4am?

craddm commented 7 months ago

Possibly, but it's still not totally clear to me what is and isn't failing. Something is failing, even if it isn't the omsagent step, and I'm leaning towards that being incorrect parsing of the logs. It appears to be something to do with some apt packages failing to install, which can be seen in the cloud-init log excerpts above. But as far as I can tell those packages are actually installed, so I'm not sure why they come up as failed.

craddm commented 7 months ago

Right, OK, scratch that. Omsagent was not present on the newest VM image I created, even though this time the removal step definitely failed: I had swapped in the command to disconnect an existing omsagent from the default workspace, and that failed because there was no omsagent present to disconnect.

What I'm thinking now:

1. For some SRD candidate images, the image was left for long enough after the initial build that the omsagent was installed automatically as per the policy.
2. SRDs created from such an image would have omsagent registered with the default workspace, which would then create issues when trying to create new SRDs within an SRE.
3. SRDs that go from the build-candidate step to the gallery step before the omsagent is installed automatically would never hit this problem.

Thus, the multihoming problem might appear with some images but not others.

This also means that removing anything to do with omsagent as part of the cloud-init for building the SRD candidate image won't necessarily fix the problem, as it may be automatically re-installed anyway as per policy.

Adding this to the deploy-time cloud-init might be more robust: check whether omsagent is present and, if so, remove its association with the default workspace. That way it should work fine.

I would note that I am sure that the omsagent removal step did not produce a fail message when I was testing it at the time, so I'm still quite confused about that.

jemrobinson commented 7 months ago

How about removing omsagent during image creation then?

craddm commented 7 months ago

Something I've just tried that seems to work: I modified the `deprovision_vm.sh` script so that it checks whether omsagent is present and, if it is, disconnects it from any connected workspaces before deprovisioning the VM. The resulting VM image is not linked to any workspace, even if omsagent is still installed, and can then be hooked up to the correct workspace using `Setup_SRE_Monitoring.ps1` as normal.
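
Not the actual change to `deprovision_vm.sh`, but the idea amounts to something like this (a sketch; the `omsadmin.sh` path is the default OMS agent install location and the `-l`/`-X` flags follow Microsoft's agent documentation, so worth double-checking before relying on them):

```bash
# Sketch of the deprovision-time guard: if the OMS agent is present,
# disconnect it from every registered workspace before deprovisioning.
# /opt/microsoft/omsagent/bin/omsadmin.sh is the default install path.
OMSADMIN=/opt/microsoft/omsagent/bin/omsadmin.sh
if [ -x "$OMSADMIN" ]; then
    echo "omsagent found; removing workspace registrations"
    "$OMSADMIN" -l    # list currently registered workspaces
    "$OMSADMIN" -X    # remove the agent from all workspaces
else
    echo "omsagent not installed; nothing to disconnect"
fi
```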

craddm commented 7 months ago

Note that some packages in the Ubuntu repository had changed name, and this was causing an additional failure of the updating/installing Ubuntu packages step. It was not easy to figure out which packages were failing, as lots of packages purportedly failed to install but were subsequently found to have installed correctly.

To debug it, we uploaded packages-apt.list and passed it to apt-get install using xargs, which highlighted the two packages that could not be found because they had been removed or renamed.
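
The debugging step was roughly the following (a sketch; the path to `packages-apt.list` on the build VM is an assumption, and `--dry-run` is used so nothing actually gets installed):

```bash
# Sketch: feed the package list to apt-get and let it report any unknown package names.
# /opt/build/packages-apt.list is an assumed path; --dry-run avoids changing the system.
xargs -a /opt/build/packages-apt.list sudo apt-get install --dry-run
```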

JimMadge commented 7 months ago

Reopening until this is merged/confirmed.

@craddm Did you look at what those packages were? Might removing them affect other things? We should probably have different package files for each Ubuntu release, as package names and availability change between releases.

craddm commented 7 months ago

I removed `git-core`, which was just a dummy package that installs the `git` package, and I renamed `libv8-dev` to `libnode-dev`. These were the two packages reported as problematic when trying to install from packages-apt.list, and after removing/renaming them the new SRD image built without errors.

JimMadge commented 7 months ago

`git-core` may not be a problem. If the libv8/libnode package has changed name, though, renaming it in the list will break builds for older (presumably focal) releases where the old name is correct.

craddm commented 7 months ago

It was on focal that we were having the problem. We haven't tried building with jammy. We weren't using jammy as it wasn't supported by the Log Analytics Agent, although it actually should be now.

JimMadge commented 7 months ago

Oh I see, surprised we didn't notice before. What about bionic? Perhaps the sensible thing is to only support focal this release?

jemrobinson commented 7 months ago

We can only use 20.04 at the moment as this is the latest version supported by Azure Update Management.

JimMadge commented 6 months ago

For the next release, we should look into this as 24.04 is coming soon. Being two LTSs behind doesn't feel great.

JimMadge commented 6 months ago

> Note that some packages in the Ubuntu repository had changed name, and this was causing an additional failure of the updating/installing Ubuntu packages step.

To be clear on this part: I would be quite surprised if the names of packages in the repos changed part way through a release. Between releases, package names do change and packages get added or removed (and sometimes it takes a while for third-party packages to get added, e.g. Docker). I think the usual way to handle this is to have a mechanism for dealing with the differences between releases, like here.
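
As a rough illustration of that approach (a sketch only; the per-release list filenames and the build path are hypothetical, not what the repo currently uses):

```bash
# Sketch: pick a package list matching the Ubuntu release being built.
# packages-apt-<codename>.list filenames and /opt/build are hypothetical.
codename="$(lsb_release -cs)"   # e.g. focal, jammy
list="/opt/build/packages-apt-${codename}.list"
if [ ! -f "$list" ]; then
    echo "No package list for ${codename}" >&2
    exit 1
fi
xargs -a "$list" sudo apt-get install -y
```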

If we have to make a choice for this release, I think I would lean towards supporting one release only.