alan-turing-institute / data-safe-haven

https://data-safe-haven.readthedocs.io
BSD 3-Clause "New" or "Revised" License
58 stars 15 forks source link

Path clash with Azure local storage #2221

Open craddm opened 1 day ago

craddm commented 1 day ago

:white_check_mark: Checklist

:computer: System information

:package: Packages

List of packages ```noen acme==2.10.0 annotated-types==0.7.0 appdirs==1.4.4 Arpeggio==2.0.2 attrs==24.2.0 azure-common==1.1.28 azure-core==1.31.0 azure-identity==1.18.0 azure-keyvault-certificates==4.8.0 azure-keyvault-keys==4.9.0 azure-keyvault-secrets==4.8.0 azure-mgmt-compute==33.0.0 azure-mgmt-containerinstance==10.1.0 azure-mgmt-core==1.4.0 azure-mgmt-dns==8.1.0 azure-mgmt-keyvault==10.3.1 azure-mgmt-msi==7.0.0 azure-mgmt-rdbms==10.1.0 azure-mgmt-resource==23.1.1 azure-mgmt-storage==21.2.1 azure-storage-blob==12.23.1 azure-storage-file-datalake==12.17.0 azure-storage-file-share==12.18.0 certifi==2024.8.30 cffi==1.17.1 charset-normalizer==3.3.2 chevron==0.14.0 click==8.1.7 cryptography==43.0.1 -e git+https://github.com/craddm/data-safe-haven.git@a6a6993ad7bbcc02e8f05629fe0fd5ab7154a900#egg=data_safe_haven debugpy==1.8.6 dill==0.3.9 dnspython==2.6.1 fqdn==1.5.1 grpcio==1.60.2 idna==3.10 isodate==0.6.1 josepy==1.14.0 markdown-it-py==3.0.0 mdurl==0.1.2 msal==1.31.0 msal-extensions==1.2.0 msrest==0.7.1 oauthlib==3.2.2 parver==0.5 portalocker==2.10.1 protobuf==4.25.5 psycopg==3.2.3 pulumi==3.134.1 pulumi_azure_native==2.63.0 pulumi_random==4.16.6 pycparser==2.22 pydantic==2.9.2 pydantic_core==2.23.4 Pygments==2.18.0 PyJWT==2.9.0 pyOpenSSL==24.2.1 pyRFC3339==1.1 pytz==2024.2 PyYAML==6.0.2 requests==2.32.3 requests-oauthlib==2.0.0 rich==13.8.1 semver==2.13.0 setuptools==75.1.0 shellingham==1.5.4 simple_acme_dns==3.1.0 six==1.16.0 typer==0.12.5 typing_extensions==4.12.2 urllib3==2.2.3 validators==0.28.3 websocket-client==1.8.0 ```

:no_entry_sign: Describe the problem

Virtual machine sizes with local storage (mounted at /mnt) clash with NFS mounts defined in /etc/fstab.

:deciduous_tree: Log messages

Relevant log messages ![image](https://github.com/user-attachments/assets/6a47c817-c91b-467a-959d-619a9806b662)

:recycle: To reproduce

Deploy a workspace using a VM size with local storage.

jemrobinson commented 1 day ago

I saw the same problem when we first tested #2092. @JimMadge : can you take a look?

craddm commented 1 day ago

Same missing setting you report there - ldap missing from group and passwd image

JimMadge commented 1 day ago

Looks like the problem was you already have a disk mounted at /mnt. What size are you using?

jemrobinson commented 1 day ago

@JimMadge : Is something automatically mounted at /mnt? Possibly the (badly documented) temp disk that lives on the same physical machine as the VM?

This might also explain why mounting at /shared was fine but /mnt/shared causes a problem.

craddm commented 1 day ago

Standard_D2s_v3

btw I literally just got it running by creating the /mnt subdirectories from the console. Maybe adding the -m flag to mount -fav, so it creates the directories if they don't exist, will fix it?

JimMadge commented 1 day ago

Hmm, I'm not sure. That size shouldn't have a temporary disk.

You can see in the fstab output though that there is device /dev/disk/cloud/azure_resource-part-1 at /mnt. That doesn't look right to me. I haven't seen this problem in any deployments I've done recently.

jemrobinson commented 1 day ago

Is it worth abandoning /mnt and putting our drives somewhere else?

Or alternatively, explicitly adding something like the following to /etc/fstab

/dev/disk/cloud/azure_resource-part1    /mnt/tmp    auto    defaults,nofail,_netdev 0   2
JimMadge commented 1 day ago

I'd rather understand what is happening here. It isn't a mount that we have defined and I'm not sure what it is.

If it is something Azure is adding we could make sure to remove entries like that from fstab (if that is the problem here).

jemrobinson commented 1 day ago

This end of this section says "For Linux VMs, the temporary disk is /dev/sdb1 and is mounted at /mnt/resource or /mnt.", so I'm not sure we can disable this.

According to this outdated answer it might be possible to change in waagent.conf. However, I'm not confident that we could use cloud-init to change that file since waagent is used to run cloud-init.

Otherwise, we should see whether adding an explicit mount point for /dev/disk/cloud/azure_resource-part1 will fix it (as above).

JimMadge commented 1 day ago

Oh wait, the Dvs3 series do have temporary disks.

Is there a reason to use such an old offering?

jemrobinson commented 1 day ago

I think we should be robust against the use of VM SKUs that have temporary disks (which is most of them) regardless of whether Dvs3 is a sensible SKU to use.

craddm commented 1 day ago

It's just the one that was our default on the old codebase, so I still use it as a default out of habit, but as @jemrobinson, should be robust to stuff like this IMO. (I think the GPU VMs we currently recommend -e.g. Stanard_NC6s_v3 also have local temp disks?)

JimMadge commented 1 day ago

(which is most of them)

I'm not sure if that is true, it tends to be the older offerings. Looks like the high performance sizes and those with accelerators have local disks though, sometimes NVMe.

jemrobinson commented 1 day ago

From here

Most VMs contain a temporary disk, which is not a managed disk.

JimMadge commented 1 day ago

I'm sure it has been true but the trend in the current and new general purpose sizes is to not include local storage and provide 'd' series variants for those that want it. Noting that line has been in the docs for quite a few years. That said, I don't think I want to enumerate all the available sizes or sizes*availability 😅.

It does seem common on sizes with accelerators though. That makes sense as the users would likely want fast storage, and the physical nodes in the data centre are more likely to have onboard storage.

We'll need to fix this to enable GPU/FPGA sizes.

craddm commented 1 day ago

Adding -o X-mount.mkdir fixed this on the Standard_D2s_v3 at least, although I'm still unable to login

e.g. while (! mountpoint -q /mnt/input); do sleep 5; mount -o X-mount.mkdir /mnt/input; done

JimMadge commented 1 day ago

What does that option do?

I think I'd rather just not mount local storage at /mnt so that we are consistent with machine with and without local storage.

craddm commented 1 day ago

It creates the directory if it doesn't exist

jemrobinson commented 1 day ago

Alias --mkdir might be more clear?

craddm commented 1 day ago

Alias --mkdir might be more clear?

it would be, but for some reason that's not a supported option on the machine I tested

jemrobinson commented 1 day ago

This doesn't really deal with the question of what happens in these scenarios:

OR

In either case, we'd lose access to /mnt/ingress. I think it's safer/better to overwrite where the temp disk mounts or to move our mounts outside /mnt.

JimMadge commented 1 day ago

I think the best solution would be that we standardise where our mounts and temp disk(s) are in all cases.

I like having our mounts at /mnt, it feels idiomatic. I would put local disks at something like /scratch, /var/scratch, /mnt/scratch.

We might need some logic like "if /dev/disk/cloud/... then ..."

jemrobinson commented 1 day ago

Agreed that /var/scratch or /mnt/scratch makes sense for the temp disk. Have you tried the fstab line I put above?

craddm commented 1 day ago

This doesn't really deal with the question of what happens in these scenarios:

  • temp disk mounted at /mnt
  • data mounted at /mnt/ingress
  • temp disk unmounted

OR

  • data mounted at /mnt/ingress
  • temp disk mounted at /mnt

In either case, we'd lose access to /mnt/ingress. I think it's safer/better to overwrite where the temp disk mounts or to move our mounts outside /mnt.

Looks like it's possible to change the location of the temp disk in /etc/waagent.conf:

There's a field ResourceDisk.MountPoint which is set to /mnt on our VMs

https://learn.microsoft.com/en-us/azure/virtual-machines/extensions/agent-linux

jemrobinson commented 1 day ago

We might be going in circles here, but since waagent is used to run cloud-init, we probably can't use a cloud-init command to change the configuration.

JimMadge commented 23 hours ago

Can we give arguments to waagent when deploying the machine?

craddm commented 19 hours ago

So, we can add ephemeral0 to the mounts: section of our cloud-init (see here.

[ephemeral0, null] effectively doesn't mount it [ephemeral0, /mnt/resource] mounts it to /mnt/resource. So we could mount it to /mnt/scratch or whatever you'd prefer. Works whether the VM really has an ephemeral disk or not - just an empty folder when there is no real extra disk.

At this point the desktop icons for input, output and shared are not working