alan-turing-institute / data-safe-haven

https://data-safe-haven.readthedocs.io
BSD 3-Clause "New" or "Revised" License
61 stars 15 forks source link

Path clash with Azure local storage #2221

Closed craddm closed 1 month ago

craddm commented 1 month ago

:white_check_mark: Checklist

:computer: System information

:package: Packages

List of packages ```noen acme==2.10.0 annotated-types==0.7.0 appdirs==1.4.4 Arpeggio==2.0.2 attrs==24.2.0 azure-common==1.1.28 azure-core==1.31.0 azure-identity==1.18.0 azure-keyvault-certificates==4.8.0 azure-keyvault-keys==4.9.0 azure-keyvault-secrets==4.8.0 azure-mgmt-compute==33.0.0 azure-mgmt-containerinstance==10.1.0 azure-mgmt-core==1.4.0 azure-mgmt-dns==8.1.0 azure-mgmt-keyvault==10.3.1 azure-mgmt-msi==7.0.0 azure-mgmt-rdbms==10.1.0 azure-mgmt-resource==23.1.1 azure-mgmt-storage==21.2.1 azure-storage-blob==12.23.1 azure-storage-file-datalake==12.17.0 azure-storage-file-share==12.18.0 certifi==2024.8.30 cffi==1.17.1 charset-normalizer==3.3.2 chevron==0.14.0 click==8.1.7 cryptography==43.0.1 -e git+https://github.com/craddm/data-safe-haven.git@a6a6993ad7bbcc02e8f05629fe0fd5ab7154a900#egg=data_safe_haven debugpy==1.8.6 dill==0.3.9 dnspython==2.6.1 fqdn==1.5.1 grpcio==1.60.2 idna==3.10 isodate==0.6.1 josepy==1.14.0 markdown-it-py==3.0.0 mdurl==0.1.2 msal==1.31.0 msal-extensions==1.2.0 msrest==0.7.1 oauthlib==3.2.2 parver==0.5 portalocker==2.10.1 protobuf==4.25.5 psycopg==3.2.3 pulumi==3.134.1 pulumi_azure_native==2.63.0 pulumi_random==4.16.6 pycparser==2.22 pydantic==2.9.2 pydantic_core==2.23.4 Pygments==2.18.0 PyJWT==2.9.0 pyOpenSSL==24.2.1 pyRFC3339==1.1 pytz==2024.2 PyYAML==6.0.2 requests==2.32.3 requests-oauthlib==2.0.0 rich==13.8.1 semver==2.13.0 setuptools==75.1.0 shellingham==1.5.4 simple_acme_dns==3.1.0 six==1.16.0 typer==0.12.5 typing_extensions==4.12.2 urllib3==2.2.3 validators==0.28.3 websocket-client==1.8.0 ```

:no_entry_sign: Describe the problem

Virtual machine sizes with local storage (mounted at /mnt) clash with NFS mounts defined in /etc/fstab.

:deciduous_tree: Log messages

Relevant log messages ![image](https://github.com/user-attachments/assets/6a47c817-c91b-467a-959d-619a9806b662)

:recycle: To reproduce

Deploy a workspace using a VM size with local storage.

jemrobinson commented 1 month ago

I saw the same problem when we first tested #2092. @JimMadge : can you take a look?

craddm commented 1 month ago

Same missing setting you report there - ldap missing from group and passwd image

JimMadge commented 1 month ago

Looks like the problem was you already have a disk mounted at /mnt. What size are you using?

jemrobinson commented 1 month ago

@JimMadge : Is something automatically mounted at /mnt? Possibly the (badly documented) temp disk that lives on the same physical machine as the VM?

This might also explain why mounting at /shared was fine but /mnt/shared causes a problem.

craddm commented 1 month ago

Standard_D2s_v3

btw I literally just got it running by creating the /mnt subdirectories from the console. Maybe adding the -m flag to mount -fav, so it creates the directories if they don't exist, will fix it?

JimMadge commented 1 month ago

Hmm, I'm not sure. That size shouldn't have a temporary disk.

You can see in the fstab output though that there is device /dev/disk/cloud/azure_resource-part-1 at /mnt. That doesn't look right to me. I haven't seen this problem in any deployments I've done recently.

jemrobinson commented 1 month ago

Is it worth abandoning /mnt and putting our drives somewhere else?

Or alternatively, explicitly adding something like the following to /etc/fstab

/dev/disk/cloud/azure_resource-part1    /mnt/tmp    auto    defaults,nofail,_netdev 0   2
JimMadge commented 1 month ago

I'd rather understand what is happening here. It isn't a mount that we have defined and I'm not sure what it is.

If it is something Azure is adding we could make sure to remove entries like that from fstab (if that is the problem here).

jemrobinson commented 1 month ago

This end of this section says "For Linux VMs, the temporary disk is /dev/sdb1 and is mounted at /mnt/resource or /mnt.", so I'm not sure we can disable this.

According to this outdated answer it might be possible to change in waagent.conf. However, I'm not confident that we could use cloud-init to change that file since waagent is used to run cloud-init.

Otherwise, we should see whether adding an explicit mount point for /dev/disk/cloud/azure_resource-part1 will fix it (as above).

JimMadge commented 1 month ago

Oh wait, the Dvs3 series do have temporary disks.

Is there a reason to use such an old offering?

jemrobinson commented 1 month ago

I think we should be robust against the use of VM SKUs that have temporary disks (which is most of them) regardless of whether Dvs3 is a sensible SKU to use.

craddm commented 1 month ago

It's just the one that was our default on the old codebase, so I still use it as a default out of habit, but as @jemrobinson, should be robust to stuff like this IMO. (I think the GPU VMs we currently recommend -e.g. Stanard_NC6s_v3 also have local temp disks?)

JimMadge commented 1 month ago

(which is most of them)

I'm not sure if that is true, it tends to be the older offerings. Looks like the high performance sizes and those with accelerators have local disks though, sometimes NVMe.

jemrobinson commented 1 month ago

From here

Most VMs contain a temporary disk, which is not a managed disk.

JimMadge commented 1 month ago

I'm sure it has been true but the trend in the current and new general purpose sizes is to not include local storage and provide 'd' series variants for those that want it. Noting that line has been in the docs for quite a few years. That said, I don't think I want to enumerate all the available sizes or sizes*availability 😅.

It does seem common on sizes with accelerators though. That makes sense as the users would likely want fast storage, and the physical nodes in the data centre are more likely to have onboard storage.

We'll need to fix this to enable GPU/FPGA sizes.

craddm commented 1 month ago

Adding -o X-mount.mkdir fixed this on the Standard_D2s_v3 at least, although I'm still unable to login

e.g. while (! mountpoint -q /mnt/input); do sleep 5; mount -o X-mount.mkdir /mnt/input; done

JimMadge commented 1 month ago

What does that option do?

I think I'd rather just not mount local storage at /mnt so that we are consistent with machine with and without local storage.

craddm commented 1 month ago

It creates the directory if it doesn't exist

jemrobinson commented 1 month ago

Alias --mkdir might be more clear?

craddm commented 1 month ago

Alias --mkdir might be more clear?

it would be, but for some reason that's not a supported option on the machine I tested

jemrobinson commented 1 month ago

This doesn't really deal with the question of what happens in these scenarios:

OR

In either case, we'd lose access to /mnt/ingress. I think it's safer/better to overwrite where the temp disk mounts or to move our mounts outside /mnt.

JimMadge commented 1 month ago

I think the best solution would be that we standardise where our mounts and temp disk(s) are in all cases.

I like having our mounts at /mnt, it feels idiomatic. I would put local disks at something like /scratch, /var/scratch, /mnt/scratch.

We might need some logic like "if /dev/disk/cloud/... then ..."

jemrobinson commented 1 month ago

Agreed that /var/scratch or /mnt/scratch makes sense for the temp disk. Have you tried the fstab line I put above?

craddm commented 1 month ago

This doesn't really deal with the question of what happens in these scenarios:

  • temp disk mounted at /mnt
  • data mounted at /mnt/ingress
  • temp disk unmounted

OR

  • data mounted at /mnt/ingress
  • temp disk mounted at /mnt

In either case, we'd lose access to /mnt/ingress. I think it's safer/better to overwrite where the temp disk mounts or to move our mounts outside /mnt.

Looks like it's possible to change the location of the temp disk in /etc/waagent.conf:

There's a field ResourceDisk.MountPoint which is set to /mnt on our VMs

https://learn.microsoft.com/en-us/azure/virtual-machines/extensions/agent-linux

jemrobinson commented 1 month ago

We might be going in circles here, but since waagent is used to run cloud-init, we probably can't use a cloud-init command to change the configuration.

JimMadge commented 1 month ago

Can we give arguments to waagent when deploying the machine?

craddm commented 1 month ago

So, we can add ephemeral0 to the mounts: section of our cloud-init (see here.

[ephemeral0, null] effectively doesn't mount it [ephemeral0, /mnt/resource] mounts it to /mnt/resource. So we could mount it to /mnt/scratch or whatever you'd prefer. Works whether the VM really has an ephemeral disk or not - just an empty folder when there is no real extra disk.

At this point the desktop icons for input, output and shared are not working