Open AlexAxthelm opened 8 months ago
Next steps: run a spike and set up a demonstration architecture to explore how much of our current system actually relies on a local filesystem (or something approximating one via mounts), and what can be moved to working with remote files.
Overall, this will probably make a lot of the configuration of our cloud resources easier and more reliable, since rather than pointing to a local reference, we'll be pointing to URLs.
Log messages like
exporting file to /mnt/rawdata/foo
will become things like
exporting file to pacta.blob.core.windows.net/rawdata/foo
and similarly our configs can point to the same.
The tricky part is going to be authentication (as always). I don't know if the simple URLs will work, or if we'll need to put an SAS in there somehow.
So ELI5: rather than mounting anything in, due to the permissions awkwardness, you want to read data directly from an RO URL? And hopefully on the code side, all that would need to change is the path specification of the root file-storage? + some authentication handling?
due to the permissions awkwardness, you want to read data directly from an RO URL? and hopefully on the code side, all that would need to change is the path specification of the root file-storage?
I need to experiment a bit, but in theory, yeah. They could be read/write URLs (including to paths that don't exist yet). So instead of a code block that looks like this:
outputs_dir <- file.path("/mnt", "foo", "2022Q4")
mtcars_file <- file.path(outputs_dir, "mtcars.rds")
saveRDS(mtcars, mtcars_file)
we might have something like this (not tested):
outputs_destination <- file.path("pacta.blob.core.windows.net", "foo", "2022Q4")
mtcars_file <- file.path(outputs_destination, "mtcars.rds")
saveRDS(mtcars, mtcars_file)
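One caveat on the untested snippet: base R's saveRDS() treats a character string as a local file name, so a remote destination would need an explicit upload step. Below is a rough sketch of how the calling code could stay the same while only the storage root changes. storage_path(), is_remote(), save_rds_to() and the upload argument are all hypothetical names made up for illustration, not functions from any existing package, and the actual upload mechanism is deliberately left as a function supplied by the caller.

```r
# Hypothetical helper: join a storage root (local directory or URL) with
# further path components, tolerating a trailing slash on the root.
storage_path <- function(root, ...) {
  parts <- c(sub("/+$", "", root), ...)
  paste(parts, collapse = "/")
}

# Hypothetical helper: treat anything starting with http(s):// as remote.
is_remote <- function(path) {
  grepl("^https?://", path)
}

# Hypothetical helper: save an object as RDS to a local path, or write it to
# a temporary file and hand that file to a caller-supplied upload function.
save_rds_to <- function(object, destination, upload = NULL) {
  if (!is_remote(destination)) {
    dir.create(dirname(destination), recursive = TRUE, showWarnings = FALSE)
    saveRDS(object, destination)
    return(invisible(destination))
  }
  if (is.null(upload)) {
    stop("an upload function is required for remote destinations")
  }
  tmp <- tempfile(fileext = ".rds")
  on.exit(unlink(tmp), add = TRUE)
  saveRDS(object, tmp)
  upload(tmp, destination)
  invisible(destination)
}

# Local usage: only the root would differ between a mount and a URL.
outputs_root <- file.path(tempdir(), "outputs")
mtcars_file <- storage_path(outputs_root, "foo", "2022Q4", "mtcars.rds")
save_rds_to(mtcars, mtcars_file)
```

With a design like this, switching from a mount to blob storage would mean changing outputs_root in the config and passing an upload function, while the rest of the pipeline code stays untouched.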
- some authentication handling
In theory, the auth shouldn't need to live in the codebase, but rather in the deployment mechanisms. If we're running code in a container, and assign that container an identity with appropriate permissions (read-only, or read-write), then when anything on that system tries to read or write those files, Azure handles the auth.
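If it turns out that plain URLs are not enough and a SAS is needed, one way to still keep the secret out of the codebase is to have the deployment inject it and append it to the URL at runtime. A minimal sketch, assuming the token arrives in an environment variable; the variable name PACTA_STORAGE_SAS and the function authorize_url() are hypothetical. A SAS is passed as the query string of the blob URL, which is what the helper builds. When no token is set, the URL is returned unchanged, which is the case where an assigned identity would be expected to handle access.

```r
# Hypothetical helper: append a SAS token (if the deployment provides one)
# to a blob URL. The environment variable name is an assumption.
authorize_url <- function(url, sas = Sys.getenv("PACTA_STORAGE_SAS", unset = "")) {
  # No token configured: leave the URL as-is.
  if (!nzchar(sas)) {
    return(url)
  }
  # Accept tokens with or without a leading "?".
  sas <- sub("^\\?", "", sas)
  # Use "&" if the URL already carries a query string.
  separator <- if (grepl("?", url, fixed = TRUE)) "&" else "?"
  paste0(url, separator, sas)
}
```

Configs and log messages would then only ever contain the bare URL, and the token would live in the container's environment.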
Makes sense!
I'm discovering an issue with Azure File Shares, in that when mounting via SMB to a Linux OS (which we use exclusively in our cloud VMs and Containers), authentication is handled with a Storage Account key, rather than a Shared Access Signature.
Storage Account Keys have the unfortunate problems of being:
The second problem can be somewhat mitigated by being careful when mounting a file share by specifying permissions (setting file_mode=0555,dir_mode=0555 in the sudo mount -t cifs command), but I don't want to rely on this as a long-term solution.
cc @cjyetman @hodie @jdhoffa