SwissDataScienceCenter / renku-python

A Python library for the Renku collaborative data science platform.
https://renku-python.readthedocs.io/
Apache License 2.0

enable FUSE mounting remote storage - pre-study #2912

Closed rokroskar closed 2 years ago

rokroskar commented 2 years ago

To better understand our options when it comes to enabling access to remote storage, we should understand the limitations of using FUSE in various scenarios:

Specifically, we need to understand whether there are circumstances under which FUSE may simply not be an option for our users, e.g. under:

olevski commented 2 years ago

@rokroskar for testing this out on Mac/Linux/Windows, is it enough to spin up VMs with each OS and see whether I can mount S3, SFTP, and NFS drives on each using the appropriate FUSE-based utility? Or are you expecting that we should test how this would potentially work from within the Renku CLI in more detail?
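For reference, the backends mentioned above map to commonly used FUSE utilities, sketched here as a small shell helper. The tool names are real projects, but the helper itself and the example invocations in the comments are hypothetical placeholders, not part of any plan here:

```shell
# Hypothetical helper mapping each backend to the FUSE tool we would test with.
# Hosts, buckets, and mount points in the comments are placeholders.
backend_tool() {
  case "$1" in
    sftp) echo "sshfs"  ;;  # e.g. sshfs user@host:/remote/dir /mnt/sftp
    s3)   echo "s3fs"   ;;  # e.g. s3fs my-bucket /mnt/s3
    nfs)  echo "kernel" ;;  # NFS needs no FUSE: mount -t nfs host:/export /mnt/nfs
    *)    echo "rclone" ;;  # rclone mount covers S3, SFTP, and many other backends
  esac
}
```

rclone is the catch-all here because `rclone mount` exposes all of these protocols through the same FUSE layer.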

As for the HPC cluster, I guess we would do the same. But spinning up an HPC cluster is a lot more complicated, so it would be nice to just get access to an existing one.

rokroskar commented 2 years ago

I think VMs are fine. Just keep track of any potentially invasive user-access issues that need to be resolved to make FUSE work.

We do have access to HPC clusters at EPFL and ETH. For ETH, anyone with a NETHZ account can log in to euler.ethz.ch (need to be on the VPN).

olevski commented 2 years ago

Ok cool. Here is a list, then, of publicly available buckets, NFS drives, and SFTP servers that I will use to test:

olevski commented 2 years ago

NFS server setup

Dockerfile

Please note that you should first create a text file called test-nfs-share-file.txt; this is the file that will be shared.

```dockerfile
FROM erichough/nfs-server
COPY test-nfs-share-file.txt /nfs_exports/
VOLUME /nfs_exports
ENV NFS_EXPORT_0='/nfs_exports    *(rw,no_subtree_check,anonuid=1001,anongid=1001)'
```

Build and run commands

```shell
docker build -t test-nfs-server:0.0.1 .
docker run -ti --rm --privileged --name nfs_server test-nfs-server:0.0.1
```

Mounting

  1. Find the IP of the docker container by running `docker inspect <container_id>` and looking under `Networks.IPAddress`
  2. Run `mount -t nfs <container_ip>:/nfs_exports /local_folder_to_mount`

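The two steps above can be combined into a short shell sketch. The container name comes from the run command above; the mount point is a placeholder, and the mount itself needs root, so the docker/mount parts are shown as comments:

```shell
# Build the "<ip>:/export" mount source string for the NFS container (sketch).
nfs_source() {
  printf '%s:/nfs_exports' "$1"
}

# Illustration only -- requires docker and root privileges:
#   NFS_IP=$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' nfs_server)
#   sudo mount -t nfs "$(nfs_source "$NFS_IP")" /local_folder_to_mount
```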
olevski commented 2 years ago

Linux observations

NFS:

FTP:

S3:

olevski commented 2 years ago

Windows

Since we do not support "regular" Windows I have decided to not pursue this further.

One slight problem with rclone is that it requires its configuration to be written to a file before it can be used. If we decide to use rclone, I do not think this is a show-stopper, but rather an annoyance.
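For illustration, the configuration file rclone expects looks like this. The remote name and path are placeholders, but `type`, `provider`, and `env_auth` are real rclone config keys:

```ini
# ~/.config/rclone/rclone.conf -- "mybucket" is a placeholder remote name
[mybucket]
type = s3
provider = AWS
env_auth = true
```

A remote defined this way is then mounted with something like `rclone mount mybucket:bucket-name /mnt/point`.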

WSL2

WSL1

olevski commented 2 years ago

Mac

NFS:

FTP:

S3:

The biggest problem with Mac is that macFUSE is not open source, and Homebrew has stopped maintaining all formulae that depend on it. That means that installing tools to mount things through FUSE on Mac is not that simple and will probably not improve in the near future.

In addition, installing macFUSE can be a pain. I currently have it in some weird broken state that I cannot fix, no matter how many times I delete and reinstall it. I am not sure whether this is simply me being extremely unlucky or whether other people have similar experiences. The weirdest part is that after you install macFUSE I think you need to restart your computer, and right after it comes back up you have to go into the security settings and approve its use. If you miss this step, FUSE will not work.

olevski commented 2 years ago

In summary:

After doing this exercise, I am still very worried that adding this feature would require us to troubleshoot rclone and FUSE installations for different users across many different OSes.

olevski commented 2 years ago

@rokroskar let me know what you think. The TLDR is right above this comment ☝

rokroskar commented 2 years ago

Thanks for this summary @olevski ... it definitely looks like relying on FUSE in user installations is going to be a huge liability. Another thing that occurred to me: most users probably already have their networked drive mounted by some other means; a lab, for example, would probably have its NAS available via an SMB share or something like that. In those cases we don't need to handle the mounting itself, only keep track of where the data is mounted so we can map it to what the given project uses. In the hosted sessions we can probably take care of the mounting and the book-keeping automatically.

So for working locally it should be

  1. copy the data, if possible
  2. user provides an existing location of the data (in the case of networked drives)
  3. ? not sure what other option there is - we could "fuse mount at your own risk"?
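Option 2 above could start as simply as verifying that the user-provided path is actually a mount point before recording it. A minimal sketch using `mountpoint` from util-linux; the function name is made up for illustration and is not Renku CLI API:

```shell
# Return "mounted" if the given path is a mount point, "not-mounted" otherwise.
# `mountpoint -q` suppresses output; the exit code carries the answer.
data_location_status() {
  if mountpoint -q "$1"; then
    echo "mounted"
  else
    echo "not-mounted"
  fi
}
```

A tool tracking remote data this way never touches FUSE itself; it only records where an already-mounted location lives.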