det-lab / jupyterhub-deploy-kubernetes-jetstream

CDMS JupyterHub deployment on XSEDE Jetstream

/cvmfs/data read-only when logged in through JupyterHub #21

Closed: pibion closed this issue 4 years ago

pibion commented 4 years ago

We're trying to use DataCat (http://titus.stanford.edu:8080/git/summary/?r=DataHandling/DataCat.git) to grab data files and store them on /cvmfs/data as needed.

However, this mode of copying data requires users logged in to JupyterHub to have write privileges in /cvmfs/data, which they currently don't have. @thathayhaykid will follow up with a way to reproduce the issue.

@bloer, @zonca, do you have any thoughts on ways to handle this? Maybe we could update DataCat to connect to SLAC and run the copy from there. People would have to make sure they've got an ssh key and config set up properly, but that's maybe reasonable.

thathayhaykid commented 4 years ago

The linked program is what I wrote to try to save the data to the default directory, and the error shown is the one I got. The most important line is dc.fetch(downloadThis), which is what downloads the requested dataset. Please let me know if you have any questions! Thank you so much!

Program to run on XSEDE
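
For reference, a minimal hedged sketch of that kind of fetch call. The only call taken from the program above is dc.fetch(downloadThis); the client object, the wrapper function, and the dataset identifier here are illustrative placeholders, not the actual script:

```python
# Hedged sketch only: `dc` stands for an already-configured DataCat client
# (its construction is not shown in this thread), and the dataset identifier
# is a placeholder. dc.fetch() is the call referenced in the comment above.

def fetch_dataset(dc, dataset_id):
    """Ask the DataCat client to download `dataset_id`.

    The client writes into the shared /cvmfs/data area (the default directory
    mentioned above), which is why the fetch currently fails with a
    permission error when run from JupyterHub.
    """
    dc.fetch(dataset_id)

# Example usage (placeholder identifier):
# fetch_dataset(dc, "some/dataset/path")
```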

bloer commented 4 years ago

@pibion

Maybe we could update DataCat to connect to SLAC and run the copy from there.

I don't understand what you mean here. Users still need to put the data somewhere writeable, unless you mean to open up an sshfs mount or something.

The intended functioning is that users can get data on demand, but it's saved to a large common disk so that it only has to be downloaded once, even if another user wants it.
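
As a rough sketch of that download-once pattern (illustrative only: the helper name, the fetch callable, and the layout under /cvmfs/data are assumptions, not the actual DataCat code):

```python
import os

# Sketch of the "common cache" behavior described above; all names here are
# illustrative, not the real DataCat client code.
SHARED_DATA_DIR = "/cvmfs/data"  # large common disk shared by all users

def get_file(relative_path, fetch_remote):
    """Return a local path for `relative_path`, downloading it only if no
    user has already fetched it into the shared disk.

    `fetch_remote(relative_path, destination)` is a placeholder for whatever
    actually transfers the file (e.g. the DataCat client's fetch).
    """
    local_path = os.path.join(SHARED_DATA_DIR, relative_path)
    if not os.path.exists(local_path):
        # The first user to ask pays the download cost; this is the step
        # that needs write access to /cvmfs/data.
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        fetch_remote(relative_path, local_path)
    return local_path
```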

There are a few options:

1. Pre-download the data that you want and register it with an XSEDE site location. The path can be anywhere.
2. Pre-download the data to /cvmfs/data/. No extra steps needed.
3. Users download to their home directory (probably don't want that) or to some per-node scratch space (in which case it might have to be re-downloaded every time it's wanted).

With (1) and/or (2) users can't get any data on demand, only already-transferred files. (1) and (2) can be used together, or (1) and (3), but right now you could not do (2) and (3) together. (The client can find data with a registered site path or in the target download location, but doesn't support multiple search roots).

bloer commented 4 years ago

I'm getting ready to publish a new software release...is there a better directory to use?

zonca commented 4 years ago

I think I should be able to configure /cvmfs/data to be writable, but then any user could wipe it out. Is that OK, @pibion?

bloer commented 4 years ago

@zonca I think it should be OK; it would just mean re-downloading the on-demand data. You could also set the sticky bit on /cvmfs/data. The only problem I see there is that if someone aborts a download and doesn't clean up the partial files, someone would have to go in with root privileges to clean them up afterward.
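
For concreteness, setting the sticky bit on the shared directory amounts to the equivalent of chmod 1777 (like /tmp). A minimal sketch using Python's standard library; this would have to run as root on whatever owns the mount:

```python
import os
import stat

# Shared data directory from this thread.
data_dir = "/cvmfs/data"

# Mode 1777: everyone can create files, but the sticky bit means only a
# file's owner (or root) can delete or rename entries in the directory.
mode = stat.S_IRWXU | stat.S_IRWXG | stat.S_IRWXO | stat.S_ISVTX
os.chmod(data_dir, mode)

# Confirm the sticky bit is set.
assert os.stat(data_dir).st_mode & stat.S_ISVTX
```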

pibion commented 4 years ago

@zonca I agree with @bloer, it's only an inconvenience if someone wipes it out. But I'm not expecting that to happen often, so I think the data tool being able to grab data on demand is worth the risk.

Do we have ACL permissions? Is it possible for people in the SuperCDMS gitlab group to be in an ACL group?
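
If POSIX ACLs are supported on that mount, granting a specific group write access could look something like the sketch below. The group name "supercdms" is hypothetical (it would be whatever Unix group the SuperCDMS GitLab users map to on the nodes), and this just shells out to the standard setfacl tool:

```python
import subprocess

# Hypothetical group name and the shared directory from this thread.
group = "supercdms"
data_dir = "/cvmfs/data"

# Grant the group read/write/execute on the directory itself...
subprocess.run(["setfacl", "-m", f"g:{group}:rwx", data_dir], check=True)

# ...and make that the default ACL so new files and subdirectories inherit it.
subprocess.run(["setfacl", "-d", "-m", f"g:{group}:rwx", data_dir], check=True)
```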

@bloer would someone have to have root privileges to clean up the data?

bloer commented 4 years ago

@pibion If the sticky bit is set, users can only modify/delete their own files, no one else's.

zonca commented 4 years ago

see https://github.com/det-lab/jupyterhub-deploy-kubernetes-jetstream/pull/23