tom-butler commented 4 years ago

Current Process:

On each server startup we clone the git repo to a temporary folder, remove some files (including .git), and copy the contents to the notebook user's home directory with rsync.

This approach ensures that git merge conflicts cannot cause issues with the user's files.

However it has some downsides:

Users cannot pull changes easily using the built in git tools
Any files that are removed from the notebooks git repo will be kept around in user folders
Users cannot submit changes through the built in git tools
The repo is not stored in its own folder under the user's home directory, which makes cloning extra repositories awkward, as they end up cloned inside another repo
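
For reference, a rough Python sketch of the process described above, and of why removed files linger. This is not the actual startup script: the function names and the EXCLUDE list are hypothetical, and shutil.copytree stands in for rsync (it merges and overwrites but never deletes, which is my understanding of how the rsync step behaves here).

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# Hypothetical list of entries stripped before copying; the thread only names .git.
EXCLUDE = (".git",)


def copy_into_home(checkout: Path, home: Path, exclude=EXCLUDE) -> None:
    """Copy a checkout over the home directory without deleting anything."""
    shutil.copytree(
        checkout,
        home,
        ignore=shutil.ignore_patterns(*exclude),
        dirs_exist_ok=True,  # merge into the existing directory (Python 3.8+)
    )


def sync_on_startup(repo_url: str, home: Path) -> None:
    """Clone to a temporary folder, then copy the contents into `home`."""
    with tempfile.TemporaryDirectory() as tmp:
        checkout = Path(tmp) / "checkout"
        subprocess.run(["git", "clone", repo_url, str(checkout)], check=True)
        copy_into_home(checkout, home)
```

Because the copy only adds and overwrites, a notebook deleted upstream stays in the user's folder, and because .git is excluded, the built-in git tools have nothing to pull from or push to.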

Alternatives

Investigate if we can change the default user directory to be a folder underneath home (like what happens when you disconnect and return)
Use an opinionated git pull (keep all user changed files)
Use nbgitpuller (Has some problems with merge conflicts)

Kirill888 commented 4 years ago

It should be possible to configure the default path that is displayed in the file browser the first time a user logs in, by generating a default config for the jupyterlab workspace:

https://github.com/pangeo-data/pangeo-example-notebooks/blob/master/binder/jupyterlab-workspace.json#L65

But I would not re-write it afterwards.
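
A minimal sketch of what generating that default could look like, written once and never overwritten. Several things here are assumptions rather than confirmed details: that "file-browser-filebrowser:cwd" is the state key controlling the file browser's starting directory, that the workspace id is "/lab" (it differs between JupyterLab versions), and the helper names themselves, which are hypothetical.

```python
import json
from pathlib import Path


def default_workspace(start_dir: str) -> dict:
    """Build a minimal workspace whose file browser opens in `start_dir`."""
    return {
        "data": {
            # Assumed state key for the file browser's current directory.
            "file-browser-filebrowser:cwd": {"path": start_dir},
        },
        # Assumed workspace id; check it against the JupyterLab version in use.
        "metadata": {"id": "/lab"},
    }


def write_if_missing(target: Path, start_dir: str) -> bool:
    """Write the workspace file once; leave any existing file alone."""
    if target.exists():
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(default_workspace(start_dir), indent=2))
    return True
```

The resulting file could then be loaded with `jupyter lab workspaces import`, though where it should live on the sandbox image is left open here.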

benjimin commented 4 years ago

I would like to see the repo in a subdirectory rather than the root of the user's home directory.

Is there any reason not to?

Motivation

We often have new starters and visiting collaborators, and these new sandbox users face a huge learning curve: simultaneously learning git, unix and python. I think it is bad that the first thing we instruct them to do is create a repo inside another repo, since this adds further inception-esque confusion and is a widely discouraged git practice (i.e. git already interprets repo subdirs in a special way; it is asking for trouble, and complicates learning how git will behave). Already, their beginner mistakes routinely lead to fairly serious git kung fu (e.g. resetting to past states, rewriting objects out of histories, etc.) to un-break everything and recover their work.

I think we should instead encourage a best-practice workflow, that is as close as possible to what we want them to follow on other linux platforms like NCI (avoiding features specific to DEA-sandbox such as touch .nosync -- plenty of new users can't even discriminate between bash, git, python, and DEA-specific syntaxes).

Preference

I think we should have the repo pre-populated in ~/examples/. If it can be, fast forward it on log-in. If the user has dirtied it, leave it for the user to manage. (This is literally the first skill everyone learns with git anyway.)
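
Roughly, the log-in step could look like the sketch below. It is only an illustration: the function names and return strings are hypothetical, and "dirtied" is interpreted as any output from `git status --porcelain`, which also counts untracked files.

```python
import subprocess
from pathlib import Path


def run_git(repo: Path, *args: str) -> str:
    """Run a git command inside `repo` and return its stdout."""
    result = subprocess.run(
        ["git", "-C", str(repo), *args],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def update_examples(repo: Path, git=run_git) -> str:
    """Fast-forward a clean repo; leave a dirty one for the user to manage."""
    if not (repo / ".git").exists():
        return "missing"
    # Any porcelain output means modified, staged, or untracked files.
    if git(repo, "status", "--porcelain").strip():
        return "dirty"
    try:
        git(repo, "pull", "--ff-only")
    except subprocess.CalledProcessError:
        # e.g. local commits have diverged from upstream; do not touch them.
        return "not-fast-forwardable"
    return "updated"
```

The `git` parameter exists only so the decision logic can be exercised without a real repository; on log-in it would be called as `update_examples(Path.home() / "examples")`.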

Instruct all new users to create a different subdirectory, and there establish their own branch repo (but don't do it for them).

I don't even think it is necessary to have jupyter pre-navigate to the examples directory; I see no harm in expecting first-time users to intuitively click "examples". (In fact I think this is better than needing to explain ../, and it also avoids inconvenience on subsequent logons.)

alexgleith commented 4 years ago

I'm in favour of having the notebooks loaded into an ~/examples folder.

I think that having a "first start" README in the root/home folder is a good idea. And we already have some logic for a don't sync flag too (undocumented...).

I think that we must force overwrite the folder and shouldn't copy the .git. The examples folder is for new users, and it should always be clean and work. For folks doing dev, they should self-manage their own space <somewhere else>, for example, I just have a ~/dev/whatever-project folder on the sandboxes where I do actual dev.
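
Something like the following sketch is what I mean, assuming the don't-sync flag is a .nosync file in the home directory (the thread mentions `touch .nosync` but not where it lives, so that location, the folder name, and the function name are all hypothetical).

```python
import shutil
from pathlib import Path


def refresh_examples(checkout: Path, home: Path,
                     folder: str = "examples", flag: str = ".nosync") -> bool:
    """Replace ~/<folder> with a clean copy of `checkout`, minus .git."""
    if (home / flag).exists():
        return False  # the user has opted out of syncing
    target = home / folder
    if target.exists():
        shutil.rmtree(target)  # discard user edits and stale files alike
    shutil.copytree(checkout, target, ignore=shutil.ignore_patterns(".git"))
    return True
```

Everything outside the examples folder is left alone, so dev work elsewhere in the home directory is unaffected.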

I wasn't involved in the decision to pull out the examples into folders in the root of the project, but I'm aware there was a decision there.

robbibt commented 4 years ago

I'd also be happy with a ~/Examples folder, but if we went down that path, we would need to make sure the user is presented with a really nice, simple and easy to follow splash/readme page (preferably including some screenshots) that loads in the JupyterLab window as the first thing they see when their server starts up (I saw a demo of this functionality during the recent ODC hackathon so it should be possible).

This readme would need to walk the user through in baby steps, even to the level of:

"To begin exploring DEA functionality, double click the Examples folder in the file browser"
"To learn how to use DEA for the first time, navigate to the Examples/Beginners_guide folder and double click on 01_Jupyter_notebooks.ipynb to launch your first notebook"
"To start developing your own DEA applications, follow the DEA Notebooks' Guide to using DEA Notebooks with git"

(The readme could also be the place to include the warning that files in Examples will be overwritten automatically)

As long as this was clearly explained and shown to the user at start up, I think it could serve as a nice way to familiarise the user with using the file browsing interface and launching notebooks for the first time.

I'd be happy to work on the readme if we settled on this as an approach.

caitlinadams commented 4 years ago

I agree with a lot of the points made here, and particularly agree with Robbi's point that the user needs to have some clear guidance around how to use the sandbox and the ~/Examples folder. This could potentially even include a description of how the ~/Examples folder works, and a recommendation that they save copies of example notebooks back to somewhere in their home directory if they want to work on them.

As an additional thought, I've been using Amazon SageMaker recently, and they use a Jupyter Lab extension to manage their example notebooks. I think there are some upsides and downsides to this approach, but would be happy to discuss further with anyone that's interested. You can see a bit of a preview of how it works here: https://docs.aws.amazon.com/sagemaker/latest/dg/howitworks-nbexamples.html. One of the main benefits is that the examples are read-only, with a pop-up asking the user if they want to copy the file to their directory.

BexDunn commented 4 years ago

I think a README is an excellent idea. I don't have a strong feeling as to whether we move the sandbox examples to the examples folder or leave them as the top dir - but either way I think we need a README so that users have a good idea of what's what.

We might be able to create read-only notebooks without using SageMaker - https://coding-stream-of-consciousness.com/2018/11/12/read-only-protected-jupyter-notebooks/ - though it looks like it'll take a bit more effort. SageMaker looks interesting - do we know how the pricing compares to notebooks as they are on the sandbox?

caitlinadams commented 4 years ago

Hey @BexDunn -- sorry, my SageMaker link might have been a bit misleading. There's no need to use SageMaker specifically, it's just an example implementation of a Jupyter Lab extension that can handle a collection of example notebooks. The actual extension is here: https://github.com/danielballan/nbexamples

benjimin commented 4 years ago

Would another option be simply to symlink ~/examples to some shared directory (external to /home), where users do not even have sufficient privileges to dirty or mismanage that copy of the repo?

That would also be a place to administer README content that is specific to that sandbox infrastructure, without committing it to the general-purpose notebooks repo. It could remove any need for ongoing management/syncing of user home directories (and associated potential mess/confusion).
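
As a sketch of the symlink idea (the function name and arguments are hypothetical, and making the shared copy read-only is assumed to be handled by its ownership and permissions, outside this snippet):

```python
import os
from pathlib import Path


def link_examples(home: Path, shared: Path, name: str = "examples") -> Path:
    """Point ~/<name> at a shared directory, refusing to clobber real data."""
    link = home / name
    if link.is_symlink():
        if Path(os.readlink(link)) == shared:
            return link  # already pointing at the right place
        link.unlink()  # stale link: replace it
    elif link.exists():
        raise FileExistsError(f"{link} exists and is not a symlink")
    link.symlink_to(shared, target_is_directory=True)
    return link
```

Running this at log-in would be idempotent, and a user who already has a real examples directory keeps it untouched.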

alexgleith commented 4 years ago

I don't think we want to stop people from being able to write to it, @benjimin. It makes the notebooks do weird things if they're read only, I think.

I like @caitlinadams' suggestion of using the nbexamples process. Either that or just sticking with the current process, but doing the sync into an examples folder.

benjimin commented 4 years ago

Also, a potential security motivation is that any files which should not be committed to any git history still get stored somewhere inside the working directory of a repo (i.e. training users to invite mistakes in sensitive data management).

GeoscienceAustralia / dea-sandbox

Re-work how the notebooks repo is pulled #79
