Enhance user experience around cloud storage via better integration with jupyterlab and notebook server environment

rabernat commented 2 years ago

Context

In https://github.com/2i2c-org/docs/pull/138 we started to write some user-facing documentation on how to work with storage in the cloud. It now appears at https://docs.2i2c.org/en/latest/user/storage.html. Some relevant tidbits:

> Your hub lives in the cloud. The preferred way to store data in the cloud is using cloud object storage, such as Amazon S3 or Google Cloud Storage. ... From a user perspective, the main challenge of working with object storage is the need to use more specialized tools, rather than just simple files / filenames, to manage data.

In 2i2c-org/features#9 we are tracking the idea that hub admins should be able to create cloud storage buckets for hub users, possibly with group-level credentials.

In this issue, I am proposing several UI / UX enhancements that will empower users to take better advantage of cloud storage. The impact of this will be to make our users more effective "cloud native" data scientists.

Proposal

We should do the following:

[ ] Install the jupyterlab-s3-browser extension on notebook images, which permits users to browse buckets. In https://github.com/pangeo-data/pangeo-docker-images/issues/310 we are trying to add this to the pangeo-notebook image.
[ ] Auto-populate the credentials needed to access buckets in the user environment
[ ] Auto-populate a list of buckets the user should be able to browse. (Probably requires https://github.com/IBM/jupyterlab-s3-browser/issues/18.)
[ ] Document this feature in 2i2c docs
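
As a rough illustration of the two "auto-populate" items above, here is a minimal sketch of how a hub might assemble environment variables for a user's server. `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are the standard variables read by AWS SDKs; the function name `storage_env` and the `HUB_BUCKETS` variable are assumptions made up for this example, not something the hub or jupyterlab-s3-browser currently defines.

```python
import json


def storage_env(buckets, access_key_id, secret_access_key):
    """Build environment variables exposing storage access to a user server.

    `buckets` is an iterable of bucket names the user may browse.
    HUB_BUCKETS is a hypothetical variable name used only in this sketch.
    """
    return {
        # Standard variable names picked up by AWS SDKs and tools.
        "AWS_ACCESS_KEY_ID": access_key_id,
        "AWS_SECRET_ACCESS_KEY": secret_access_key,
        # De-duplicated, sorted bucket list encoded as a JSON array.
        "HUB_BUCKETS": json.dumps(sorted(set(buckets))),
    }
```

A dictionary like this could presumably be merged into the spawner's `environment` setting so that it is present when the user's server starts. How the browser extension would consume the bucket list is still an open question.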

Updates and actions

No response

rabernat commented 2 years ago

Now that https://github.com/pangeo-data/pangeo-docker-images/issues/310 is merged, we could be using Pangeo images with the s3 browser installed.

sgibson91 commented 2 years ago

@rabernat what's the tag for that image? I just enabled an action that will auto-bump the pangeo images for the three pangeo-like hubs. It creates PRs like this: https://github.com/2i2c-org/infrastructure/pull/1407

rabernat commented 2 years ago

> what's the tag for that image?

There hasn't been a release yet. They usually happen about once a week.

> I just enabled an action that will auto-bump the pangeo images for the three pangeo-like hubs

It's great to see work happening on this important topic! 🚀 However, I have mixed feelings about the idea of automatically updating the image. The stack changes fast enough that this can break user code, leading to serious frustration. In my experience, users absolutely hate it when code that was working one day stops working the next, for reasons that are not the user's fault. This has definitely happened in the past when I manually updated the image.

To balance the desire to be able to use the latest image with the need to keep code reproducible, I think it is crucial that the Pangeo hubs have the ability to allow the users to select any of the past images from the spawner. This would mitigate the problem of breaking user code. Without such a feature, I would have to vote NO on automatically updating the images. Even better would be moving in the "binder for everything" direction, where we completely decouple the image from the profiles, and force the user to always explicitly specify an image.
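
To make the "select any of the past images from the spawner" idea concrete, here is a minimal sketch that turns a list of image tags into a KubeSpawner-style `profile_list`. The helper name `image_profiles` and the tags in the usage note are hypothetical; only the `display_name`, `default`, and `kubespawner_override` keys come from KubeSpawner's profile format.

```python
def image_profiles(image, tags):
    """Return one spawner profile per image tag.

    The first tag is assumed to be the newest and is marked as the default,
    so older tags stay selectable after an automatic bump.
    """
    profiles = []
    for index, tag in enumerate(tags):
        full_name = f"{image}:{tag}"
        profiles.append(
            {
                "display_name": full_name,
                "default": index == 0,
                "kubespawner_override": {"image": full_name},
            }
        )
    return profiles
```

For example, `image_profiles("pangeo/pangeo-notebook", ["2022.06.02", "2022.05.18"])` (tags invented for illustration) would yield two profiles, with the first preselected. An auto-bump action could then prepend the new tag instead of replacing the old one.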

Is there an issue where we can discuss this specifically?

sgibson91 commented 2 years ago

https://github.com/2i2c-org/infrastructure/issues/1338

sgibson91 commented 2 years ago

I think the work on the list is ongoing in https://github.com/2i2c-org/infrastructure/issues/1253, but this action workflow will be useful for things beyond pangeo images. For example, it can also replace this kind of manual PR: https://github.com/2i2c-org/infrastructure/pull/1403 (for minor releases that don't need as much babysitting). We will also use it to keep the version of repo2docker that a BinderHub uses up to date.
