I think we had a good test run of workshop support this quarter, supporting the Summer School on Inverse Modeling of Greenhouse Gases 2024 workshop (SSIM-GHG).
We were in touch with Sourish Basu early on and were able to address the specific needs of the workshop, and I think this experience can greatly help streamline our processes for supporting future workshops. Let's find a good place to document the specifics of workshop support, but I'll outline here broadly the things we did, as well as things we could possibly do better in the future.
It was extremely helpful that Sourish started testing the hub infrastructure well in advance and was able to articulate their specific computing needs, and we had sufficient time to test the specific profiles we created for the workshop.
One of the initial concerns was that there would be many students concurrently reading large files over the EFS share. We discussed the various options, including having students download files from S3. In the end, we settled on having a shared folder over EFS, and having students copy files to a local /tmp/ folder when they needed faster access.
It was helpful to know, for example, that students would have enough space in their local /tmp/ folders to fit the files needed, and that they only needed to read files from the shared folder. That let us set up a system where admins put files into a shared folder, and students copy those files into their local /tmp/ directories for faster access.
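For future reference, the copy-to-/tmp pattern amounts to something like the sketch below. The directory names and file name are illustrative, not the actual workshop data layout:

```python
# Sketch: copy a file from the read-only shared EFS folder into local /tmp the
# first time it is needed, then read the fast local copy on subsequent accesses.
import shutil
from pathlib import Path

SHARED_DIR = Path.home() / "shared" / "ssim-ghg-data"  # admin-managed, read-only (hypothetical path)
LOCAL_CACHE = Path("/tmp/ssim-ghg-data")                # node-local scratch space

def local_copy(filename: str) -> Path:
    """Return a local path for `filename`, copying it from ~/shared if needed."""
    LOCAL_CACHE.mkdir(parents=True, exist_ok=True)
    src = SHARED_DIR / filename
    dst = LOCAL_CACHE / filename
    if not dst.exists():
        shutil.copy2(src, dst)  # one slow EFS read, then fast local access
    return dst

# e.g. data_path = local_copy("fluxes_2020.nc")  # hypothetical file name
```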
It does seem like the speed of the EFS share was a bit of a bottleneck during the workshop, and we received this feedback:
First, reading from ~/shared was definitely slow when multiple people were trying to read the same file. To be specific, reading a ~2 GB file when no one else was reading took ~15 s, while it took ~2 minutes when ~20 people were trying to read the file at once (and that’s a long waiting time in front of a class of students). So for future workshops, we might want to think of an alternative read mechanism.
This is a known problem with EFS. What we could have done better here is more thorough testing before the workshop, so that expectations around data transfer speeds were clearer. In the future, we want to explore alternatives to EFS, and this use case would be good to keep in mind.
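For future workshops, even a crude single-reader throughput check like the sketch below would help set expectations ahead of time (placeholder path; run it from several user servers at once to see contention effects):

```python
# Sketch: time a sequential read of a large file on the shared volume.
import time
from pathlib import Path

test_file = Path.home() / "shared" / "sample-2gb-file.nc"  # placeholder path

start = time.perf_counter()
n_bytes = 0
with open(test_file, "rb") as f:
    while chunk := f.read(64 * 1024 * 1024):  # read in 64 MB chunks
        n_bytes += len(chunk)
elapsed = time.perf_counter() - start

print(f"Read {n_bytes / 1e9:.2f} GB in {elapsed:.1f} s "
      f"({n_bytes / 1e6 / elapsed:.0f} MB/s)")
```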
It was very helpful to have Sourish think deeply about the types of operations they wanted students to perform during the workshop, and run some tests to help determine container sizing with regard to RAM and CPU allocations. It turned out that the heavier operations for the workshop were CPU-constrained more than RAM-constrained, while the existing underlying node pool was using instances that were memory-optimized rather than CPU-optimized.
Based on these specific needs, we were able to configure a custom node pool with compute-optimized instances, and create profiles just for the workshop that set the default resource requirements. By creating custom profiles specifically for the workshop, and restricting those profile options to users in the workshop group, we were able to reduce confusion in environment setup for the workshop students.
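For the record, at the KubeSpawner level the workshop profile boils down to something like the sketch below. The real configuration lives in the 2i2c infrastructure repo's YAML (PR linked further down); the image tag, node selector label, and resource numbers here are illustrative only:

```python
# jupyterhub_config.py sketch of a workshop profile pinned to a
# compute-optimized node pool with default resource requirements.
c.KubeSpawner.profile_list = [
    {
        "display_name": "SSIM-GHG workshop (Python)",
        "description": "Compute-optimized nodes with the workshop Python image",
        "default": True,
        "kubespawner_override": {
            "image": "public.ecr.aws/example/ssim-ghg-python:latest",  # placeholder tag
            "node_selector": {"node.kubernetes.io/instance-type": "c5.2xlarge"},  # illustrative
            "cpu_guarantee": 2,
            "cpu_limit": 4,
            "mem_guarantee": "4G",
            "mem_limit": "8G",
        },
    },
]
```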
Sourish had some requests for custom packages in both the Python and R images used for the workshop. As part of #16, we had worked on simplifying our image build and publishing setup, and the workshop was a good real-world use case for creating custom images and providing a good template for similar use cases in the future.
We created https://github.com/NASA-IMPACT/ssim-ghg-workshop-2024-python-image/ and https://github.com/NASA-IMPACT/ssim-ghg-workshop-2024-r-image for the Python and R images, respectively. Both use the same base images that we use for the default Python and R images on VEDA, but have custom environment.yaml files to specify custom packages. Hopefully these repositories and the CI setup can form useful templates for other scenarios where we want to do something similar.
With the custom permission-scoped profile options, we were able to offer these custom images by default to workshop users, without affecting the default profiles for other users.
The model of adding users to particular GitHub teams to identify them as "workshop users" worked well. We were then able to use permission-scoping in the infrastructure configuration to restrict access to certain profiles based on GitHub team membership.
This is the PR that set up the configuration for the workshop-specific profile options: https://github.com/2i2c-org/infrastructure/pull/4100/files
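The PR above is the authoritative implementation; purely to illustrate the general pattern (not the actual code), a team-gated profile list at the KubeSpawner level could look roughly like the sketch below. It assumes GitHubOAuthenticator with `populate_teams_in_auth_state` enabled, a KubeSpawner version that accepts an async profile_list callable, and an `auth_state["teams"]` layout that should be verified against the oauthenticator version in use. The team slug and image tag are hypothetical:

```python
# jupyterhub_config.py sketch: only show the workshop profile to members of a
# specific GitHub team, everyone else gets the standard profile.
WORKSHOP_TEAM = "ssim-ghg-2024"  # hypothetical GitHub team slug

workshop_profile = {
    "display_name": "SSIM-GHG workshop (Python)",
    "kubespawner_override": {"image": "public.ecr.aws/example/ssim-ghg-python:latest"},
}
default_profile = {
    "display_name": "Standard environment",
    "default": True,
    "kubespawner_override": {},
}

async def profiles_for_user(spawner):
    """Return the profile list visible to this user, based on GitHub teams."""
    auth_state = await spawner.user.get_auth_state() or {}
    team_slugs = {team.get("slug") for team in auth_state.get("teams", [])}
    if WORKSHOP_TEAM in team_slugs:
        return [workshop_profile]
    return [default_profile]

c.KubeSpawner.profile_list = profiles_for_user
```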
This also helped us to test "tiered-access" to specific user groups, toward #19 .
Supporting the SSIM-GHG workshop was a great learning experience. Early, direct communication with the workshop organizers was hugely useful: it let us get into details and dig into solutions that were feasible to implement and that solved real end-user problems.
This did end up taking a fair bit of our time, but a big part of that was because we were doing things like configuring custom node pools and profile options and creating custom images for the first time (for some of us). I feel good about our ability to streamline this process greatly down the line.
In the coming quarter, I'd like to see us formalize the workshop process so we can scale our support for workshops. We can use the experience from the SSIM-GHG workshop to come up with a list of questions to answer in advance, and document examples of setting up groups, profile permissions, and custom images.
@freitagb @wildintellect it would be nice to discuss where we should collate documentation related to running workshops. I imagine the infrastructure material above is one part of it, but there are probably other things to think about, and it might be nice to collate a "workshop handbook" or similar somewhere?
Pasting below feedback from Sourish Basu, the workshop conductor for the SSIM-GHG workshop. Overall, things seem to have gone well, but there is some very useful feedback on the experience, and I'll work on figuring out how best to ticket it / incorporate it into our future work plans:
Indeed, things worked as expected, thanks for your help. Having the GHG Center compute option greatly simplified the compute portion of the workshop and made it go very smoothly! Here are some salient points for debriefing:
Making me a temporary admin who could add users was a huge help. Even though we tried to set everything up well in advance, there were multiple students (and one instructor) who had login issues. So I ended up resolving those on the spot.
Having said that, now that the workshop is over, please feel free to revoke my admin privileges.
Compute on the hub worked as expected. It was great that Yuvi/Sanjay/Tarashish were able to make custom images for the workshop, that simplified things.
I was concerned about concurrent reads from ~/shared. This proved to be a valid concern. With 1-3 users on the hub, reading a 2 GB file from ~/shared took ~15 seconds. However, when the students were going through the exercises there were 15-20 concurrent users trying to read the same file. This slowed the read down to ~2 minutes. Before the workshop we had discussed alternative options such as hosting the files in an S3 bucket, then downloading them onto a local (non-NFS) filesystem. We rejected those as being too complicated and not useful pedagogically, but perhaps we need to revisit those options for future workshops.
The R image proved to be less stable than the Python image for whatever reason. There were intermittent crashes of the R kernel. Unfortunately, these were not always reproducible, and the only fix we found was restarting the server (not just the kernel).
I had mentioned to Sanjay that it would be good if the “Logout” option also stopped the server, since from a UX point of view that makes most sense. This did trip up students, who kept trying to switch between python and R images by logging out. You might want to implement that on the hub.
Also, for future workshops we should look into building a custom image that can run a Python or an R kernel, i.e., without having to switch docker images (kind of like one’s personal computer, which one does not need to shut down to switch from python to R). This is because there were a few modules where it would have been good to get the students to run R and python codes side by side to compare outputs. We did not foresee this need, else we could have planned for it.
Thanks much to @yuvipanda @slesaad and @sunu for all your work on this and many many thanks to Sourish for all the detailed coordination and feedback!
cc @wildintellect - please let us know if any of the feedback points mentioned above especially resonate and are things that we should definitely ticket. Thanks!
@batpad most of that feedback is highly valuable and actionable in the next PI.
Indeed, things worked as expected, thanks for your help. Having the GHG Center compute option greatly simplified the compute portion of the workshop and made it go very smoothly! Here are some salient points for debriefing:
Making me a temporary admin who could add users was a huge help. Even though we tried to set everything up well in advance, there were multiple students (and one instructor) who had login issues. So I ended up resolving those on the spot.
+1
Having said that, now that the workshop is over, please feel free to revoke my admin privileges.
Compute on the hub worked as expected. It was great that Yuvi/Sanjay/Tarashish were able to make custom images for the workshop, that simplified things.
I was concerned about concurrent reads from ~/shared. This proved to be a valid concern. With 1-3 users on the hub, reading a 2 GB file from ~/shared took ~15 seconds. However, when the students were going through the exercises there were 15-20 concurrent users trying to read the same file. This slowed the read down to ~2 minutes. Before the workshop we had discussed alternative options such as hosting the files in an S3 bucket, then downloading them onto a local (non-NFS) filesystem. We rejected those as being too complicated and not useful pedagogically, but perhaps we need to revisit those options for future workshops.
We need to revisit how this is implemented and compare it to MAAP, which uses a mix of EFS home directories and S3 FUSE-mounted shared directories.
The R image proved to be less stable than the Python image for whatever reason. There were intermittent crashes of the R kernel. Unfortunately, these were not always reproducible, and the only fix we found was restarting the server (not just the kernel).
I'm very curious about this. Do we know how much RAM was selected? I would never start an R session with less than 4 GB of RAM, and if doing anything geospatial I'd move to a minimum of 16 GB. When I've run RStudio servers in the past, I tried to give users 32-64 GB on a regular basis. This is very different from Python.
I had mentioned to Sanjay that it would be good if the “Logout” option also stopped the server, since from a UX point of view that makes most sense. This did trip up students, who kept trying to switch between python and R images by logging out. You might want to implement that on the hub.
++ Need to document and look for easier ways to switch between instances. One of the few perks of Eclipse Che.
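On the logout point: JupyterHub does ship a setting that stops a user's server on logout, which may be worth evaluating for workshop hubs. A minimal sketch of the hub config:

```python
# jupyterhub_config.py: stop the user's server when they log out,
# so switching images via logout behaves the way students expected.
c.JupyterHub.shutdown_on_logout = True
```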
Also, for future workshops we should look into building a custom image that can run a Python or an R kernel, i.e., without having to switch docker images (kind of like one’s personal computer, which one does not need to shut down to switch from python to R). This is because there were a few modules where it would have been good to get the students to run R and python codes side by side to compare outputs. We did not foresee this need, else we could have planned for it.
MAAP has this: the R workspace in MAAP is actually conda with Python first, and R on top, so that things like reticulate work. The downside is that it would not be a pure Rocker install. I wonder if they have an image with Python also? On MAAP we recommend people just switch back and forth between workspaces, or install a custom env for Python inside the R image.
Owner(s)
@batpad @yuvipanda