berkeley-dsep-infra / datahub

JupyterHubs for use by Berkeley enrolled students
https://docs.datahub.berkeley.edu
BSD 3-Clause "New" or "Revised" License

Switch image for stat20 hub #4846

Open andrewpbray opened 1 year ago

andrewpbray commented 1 year ago

Hi team!

I've been working with @ryanlovett to build a Docker image for the Stat 20 curriculum: https://hub.docker.com/repository/docker/stat20/stat20-docker/general. The image is used to compile a public- and staff-facing website containing the lecture notes, slides, and assignments for the course, as well as the course documentation. All of those source docs live in this repo: https://github.com/stat20/stat20.

My question: would it make sense for the stat20 hub to pull this same image? It is only slightly bigger than the image currently served to students for running RStudio and doing their assignments (just a few extra R packages, I think). Running both from the same image should simplify image maintenance and help catch bugs that students might hit on the hub, since the curriculum will be run through CI regularly using the same image.
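For concreteness, here's a rough sketch of the kind of CI step I have in mind, using plain docker commands (the render command, the image tag, and the mount path are assumptions; the real workflow may differ):

```bash
# Render the curriculum inside the same image students would get on the hub.
# "quarto render" and the /home/rstudio paths are assumptions about our build.
docker pull stat20/stat20-docker:latest
docker run --rm \
  -v "$PWD":/home/rstudio/stat20 \
  -w /home/rstudio/stat20 \
  stat20/stat20-docker:latest \
  quarto render
```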

Happy to hear your thoughts!

ryanlovett commented 1 year ago

One issue is that all of the datahub images are managed within the berkeley-dsep-infra/datahub repo and merge rights are limited to tech infrastructure admins including me. When instructors and GSIs need any of those images to be updated, they can make pull requests which tend to be merged fairly quickly. But if the textbook were to use the stat20 hub image, there would be a new dependency on another group of people.

The other issue, and maybe the more critical one, is that datahub images are pushed to the Google Container Registry at gcr.io rather than Docker Hub. This is done to improve pull performance since the hubs run on Google Cloud. We'd have to see if we can make the images public so anyone could docker pull them. Alternatively, I could find a way to push the images to Docker Hub during CI.
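If we went the mirroring route, CI would do something roughly like this (a sketch only; both image references below are placeholders, not the real datahub image names):

```bash
# Mirror the hub user image from GCR to Docker Hub so anyone can pull it.
# Both image paths and the tag are placeholders.
docker pull gcr.io/example-project/stat20-user-image:latest
docker tag  gcr.io/example-project/stat20-user-image:latest \
            docker.io/exampleorg/stat20-user-image:latest
docker push docker.io/exampleorg/stat20-user-image:latest
```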

Another possibility is to have the stat20 datahub consume the stat20-docker image. At times we've considered moving the development of hub user images to external repos where other people have access, but it can become a management problem. Datahub images need a common set of libraries to function on a JupyterHub, and those dependencies are seeded by a script in the datahub repo. I think the datahub staff would have to plan out distributed image management before we could use this approach.
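As a quick smoke test of that compatibility point, one could check whether an externally built image carries the pieces a JupyterHub needs to launch it (only a sketch; the authoritative list is whatever the seeding script in this repo installs):

```bash
# Can the hub's spawner find the single-user server entrypoint in the image?
docker run --rm stat20/stat20-docker:latest jupyterhub-singleuser --version

# Are the hub-side Python packages importable? (assumes python3 is on PATH in the image)
docker run --rm stat20/stat20-docker:latest \
  python3 -c "import jupyterhub, jupyter_server; print('ok')"
```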

I think the first approach could work and would ensure everything is on the same stack. We'd just want to make sure that changes can be integrated in a timely manner. And we'd have to resolve the container registry issue.

ryanlovett commented 1 year ago

Another possibility is that people can do textbook development on the stat20 hub directly, without needing a local docker workflow. Changes to the image would still have to go through merge and CI in this repo.

We could also install https://github.com/jupyterhub/gh-scoped-creds, which enables some users of the hub to push to the textbook repo without needing to set up PATs or SSH keys. This was used on the stat159 datahub. I could try setting up gh-scoped-creds on the stat20 hub if you'd like, to give you a sense of how it'd work.
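The user-facing flow is pretty lightweight. A sketch from memory of the gh-scoped-creds docs (on the hub side I'd still need to register a GitHub App and point the hub at it, which isn't shown here):

```bash
# In a terminal on the hub, assuming gh-scoped-creds is installed in the image:
gh-scoped-creds
# ...this walks you through a GitHub device-flow login and caches a short-lived
# token scoped to the repos the app is installed on. After that, plain git works:
cd ~/stat20            # an existing clone of the textbook repo (path is illustrative)
git push origin main   # no PATs or SSH keys needed
```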

andrewpbray commented 1 year ago

In light of our conversation today (most dev can be done outside the container), doing dev on the hub sounds better and better. The approach for most instructors would be: feel free to do your dev locally. If one of your PRs doesn't pass the checks, then (1) read the docs about how to file a PR to add the dependency to the image, and (2) if it's not clear what the problem is, switch over to doing dev directly on the hub (that is, log into RStudio on the hub, pull down the branch you're working on, and troubleshoot there), as sketched below.
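Concretely, that "switch to the hub" step might look like this from a terminal in RStudio on the hub (a sketch; the branch name and render target are made up, and I'm assuming a Quarto build):

```bash
git clone https://github.com/stat20/stat20.git   # or `git pull` in an existing clone
cd stat20
git checkout my-feature-branch                   # hypothetical branch name
quarto render notes/failing-doc.qmd              # rerun the step that failed in CI
```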

ryanlovett commented 1 year ago

I think that'd be easiest. I spent some time with PATs today but didn't have as much time as I thought I would. I'll try to land that feature by Thursday.