brown-ccv / jupyterhub-docker-images


Docker and CI #1

Closed fernandogelin closed 4 years ago

fernandogelin commented 4 years ago

This PR acts as an RFC for the new process of creating environments and building and pushing Docker images to GCR for use in JupyterHub.

Overview

We use GitHub Actions and Docker Compose to create the environments and to build and push the Docker images. This new process replaces the previous one outlined in the docker-stacks repository. By moving the creation of the Docker images to GitHub Actions, we cut down the time we spend building and waiting for these images. When we build on our personal machines, the images often end up taking too much disk space and eventually Docker complains. This will not be a problem with the new process.

The GH Actions way

To create an image to be used in JupyterHub for a particular class, we need these components:

Shared Components

Exclusive Components

Each class has the following exclusive components:

General Notes

If the Dockerfile for a class needs extra steps, these should be added as stages in the same Dockerfile, and an extra service should be added to docker-compose.yml with a target key.
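
For example, a minimal sketch (the service, stage, and image names are illustrative, not the actual configuration):

```yaml
services:
  class-notebook:
    build:
      context: .
      dockerfile: Dockerfile
      target: julia           # extra stage defined in the same Dockerfile
    image: gcr.io/our-project/class-notebook:latest
```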

The actions running on push allow for streamlined development; however, I suggest we tag releases for the images that are officially used in production. The release workflow is not part of this PR and still needs to be created.

This process can also be moved to the same repo as the actual JupyterHub deployment code.

The secrets needed for this action were added at the organization level, so they can easily be reused if we create more repos.
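
For reference, the push-triggered build could look roughly like this (a sketch only; the branch name and the `GCR_JSON_KEY` secret name are illustrative, not the actual repo configuration):

```yaml
name: build-and-push
on:
  push:
    branches: [fall2020]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Log in to GCR with the org-level service account key
        run: echo '${{ secrets.GCR_JSON_KEY }}' | docker login -u _json_key --password-stdin https://gcr.io
      - name: Build and push all services defined in docker-compose.yml
        run: |
          docker-compose build
          docker-compose push
```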

mirestrepo commented 4 years ago

I think this looks great! The only things that are missing should probably be part of future PRs (which you mentioned above):

* Tagging strategy: if we keep using the strategy of a fixed tag per semester (say `fall-2020`), so that we don't need to run a Terraform update when there is an image update, then we could use something like the label of the branch being merged as the tag. See [PR Labeler](https://github.com/TimonVS/pr-labeler-action). If we want to create tags from the commit hash, then we would need to pair that with a `terraform apply` - I'm just always terrified to do that in production.
  Could also first tag as `dev-fall2020`, which is tested in the dev hub, and then change to the appropriate class tag... We probably need to chat about this point.
* Steps for installing Julia and other plugins

fernandogelin commented 4 years ago

> I think this looks great! The only things that are missing should probably be part of future PRs (which you mentioned above):

> * Tagging strategy: if we keep using the strategy of a fixed tag per semester (say `fall-2020`), so that we don't need to run a Terraform update when there is an image update, then we could use something like the label of the branch being merged as the tag. See [PR Labeler](https://github.com/TimonVS/pr-labeler-action). If we want to create tags from the commit hash, then we would need to pair that with a `terraform apply` - I'm just always terrified to do that in production.
>   Could also first tag as `dev-fall2020`, which is tested in the dev hub, and then change to the appropriate class tag... We probably need to chat about this point.

We can have a branch named after the semester (e.g. `fall2020`) and push updates to it throughout the semester. In the action workflow we can pass ~~`${GITHUB_REF##*/}`~~ `${{ github.ref }}` as a tag, which picks up the branch ref from the workflow context. We can also pass ~~`${GITHUB_SHA}`~~ `${{ github.sha }}` if we want to tag with the commit SHA. Then when the semester is over we can freeze and tag a release, and start a new branch for the next semester (or just rename the current branch and make changes there).
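
As a sketch, a workflow step along these lines could produce both tags (the image name is illustrative):

```yaml
- name: Build and push with branch and commit tags
  env:
    IMAGE: gcr.io/our-project/class-notebook
  run: |
    TAG=${GITHUB_REF##*/}        # branch name, e.g. fall2020
    docker build -t "$IMAGE:$TAG" -t "$IMAGE:$GITHUB_SHA" .
    docker push "$IMAGE:$TAG"
    docker push "$IMAGE:$GITHUB_SHA"
```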

That PR Labeler action only adds labels to PRs. Did you mean a different action?

> * Steps for installing Julia and other plugins

I'll work on this one. What I'm thinking is to use Docker's multi-stage builds; then in the docker-compose.yml we can set the target, and we can pass the target as an environment variable to each class workflow, as in the sketch below.
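
Something like this in the compose file, assuming a `BUILD_TARGET` variable set by the class workflow (names are illustrative):

```yaml
services:
  class-notebook:
    build:
      context: .
      target: ${BUILD_TARGET:-base}   # falls back to the base stage
```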

broarr commented 4 years ago

If I'm understanding correctly, I'm not sure that multi-stage builds are a good solution for plugins. Each container will start clean; if you want Julia and Python and you're using a multi-stage build, you'll have to copy the Python artifacts into the Julia stage or vice versa. I'd instead suggest using build arguments to specify what goes into the Docker images.

mirestrepo commented 4 years ago

The labeler was just for if we didn't want to use branches, and instead use labels on a PR based on its name. Branches are more flexible and cover more scenarios, but involve a bit more bookkeeping. I'm okay trying branches or releases as a first pass.

fernandogelin commented 4 years ago

> If I'm understanding correctly, I'm not sure that multi-stage builds are a good solution for plugins. Each container will start clean; if you want Julia and Python and you're using a multi-stage build, you'll have to copy the Python artifacts into the Julia stage or vice versa. I'd instead suggest using build arguments to specify what goes into the Docker images.

What I'm thinking is to have something like this:

```dockerfile
FROM image AS base
# install all base things needed for Jupyter
# install Python packages

FROM base AS julia
# this stage has Python and Julia
# install Julia packages

FROM base AS r_lang
# this stage has Python and R
# install R packages

FROM julia AS julia_r
# this stage has Python, Julia, and R
# install R packages
```

Then in the docker-compose.yml for each class, we pass the target as an env variable.

broarr commented 4 years ago

> What I'm thinking is to have something like this: [the multi-stage Dockerfile above]
>
> Then in the docker-compose.yml for each class, we pass the target as an env variable.

Oh, I get it. I wasn't thinking about rolling the resulting container down the file. It feels a little weird to me. Is there an advantage to using multistage builds over build arguments?

fernandogelin commented 4 years ago

> This looks rad! To create a new class, I just need to add a workflow file and a requirements file?

yes!
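
Roughly, the per-class additions would look like this (a sketch; `classname` stands in for the actual class name):

```
classname/requirements.txt        # packages for the class environment
.github/workflows/classname.yml   # build-and-push workflow for the class
```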

fernandogelin commented 4 years ago

> Oh, I get it. I wasn't thinking about rolling the resulting container down the file. It feels a little weird to me. Is there an advantage to using multistage builds over build arguments?

I'm not sure. How are you envisioning the build arguments working with this process?

broarr commented 4 years ago

> I'm not sure. How are you envisioning the build arguments working with this process?

I guess I was thinking something like:

```dockerfile
FROM image
ARG WITH_JULIA=false
ARG WITH_R=false

# Install base requirements

# install Julia stuff here, only when requested at build time
# (the apt packages below stand in for the real install steps)
RUN if [ "${WITH_JULIA}" = "true" ]; then \
      apt-get update && apt-get install -y julia; \
    fi

# install R stuff here, only when requested at build time
RUN if [ "${WITH_R}" = "true" ]; then \
      apt-get update && apt-get install -y r-base; \
    fi
```
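
Each class's compose file could then toggle the plugins through build args, e.g. (a sketch; names are illustrative):

```yaml
services:
  class-notebook:
    build:
      context: .
      args:
        WITH_JULIA: "true"
        WITH_R: "false"
```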
fernandogelin commented 4 years ago

> I guess I was thinking something like: [the `WITH_JULIA`/`WITH_R` build-args Dockerfile above]

Oh I see, I like this too. Not sure what the pros and cons are for ARGs vs multi-stage.

mirestrepo commented 4 years ago

Choose whatever method does a better job at caching!

broarr commented 4 years ago

I'm not sure how the intermediate containers are handled with cache. I think they're not cached? I'm gonna have to look that up.

broarr commented 4 years ago

https://pythonspeed.com/articles/faster-multi-stage-builds/

That was a helpful read. By default intermediate stages are not cached; you need to tag and push them separately and explicitly ask Docker to use them as part of its cache. This could slow down your build times if you don't push the intermediate stages.
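
Per the article, the pattern is roughly (a sketch; image names are illustrative):

```sh
# pull the previously pushed stage so it can seed the cache (ignore a miss)
docker pull gcr.io/our-project/notebook:base || true
# rebuild the stage, reusing the pulled image as cache, then push it back
docker build --target base \
  --cache-from gcr.io/our-project/notebook:base \
  -t gcr.io/our-project/notebook:base .
docker push gcr.io/our-project/notebook:base
```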

fernandogelin commented 4 years ago

> https://pythonspeed.com/articles/faster-multi-stage-builds/
>
> That was a helpful read. By default intermediate stages are not cached; you need to tag and push them separately and explicitly ask Docker to use them as part of its cache. This could slow down your build times if you don't push the intermediate stages.

Ah, good to know! Thanks for that.

mirestrepo commented 4 years ago

But that could be more reusable than the ARGs, because in the end different classes may need different permutations of the stages... does that sound right?

fernandogelin commented 4 years ago

> But that could be more reusable than the ARGs, because in the end different classes may need different permutations of the stages... does that sound right?

I think ARGs are more easily reusable in this case, because we can just pass the ARGs needed for that specific class in the workflow. With multi-stage we would end up creating more stages if a class depends on multiple stages but not all of them. I guess we can experiment when we add more complex classes.

broarr commented 4 years ago

I think neither is going to cache well. Intermediates aren't cached by default, and the ARGs cache will be invalidated each time someone changes `WITH_JULIA=false` to `WITH_JULIA=true`. If I had to guess (and that's all it is right now), I'd guess that the naive ARGs solution would cache better, but you could do complicated magic with the multi-stage builds to get better performance in the long term.

mcmcgrath13 commented 4 years ago

Maybe also have the class-specific, optional files of:

* `classname/Project.toml` (or `JuliaProject.toml`)
* `classname/RInstall.R` (just an R file with all the install-package commands needed)

fernandogelin commented 4 years ago

> This is all awesome!
>
> With the multi-stage/build args, would it be possible to do something like `COPY --from=julia:1.5 <the things that are a Julia install>`, or similarly lean on those images in a multi-stage build?
>
> Maybe also have the class-specific, optional files of `classname/Project.toml` (or `JuliaProject.toml`) and `classname/RInstall.R` (just an R file with all the install-package commands needed).

For Julia yes, and it's in the other PR (see the sketch below). But for R, no: the R packages are installed with conda.
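
For the Julia side, the idea would be something like this (a sketch only; the notebook base image is illustrative, and the paths follow the official `julia:1.5` image layout):

```dockerfile
FROM julia:1.5 AS julia_dist

FROM jupyter/base-notebook
# copy the Julia install out of the official image
COPY --from=julia_dist /usr/local/julia /usr/local/julia
ENV PATH=/usr/local/julia/bin:${PATH}
```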

mcmcgrath13 commented 4 years ago

https://www.docker.com/blog/advanced-dockerfiles-faster-builds-and-smaller-images-using-buildkit-and-multistage-builds/ this looks like it may be relevant as well.