Sage-Bionetworks / sage-monorepo

Where OpenChallenges, Schematic, and other Sage open source apps are built
https://sage-bionetworks.github.io/sage-monorepo/
Apache License 2.0

[Story] Minimize the time it takes to initialize the dev container during CI #1975

Open tschaffter opened 10 months ago

tschaffter commented 10 months ago

What projects is this feature for?

No response

Description

Background

The CI workflow has recently been updated to use the Dev Container CLI to run tasks in the dev container. This ensures that the environment used by the CI workflow is the same as the development environment used by developers. This container also provides all the tools needed by the CI workflow and eliminates the need to maintain different versions of the tools in the CI workflow.

One drawback of the current implementation is that the dev container does not cache dependencies for Python, Java, Node.js, etc. This means that dependencies need to be downloaded again from remote servers each time the CI workflow runs.

Initializing the dev container in the CI workflow is composed of two steps: 1) start the dev container and 2) run the command workspace-install. Together, these two steps can take up to 6 minutes.

Goal

The goal of this ticket is to explore ways to minimize the initialization time of the dev container. The tasks are:

Anything else?

No response

tschaffter commented 10 months ago

Time it takes to start the dev container

This workflow step takes slightly less than 2 minutes based on the runtime of past runs.

The Dev Container CLI performs the following operations:

  1. Download the base image of the dev container
  2. Build a new image: base image + install features + VS Code extensions + configuration defined in devcontainer.json

Options to reduce the startup time:

tschaffter commented 10 months ago

About stopping the dev container

This step takes 4 seconds, which is not much. It's also probably safer to shut down the container properly.

tschaffter commented 10 months ago

GH Runner hardware resources

The CPU and RAM information matches the following GH runner, though we seem to have access to more storage.

Source: https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners

Information about large runners is available here.

tschaffter commented 10 months ago

About downloading the base dev container image

Observation:

Questions:

tschaffter commented 10 months ago

About installing VS Code extensions

I thought that the Dev Container CLI installed the VS Code extensions specified in devcontainer.json when running devcontainer up. However, nothing in the logs shows that extensions are installed. Removing the extensions from the definition file also does not seem to speed up devcontainer up.

tschaffter commented 10 months ago

Docker images are not cached

Cache Docker layers says:

Run actions/cache@v3
  with:
    path: /tmp/.buildx-cache
    key: Linux-single-buildx-82425368edcaf7acea638514ddd0cdc3809229bd
    restore-keys: Linux-single-buildx

    enableCrossOsArchive: false
    fail-on-cache-miss: false
    lookup-only: false
  env:
    NX_BRANCH: 1976
    NX_RUN_GROUP: 5881219943
    NX_CLOUD_AUTH_TOKEN: 
    NX_CLOUD_ENCRYPTION_KEY: 
    NX_CLOUD_ENV_NAME: linux
    NX_BASE: df9fdd5b6b76b1a367103020ac738406a403c7e3
    NX_HEAD: d6b57f4dbbc84b5e77489f011e9527fb717a329e
Cache not found for input keys: Linux-single-buildx-82425368edcaf7acea638514ddd0cdc3809229bd, Linux-single-buildx

Post Cache Docker Layers says:

Warning: Path Validation Error: Path(s) specified in the action for caching do(es) not exist, hence no cache is being saved.

Images built by the Dev Container CLI

Run docker images
REPOSITORY                                                                                        TAG         IMAGE ID       CREATED          SIZE
vsc-sage-monorepo-2fdb6546816a84e4081081a3ee23e99475210d4401dc22f9e4d7344ae6bcd399-features-uid   latest      384c241022f4   9 seconds ago    4.78GB
vsc-sage-monorepo-2fdb6546816a84e4081081a3ee23e99475210d4401dc22f9e4d7344ae6bcd399-features       latest      1414fd24cbc7   38 seconds ago   4.06GB

Caching a new Docker folder

The post caching step says:

Warning: EACCES: permission denied, scandir '/var/lib/docker'

Docker driver & caching

Based on #1750, we cannot cache Docker images as long as we have projects that build images based on local images.

tschaffter commented 10 months ago

Still need to find where image layers are stored

devcontainer up --cache-from /tmp/.buildx-cache --workspace-folder ../sage-monorepo

does not save data to /tmp/.buildx-cache even if the folder is created beforehand.

Trying to cache /var/lib/docker/buildkit

Post docker cache says:

Warning: EACCES: permission denied, lstat '/var/lib/docker/buildkit'

tschaffter commented 10 months ago

We may need the recently added --cache-to option: https://github.com/devcontainers/cli/pull/570

I will resume working on caching the dev container image(s) when the dev container CLI v0.50.3 is released as it should include the option --cache-to.

https://github.com/devcontainers/cli/tags

tschaffter commented 10 months ago

What I learned from #1978

This was an attempt to make the GH runner run the entire job in the container with:

    runs-on: ubuntu-latest
    container:
      image: ghcr.io/sage-bionetworks/sage-devcontainer:55645b0
      options: --user root

~By default, GH runners are run as root.~

The default user used by the runner is:

Run id
uid=1001(runner) gid=123(docker) groups=123(docker),4(adm),101(systemd-journal)

On the other hand, the Sage Monorepo environment has been designed to be executed by a non-root user (vscode). Trying to run the container as non-root (--user vscode) results in the checkout step failing:

node:internal/fs/utils:345
    throw err;
    ^

Error: EACCES: permission denied, open '/__w/_temp/_runner_file_commands/save_state_feaa9342-be15-443b-927e-e3115f27f843'
    at Object.openSync (node:fs:585:3)
    at Object.writeFileSync (node:fs:2170:35)
    at Object.appendFileSync (node:fs:2232:6)
    at Object.issueFileCommand (/__w/_actions/actions/checkout/v3/dist/index.js:2945:8)
    at Object.saveState (/__w/_actions/actions/checkout/v3/dist/index.js:2862:31)
    at Object.8647 (/__w/_actions/actions/checkout/v3/dist/index.js:2321:10)
    at __nccwpck_require__ (/__w/_actions/actions/checkout/v3/dist/index.js:18251:43)
    at Object.2565 (/__w/_actions/actions/checkout/v3/dist/index.js:146:34)
    at __nccwpck_require__ (/__w/_actions/actions/checkout/v3/dist/index.js:18251:43)
    at Object.9210 (/__w/_actions/actions/checkout/v3/dist/index.js:1141:36) {
  errno: -13,
  syscall: 'open',
  code: 'EACCES',
  path: '/__w/_temp/_runner_file_commands/save_state_feaa9342-be15-443b-927e-e3115f27f843'
}

See these threads:

The checkout step fails because the ID of the container user does not match the owner of the folder mounted by the runner, and the permissions on that folder do not let other users write to it. This issue is summarized here.

    steps:
      - name: Check id
        run: |
          id
          sudo ls -al /__w/_temp/

Run id
uid=1000(vscode) gid=1001(vscode) groups=1001(vscode),27(sudo),1000(docker)
total 24
drwxr-xr-x 5 1001  123 4096 Aug 17 03:55 .
drwxr-xr-x 6 1001 root 4096 Aug 17 03:54 ..
-rw-r--r-- 1 1001  123   27 Aug 17 03:55 27042907-d87a-4624-9b13-597a25316578.sh
drwxr-xr-x 2 1001  123 4096 Aug 17 03:55 _github_home
drwxr-xr-x 2 1001  123 4096 Aug 17 03:55 _github_workflow
drwxr-xr-x 2 1001  123 4096 Aug 17 03:55 _runner_file_commands
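
The listing above can be read as a plain Unix permission check: the folders are owned by 1001:123 with mode 755, and the vscode user (uid 1000, groups 1001, 27, 1000) is neither the owner nor in group 123, so only the "other" bits apply and they do not include write. A minimal sketch of that reasoning follows. The function name `can_write` is hypothetical, and the check only looks at the classic mode bits (it ignores ACLs and capabilities).

```shell
# can_write UID "GIDS" OWNER_UID OWNER_GID MODE
# GIDS is a space-separated list of group ids; MODE is octal (e.g. 755).
# Returns 0 if the classic permission bits grant write access, 1 otherwise.
can_write() {
  local uid="$1" gids="$2" owner_uid="$3" owner_gid="$4" mode="$5"
  local perms g
  perms=$((0$mode))            # leading 0 makes the shell read MODE as octal

  # root bypasses the permission bits
  if [ "$uid" -eq 0 ]; then
    return 0
  fi

  # the owner is checked against the owner write bit only
  if [ "$uid" -eq "$owner_uid" ]; then
    [ $((perms & 0200)) -ne 0 ]
    return $?
  fi

  # a member of the owning group is checked against the group write bit only
  for g in $gids; do
    if [ "$g" -eq "$owner_gid" ]; then
      [ $((perms & 0020)) -ne 0 ]
      return $?
    fi
  done

  # everyone else falls through to the "other" write bit
  [ $((perms & 0002)) -ne 0 ]
}

# Values taken from the outputs above:
# can_write 1001 "123 4 101" 1001 123 755      -> allowed (runner owns the folder)
# can_write 1000 "1001 27 1000" 1001 123 755   -> denied (vscode is "other")
```

This is consistent with the EACCES error on /__w/_temp/_runner_file_commands, and with root or uid 1001 being able to complete the checkout.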

About the default runner user:

Checking out a repository using actions/checkout@v2 works for me, but only if I switch to a user with sufficient privileges for the default directory, for example root or 1001 (the user used by GitHub Actions):

Source

I found that setting my containers to run as the same UID/GID as my GHA runner user on the host solved the issue.

Source

tschaffter commented 10 months ago

New Approach: Mount Yarn cache folder to Dev Container in CI workflow

Returning to the current implementation, where we use the Dev Container CLI. The goal is to set up the Yarn cache folder and share it with the dev container. The tricky part is permissions, because the GH runner user who owns the cache folder and the user who runs in the dev container are different.

This is how the cache is usually set up:

      - name: Get yarn cache directory path
        id: yarn-cache-dir-path
        run: echo "dir=$(yarn config get cacheFolder)" >> $GITHUB_OUTPUT

      - uses: actions/cache@v3
        id: yarn-cache # use this to check for `cache-hit` (`steps.yarn-cache.outputs.cache-hit != 'true'`)
        with:
          path: ${{ steps.yarn-cache-dir-path.outputs.dir }}
          key: ${{ runner.os }}-yarn-${{ hashFiles('**/yarn.lock') }}
          restore-keys: |
            ${{ runner.os }}-yarn-

Source

We can't perform the first step because Yarn is not installed on the OS used by the GH runner. Instead, we have Yarn in the dev container. Running the command in the dev container returns:

vscode@34f14f659357:/workspaces/sage-monorepo$ yarn config get cacheFolder
/workspaces/sage-monorepo/.yarn/cache

The content of the cache folder on the runner shows that the files are owned by runner:docker.

Run ls -al /home/runner/work/sage-monorepo/sage-monorepo/.yarn/cache
total 374676
drwxr-xr-x 2 runner docker   311296 Aug  4 23:27 .
drwxr-xr-x 5 runner docker     4096 Aug 17 16:09 ..
-rw-r--r-- 1 runner docker       26 Jul  5 17:18 .gitignore
-rw-r--r-- 1 runner docker     4355 Jul  5 17:18 2-thenable-npm-1.0.0-3c202a902b-567cda6fb2.zip
-rw-r--r-- 1 runner docker     5659 Jul 20 00:43 @aashutoshrathi-word-wrap-npm-1.2.6-5b1d95e487-ada901b9e7.zip
-rw-r--r-- 1 runner docker    18024 Jul  5 17:18 @actions-exec-npm-1.1.1-90973d2f96-d976e66dd5.zip
-rw-r--r-- 1 runner docker    12407 Jul  7 22:04 @actions-github-npm-5.1.1-61d3d8cdac-2210bd7f8e.zip
...

Looking into the cache folder from within the container:

      - name: ls yarn cache folder inside the dev container
        run: |
          devcontainer exec --workspace-folder ../sage-monorepo bash -c ". ./dev-env.sh \
            && ls -al /workspaces/sage-monorepo/.yarn/cache"

Output:

total 374676
drwxr-xr-x 2 vscode vscode   311296 Aug  4 23:27 .
drwxr-xr-x 5 vscode vscode     4096 Aug 17 16:20 ..
-rw-r--r-- 1 vscode vscode     4355 Jul  5 17:18 2-thenable-npm-1.0.0-3c202a902b-567cda6fb2.zip
-rw-r--r-- 1 vscode vscode     5659 Jul 20 00:43 @aashutoshrathi-word-wrap-npm-1.2.6-5b1d95e487-ada901b9e7.zip
-rw-r--r-- 1 vscode vscode     6283 Jul  5 17:20 abab-npm-2.0.6-2662fba7f0-6ffc1af4ff.zip
-rw-r--r-- 1 vscode vscode     2938 Jul  5 17:20 abbrev-npm-1.1.1-3659247eab-a4a97ec07d.zip
-rw-r--r-- 1 vscode vscode    25232 Jul  5 17:20 abort-controller-npm-3.0.0-2f3a9a2bcb-170bdba9b4.zip
-rw-r--r-- 1 vscode vscode     6503 Jul  5 17:20 accepts-npm-1.3.8-9a812371c9-50c43d32e7.zip
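
The two listings appear to show the same folder from both sides of the workspace mount: same total, same files, different owner names. Because Yarn is only available inside the dev container, the path it reports has to be translated back to a runner path before it can be handed to a caching step on the runner. A minimal sketch, assuming the workspace is mounted at /workspaces/sage-monorepo as in the outputs above (the function name `to_runner_path` is hypothetical):

```shell
# Translate a path inside the dev container to the matching path on the runner.
# Both roots are taken from the outputs above; adjust them if the mount differs.
to_runner_path() {
  local container_root="/workspaces/sage-monorepo"
  local runner_root="/home/runner/work/sage-monorepo/sage-monorepo"
  case "$1" in
    "$container_root"|"$container_root"/*)
      # strip the container prefix and prepend the runner prefix
      printf '%s\n' "$runner_root${1#"$container_root"}"
      ;;
    *)
      printf 'not under %s: %s\n' "$container_root" "$1" >&2
      return 1
      ;;
  esac
}

# Example:
# to_runner_path /workspaces/sage-monorepo/.yarn/cache
```

The resulting runner path is what a cache step running outside the container would need as its path input.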

tschaffter commented 10 months ago

Cache when workflow run for PR vs push to main

The caches created by actions/cache@v3 when running the CI workflow in a PR are not available to main.

This makes sense, since doing otherwise would enable a third party to affect the workflows running on main, e.g. by deliberately poisoning the cache or by degrading performance by filling up the cache (max 10 GB).

I observed this because the workflow running for main had the same cache key for Poetry as the key used when running the workflow for a PR, yet the cache could not be found for main. I reran the workflow on main and that time the cache was available, because it had been created during the first run on main.
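
This observation matches GitHub's documented cache scoping: a run can restore caches created on its own ref or on the default branch, and a pull request run can additionally restore caches from its base branch. A simplified sketch of that rule follows. The function name `cache_restorable` and the ref strings in the example are illustrative assumptions, not values taken from the workflow.

```shell
# cache_restorable CREATED_ON READER_REF DEFAULT_REF [BASE_REF]
# Returns 0 if a cache created on CREATED_ON can be restored by a run on
# READER_REF, 1 otherwise. BASE_REF is only set for pull request runs.
cache_restorable() {
  local created="$1" reader="$2" default_ref="$3" base="${4:-}"

  # same ref: a rerun on the same branch or the same PR
  if [ "$created" = "$reader" ]; then
    return 0
  fi
  # caches from the default branch are visible to every run
  if [ "$created" = "$default_ref" ]; then
    return 0
  fi
  # PR runs can also read caches from their base branch
  if [ -n "$base" ] && [ "$created" = "$base" ]; then
    return 0
  fi
  return 1
}

# A cache written by a PR run is not visible to main:
# cache_restorable refs/pull/1976/merge refs/heads/main refs/heads/main
# A cache written on main is visible to the PR run:
# cache_restorable refs/heads/main refs/pull/1976/merge refs/heads/main refs/heads/main
```

Under this rule, the first run on main after a merge always starts without the PR's caches, even when the cache key is identical.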

tschaffter commented 9 months ago

Update 2023-10-03

Two tasks remain. The first is to review the Nx Cloud config, though I think it's working as expected. The second is the adoption of pnpm, which should be tested more thoroughly.

tschaffter commented 9 months ago

Added to Sprint 23.10