Sage-Bionetworks / sage-monorepo

Where OpenChallenges, Schematic, and other Sage open source apps are built
https://sage-bionetworks.github.io/sage-monorepo/
Apache License 2.0

[Bug] Fix broken `main` (October 2023) #2237

Closed tschaffter closed 11 months ago

tschaffter commented 11 months ago

Is there an existing issue for this?

What product(s) are you seeing the problem on?

Sage Monorepo

Current behavior

The CI workflow fails on the `main` branch. The error happens when building and publishing the Docker images. Rerunning the workflow did not solve the issue.

Expected behavior

No response

Anything else?

No response

Commit ID

6b7f7ac4591435371d04ef309cb379f2bd3ce836

Are you developing inside the dev container?

Code of Conduct

tschaffter commented 11 months ago

The issue first appeared when I updated Prettier. Since then, it has affected all pushes to `main`.

Failed tasks:

- openchallenges-challenge-service:build-image-base
- openchallenges-organization-service:build-image-base
- openchallenges-image-service:build-image-base

For example:

BUILD FAILED in 11s
5 actionable tasks: 2 executed, 3 up-to-date
Failed to execute command: ./gradlew bootBuildImage 
Error: Command failed: ./gradlew bootBuildImage 
    at checkExecSyncError (node:child_process:885:11)
    at execSync (node:child_process:957:15)
    at runBuilderCommand (/workspaces/sage-monorepo/node_modules/@nxrocks/common/src/lib/core/jvm/utils.js:20:38)
    at runBootPluginCommand (/workspaces/sage-monorepo/node_modules/@nxrocks/nx-spring-boot/src/utils/boot-utils.js:18:43)
    at /workspaces/sage-monorepo/node_modules/@nxrocks/nx-spring-boot/src/executors/build-image/executor.js:10:54
    at Generator.next (<anonymous>)
    at /workspaces/sage-monorepo/node_modules/tslib/tslib.js:118:75
    at new Promise (<anonymous>)
    at Object.__awaiter (/workspaces/sage-monorepo/node_modules/tslib/tslib.js:114:16)
    at buildImageExecutor (/workspaces/sage-monorepo/node_modules/@nxrocks/nx-spring-boot/src/executors/build-image/executor.js:8:20)
    at /workspaces/sage-monorepo/node_modules/nx/src/command-line/run/run.js:81:23
    at Generator.next (<anonymous>)
    at fulfilled (/workspaces/sage-monorepo/node_modules/nx/node_modules/tslib/tslib.js:166:62)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)

I re-ran the latest commit to `main` today, and this time the image of the image service built fine.

   - openchallenges-organization-service:publish-and-remove-image
   - openchallenges-image-service:publish-and-remove-image
   - openchallenges-organization-service:publish-image
   - openchallenges-organization-service:build-image
   - openchallenges-image-service:publish-image
   - openchallenges-image-service:build-image

   Failed tasks:

   - openchallenges-organization-service:build-image-base
   - openchallenges-image-service:build-image-base
tschaffter commented 11 months ago

One particularity of updating Prettier is that it triggered the tasks for all the projects in the monorepo. One side effect that could cause the error above is that the storage space available to the CI workflow was not sufficient.

But why did that impact only the images of the three microservices?
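
A quick way to test the storage hypothesis (a sketch, not something the workflow currently does) would be to print disk usage right before the image-build tasks, both for the filesystem and for Docker itself:

df -h /             # free space on the runner's root filesystem
docker system df    # space used by Docker images, containers, volumes, and build cache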

tschaffter commented 11 months ago

There is also this error related to the images:

Deleted: sha256:0a4e87eff9269728a61abf3225455f49ecd3ea06c22cf0574a839fab7af80e89
Untagged: ghcr.io/sage-bionetworks/openchallenges-app:local
Deleted: sha256:3960d934632e06eec7d88ef8f7143b144036432a05bd8b88d0cb7d700ced4b3a
Error response from daemon: No such image: 0a4e87eff926:latest
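
My guess (an assumption, not verified) is that these "No such image" messages happen because `docker images --quiet` prints the same image ID once per tag, so the `docker rmi` in the cleanup command receives duplicate IDs and fails on every ID after the first deletion. Deduplicating the IDs would make the cleanup quieter, e.g. for the app image:

docker images --filter=reference='ghcr.io/sage-bionetworks/openchallenges-app:*' --quiet \
  | sort -u \
  | xargs --no-run-if-empty docker rmi --force   # delete each image ID exactly once
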
tschaffter commented 11 months ago

I can reproduce the issue locally:

vscode@52f527f259e0:/workspaces/sage-monorepo$ nx run-many --target=build-and-remove-image \
            --projects=openchallenges-app,openchallenges-challenge-service,openchallenges-organization-service,openchallenges-image-service,openchallenges-api-gateway,schematic-api \
              --parallel=1

    ✔  nx run openchallenges-api-description:build-individuals (1s)
    ✔  nx run openchallenges-api-description:build  [local cache]
    ✔  nx run openchallenges-app-config-data:build  [local cache]
    ✔  nx run openchallenges-app-config-data:install (2s)
    ✔  nx run openchallenges-api-client-angular:build:production  [local cache]
    ✔  nx run shared-java-util:build  [local cache]
    ✔  nx run shared-java-util:install (3s)
    ✔  nx run openchallenges-api-gateway:build-image-base (32s)
    ✔  nx run openchallenges-api-gateway:build-image (2s)
    ✔  nx run schematic-api:build-image (2m)
    ✔  nx run openchallenges-app:build:production  [local cache]
    ✔  nx run openchallenges-challenge-service:build-image-base (28s)
    ✔  nx run openchallenges-organization-service:build-image-base (28s)
    ✔  nx run openchallenges-image-service:build-image-base (23s)
    ✔  nx run openchallenges-app:server:production  [local cache]
    ✔  nx run openchallenges-challenge-service:build-image (2s)
    ✔  nx run openchallenges-organization-service:build-image (2s)
    ✔  nx run openchallenges-image-service:build-image (2s)
    ✔  nx run openchallenges-app:build-image (4s)
    ✔  nx run openchallenges-api-gateway:build-and-remove-image (369ms)
    ✔  nx run schematic-api:build-and-remove-image (336ms)
    ✔  nx run openchallenges-challenge-service:build-and-remove-image (322ms)

    ✖  nx run openchallenges-organization-service:build-and-remove-image
       Untagged: ghcr.io/sage-bionetworks/openchallenges-organization-service:edge
       Untagged: ghcr.io/sage-bionetworks/openchallenges-organization-service:local
       Untagged: ghcr.io/sage-bionetworks/openchallenges-organization-service:sha-6b7f7ac
       Deleted: sha256:eba522cc7f3a06480fa78ebd5e67e273eceeec3b38661d5705ca0c67cac17239
       Error response from daemon: conflict: unable to delete 483f7a79fa23 (cannot be forced) - image is being used by running container 4a3dcfe95c83
       Error response from daemon: No such image: eba522cc7f3a:latest
       Error response from daemon: No such image: eba522cc7f3a:latest
       Warning: run-commands command "docker rmi $(docker images --filter=reference=ghcr.io/sage-bionetworks/openchallenges-organization-service:* --quiet) --force" exited with non-zero status code

    ✖  nx run openchallenges-image-service:build-and-remove-image
       Untagged: ghcr.io/sage-bionetworks/openchallenges-image-service:edge
       Untagged: ghcr.io/sage-bionetworks/openchallenges-image-service:local
       Untagged: ghcr.io/sage-bionetworks/openchallenges-image-service:sha-6b7f7ac
       Deleted: sha256:c652c92493cc101c8c19b7ce9744c8bb2bb2c600f82b977f4775f9480ea4142b
       Error response from daemon: No such image: c652c92493cc:latest
       Error response from daemon: No such image: c652c92493cc:latest
       Error response from daemon: conflict: unable to delete 84e1b573a61a (cannot be forced) - image is being used by running container 239dc1c2049b
       Warning: run-commands command "docker rmi $(docker images --filter=reference=ghcr.io/sage-bionetworks/openchallenges-image-service:* --quiet) --force" exited with non-zero status code

    ✔  nx run openchallenges-app:build-and-remove-image (343ms)

 —————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 >  NX   Ran target build-and-remove-image for 6 projects and 19 tasks they depend on (4m)

    ✔    23/25 succeeded [6 read from cache]

    ✖    2/25 targets failed, including the following:
         - nx run openchallenges-organization-service:build-and-remove-image
         - nx run openchallenges-image-service:build-and-remove-image

   View structured, searchable error logs at https://cloud.nx.app/runs/Zz6VHlKRB9

EDIT: The error is different on my end: it happened because I had containers running that use these images, so deleting the images failed.
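
For the local case, one way to make the cleanup more robust (a sketch only, not how the `build-and-remove-image` target is currently defined) is to stop any containers created from the image before forcing its removal:

IMAGE=ghcr.io/sage-bionetworks/openchallenges-organization-service
# stop containers started from this image so the removal is not blocked
docker ps --quiet --filter "ancestor=${IMAGE}" | xargs --no-run-if-empty docker stop
# then remove every local tag of the image, deduplicating the IDs
docker images --filter "reference=${IMAGE}:*" --quiet | sort -u | xargs --no-run-if-empty docker rmi --force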

tschaffter commented 11 months ago

As I expected, the issue is that only 4 GB of disk space are left before building the images. I really need to come up with a robust solution to this problem, which is tracked in an existing ticket.

Run df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        84G   80G  4.0G  96% /
tmpfs           3.4G  172K  3.4G   1% /dev/shm
tmpfs           1.4G  1.2M  1.4G   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sda15      105M  6.1M   99M   6% /boot/efi
/dev/sdb1        14G  4.1G  9.0G  31% /mnt
tmpfs           693M   12K  693M   1% /run/user/1001
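
Until a proper fix lands, one common workaround (a sketch based on what GitHub-hosted Ubuntu images typically preinstall; the exact paths and the amount of space they free are assumptions) is to delete unused toolchains and prune Docker at the start of the job:

sudo rm -rf /usr/share/dotnet        # preinstalled .NET SDKs
sudo rm -rf /usr/local/lib/android   # preinstalled Android SDK
sudo rm -rf /opt/ghc                 # preinstalled Haskell toolchain
docker system prune --all --force    # drop unused images and build cache
df -h /                              # check how much space was reclaimed
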
tschaffter commented 11 months ago

We are using the standard GitHub-hosted runner, but it looks like we have access to a larger one.

[screenshot]

See https://docs.github.com/en/actions/using-github-hosted-runners/about-larger-runners/about-larger-runners

It looks like the larger runner can be used even for jobs triggered by PRs from forks hosted outside of the Sage GitHub organization.

[screenshot]

For reference, here is the initial storage space before the commit is checked out:

[screenshot: initial storage space on the runner]
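
For future comparisons, a small diagnostic step (a sketch; running it as the first step of the job is just a suggestion) can record the specs of whichever runner picked up the job:

nproc              # number of CPU cores
free -h            # total and available RAM
df -h /            # size and free space of the root filesystem before checkout
docker system df   # space already used by Docker, if any
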
tschaffter commented 11 months ago

Runtime comparison between the default GH runner and the larger runner

Here the runtime is for applying the tasks to ALL the projects in the monorepo, not just the affected ones. In most cases, the number of projects affected by a PR will be smaller, so the CI workflow will complete faster.

Default GitHub-hosted runner

ubuntu-latest:

[screenshot: workflow run on ubuntu-latest]

The workflow "completes" in 28 minutes. Two tasks failed but wouldn't otherwise take much extra time.

Larger runner

This is the only larger runner that Sage currently makes available: `ubuntu-22.04-4core-16GBRAM-150GBSSD`

[screenshot: workflow run on the larger runner]

The workflow completes in 16 minutes!

tschaffter commented 11 months ago

I contacted IT, and we will review the amount billed for the larger runner at the end of this month.