actions / runner

The Runner for GitHub Actions :rocket:
https://github.com/features/actions
MIT License

Job Interference using Reusable Workflows with Matrix Strategy #2475

Open asraa opened 1 year ago

asraa commented 1 year ago

Describe the bug

Using reusable workflows with matrix strategies is now supported (thank you!). However, there seems to be a flaky bug where the jobs created by the reusable workflow interact in unexpected ways. Let's say there is a matrix strategy creating two reusable workflow calls, BUILD_1 and BUILD_2. Then,

https://github.com/project-oak/oak/actions/runs/4347726368/jobs/7604490843 https://github.com/project-oak/oak/actions/runs/4347726368/jobs/7604492489

BUT the inputs to their steps were different and corresponded to the different invocations of the reusable workflow.

This happened after the workflow failed and was re-run from the UI.

Are these reusable workflow invocations truly isolated and running in different environments? We are particularly concerned because our project requires isolation in each reusable workflow environment in order to produce a build that satisfies isolation properties from other builds.

To Reproduce

Steps to reproduce the behavior:

  1. Use a reusable workflow with a matrix strategy, where one job requires another. For example, one job uploads an artifact and the second downloads it.

    jobs:
      build_binary:
        # We use the same job template to generate provenances for multiple binaries.
        strategy:
          fail-fast: false
          matrix:
            buildconfig:
              - buildconfigs/slsav1_oak_functions_enclave_app.toml
              - buildconfigs/slsav1_oak_tensorflow_enclave_app.toml

        permissions:
          actions: read
          id-token: write
          contents: write
          pull-requests: write
        uses: ./.github/workflows/reusable_provenance.yaml
        with:
          build-config-path: ${{ matrix.buildconfig }}
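For readers who want to reproduce this, the called workflow could look roughly like the sketch below. This is not the project's actual `reusable_provenance.yaml`: the job names (`build`, `provenance`), the step id, the placeholder build command, and the way the artifact name is derived are all assumptions made for illustration. Only the `build-config-path` input comes from the caller above.

```yaml
# Illustrative sketch of a called workflow. Job names, step ids, the
# artifact naming scheme, and the build command are hypothetical.
on:
  workflow_call:
    inputs:
      build-config-path:
        required: true
        type: string

jobs:
  build:
    runs-on: ubuntu-latest
    outputs:
      # Passed to the downstream job so both jobs agree on the artifact name.
      artifact-name: ${{ steps.name.outputs.artifact-name }}
    steps:
      - uses: actions/checkout@v4
      - id: name
        env:
          CONFIG_PATH: ${{ inputs.build-config-path }}
        # Derive a per-invocation name from the config file name (assumption).
        run: |
          echo "artifact-name=$(basename "$CONFIG_PATH" .toml)" >> "$GITHUB_OUTPUT"
      - name: Placeholder for the real build
        env:
          CONFIG_PATH: ${{ inputs.build-config-path }}
        run: |
          cp "$CONFIG_PATH" build-output.txt
      - uses: actions/upload-artifact@v4
        with:
          name: ${{ steps.name.outputs.artifact-name }}
          path: build-output.txt

  provenance:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v4
        with:
          name: ${{ needs.build.outputs.artifact-name }}
```

Deriving the artifact name from the input keeps the two matrix invocations from uploading under the same name, which makes it easier to tell an artifact problem apart from the job interference described here.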

Expected behavior

We expect each BUILD to run in a separate environment, and the JOBs not to interact.

Runner Version and Platform

Image: ubuntu-20.04 Version: 20230224.2

OS of the machine running the runner? Linux

What's not working?

A job appears twice in one build, but not in the other: [image: unnamed]

Job Log Output

Runner and Worker's Diagnostic Logs

https://github.com/project-oak/oak/actions/runs/4347726368/jobs/7604490843 https://github.com/project-oak/oak/actions/runs/4347726368/jobs/7604492489

cc @rbehjati @laurentsimon @ianlewis @jhutchings1

asraa commented 1 year ago

Update: the problem where the artifact is missing may be related to https://github.com/actions/upload-artifact/issues/389? In this case, the artifacts ARE uploaded to the UI.

rbehjati commented 1 year ago

Thanks for reporting this @asraa.

I think these are potentially two problems. One is an issue with the "matrix strategy" where the generate-build-definition step is run twice for one job, and never for the other.

The other issue is with the upload-artifact action (or our usage of it). Here is another run, where the matrix strategy is replaced with two separate jobs, each calling the same reusable workflow with a separate input. Downloading the artifact fails in this case too.
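A minimal sketch of that non-matrix variant, assuming the same caller as in the original report: the job names `build_functions` and `build_tensorflow` are hypothetical, while the workflow path, permissions, and config paths are copied from the snippet in the issue description.

```yaml
# Sketch only: two explicit jobs instead of a matrix. Job names are hypothetical.
jobs:
  build_functions:
    permissions:
      actions: read
      id-token: write
      contents: write
      pull-requests: write
    uses: ./.github/workflows/reusable_provenance.yaml
    with:
      build-config-path: buildconfigs/slsav1_oak_functions_enclave_app.toml

  build_tensorflow:
    permissions:
      actions: read
      id-token: write
      contents: write
      pull-requests: write
    uses: ./.github/workflows/reusable_provenance.yaml
    with:
      build-config-path: buildconfigs/slsav1_oak_tensorflow_enclave_app.toml
```

Because each call is its own job here, a failure that shows up in this form cannot be attributed to the matrix expansion, which is why it helps separate the two problems.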

rbehjati commented 1 year ago

Note: If I remember correctly, the double run issue (in https://github.com/project-oak/oak/actions/runs/4347726368/jobs/7604641386) happened when I re-ran all the workflows (using the Re-run jobs button in the UI).

asraa commented 1 year ago

In the run where the step ran twice, I noticed something really weird. Both of the jobs had the same full name (so they were both referencing the same matrix strategy input):

https://github.com/project-oak/oak/actions/runs/4347726368/jobs/7604490843 https://github.com/project-oak/oak/actions/runs/4347726368/jobs/7604492489

BUT their step inputs were different and corresponded to the different invocations of the reusable workflow, which caused both uploads to occur.

asraa commented 1 year ago

Update: I think we have debugged the upload issue. The double run issue still remains.

github-actions[bot] commented 6 months ago

This issue is stale because it has been open 365 days with no activity. Remove stale label or comment or this will be closed in 15 days.

laurentsimon commented 6 months ago

Please do not close