anchore / scan-action

Anchore container analysis and scan provided as a GitHub Action
MIT License

flaky tests in GitHub actions #265

Closed willmurphyscode closed 2 months ago

willmurphyscode commented 8 months ago

The unit tests for this repo sometimes fail with an error like this:

spawn ETXTBSY

      at ToolRunner.<anonymous> (node_modules/@actions/exec/src/toolrunner.ts:443:24)
      at node_modules/@actions/exec/lib/toolrunner.js:27:71
      at Object.<anonymous>.__awaiter (node_modules/@actions/exec/lib/toolrunner.js:23:12)
      at node_modules/@actions/exec/src/toolrunner.ts:419:58
      at ToolRunner.<anonymous> (node_modules/@actions/exec/src/toolrunner.ts:419:12)
      at fulfilled (node_modules/@actions/exec/lib/toolrunner.js:24:58)


I believe this is because the tests run simultaneously, but runGrype is not threadsafe.

https://github.com/anchore/scan-action/blob/52d017bdbe923afa39369bc0cb1c89ff7463ab54/index.js#L31-L35

This has a race condition, since whatever is present in the cache may be changed by one test while another test is checking it. I've also seen ENOENT in test runs.
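ETXTBSY typically means a process tried to execute a file that was still open for writing, which fits one test spawning the binary while another is still installing it. A minimal sketch of one common mitigation follows: write to a temporary file and rename it into place, so the final path never points at a half-written file. The helper name `installAtomically` and the fake tool contents are assumptions for illustration, not the action's actual code.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical helper (not part of scan-action): publish an executable so
// that destPath only ever refers to a fully written file.
function installAtomically(destPath, contents) {
  const dir = path.dirname(destPath);
  fs.mkdirSync(dir, { recursive: true });

  // Keep the temp file in the same directory so the rename stays on one
  // filesystem; the pid and timestamp keep concurrent writers apart.
  const tmpName = `.${path.basename(destPath)}.${process.pid}.${Date.now()}.tmp`;
  const tmpPath = path.join(dir, tmpName);

  // Write and close the temp file first, with executable permissions.
  fs.writeFileSync(tmpPath, contents, { mode: 0o755 });

  // Move the finished file into place in a single step.
  fs.renameSync(tmpPath, destPath);
  return destPath;
}

// Demo with a throwaway directory standing in for the tool cache.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "tool-cache-"));
const installed = installAtomically(
  path.join(demoDir, "fake-tool"),
  "#!/bin/sh\necho ok\n"
);
```

This only narrows the window on the writer's side; it does not stop two tests from both deciding the tool is missing and installing it twice.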

willmurphyscode commented 8 months ago

We might be able to get away with just running one test at a time as a cheap way to fix this.

With maxConcurrency = 1, timing npm run test gives: 11.77s user 6.00s system 33% cpu 53.132 total. Nearly 50 seconds of that is downloading the db, which happens only once regardless of test parallelism.

I'll see if maxConcurrency = 1 fixes this.
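For reference, a sketch of what such a setting might look like in a Jest config file. The values are illustrative and the repository's real jest.config.js may differ. In Jest, `maxConcurrency` limits how many `test.concurrent` tests run at once, while `maxWorkers` limits how many worker processes run test files in parallel, so both are shown here.

```javascript
// jest.config.js (illustrative sketch, not the repository's actual file)
const config = {
  // At most one test.concurrent test at a time.
  maxConcurrency: 1,
  // At most one worker, so test files do not run in parallel.
  maxWorkers: 1,
};

module.exports = config;
```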

popey commented 2 months ago

Forgive the possibly ill-informed comment here. Would it be possible to pre-cache the grype-db in an early step before we kick off the further steps that may depend on it?

willmurphyscode commented 2 months ago

I think maybe we want to use this jest option to keep the tests from racing to install grype: https://jestjs.io/docs/cli#--runinband

@popey I believe the race condition occurs installing grype itself, not downloading the grype-db. Your suggestion still stands, and that might be the way forward, but I think given how quick the tests are, running them serially is probably a quicker fix and we'll try that first.
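The option linked above can be passed straight to the Jest CLI. A minimal sketch, assuming Jest is installed as a dev dependency of the project:

```shell
# Run all test files serially in the current process instead of
# spreading them across a pool of worker processes.
npx jest --runInBand

# -i is the short alias for the same option.
npx jest -i
```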

kzantow commented 2 months ago

I think it should already be set to run serially: https://github.com/anchore/scan-action/blob/main/jest.config.js#L4

willmurphyscode commented 2 months ago

Here's an example from sbom-action: https://github.com/anchore/sbom-action/actions/runs/9863167357/job/27235328478#step:6:246

willmurphyscode commented 2 months ago

https://github.com/anchore/scan-action/actions/runs/9864003440/job/27238006566#step:8:131 is another example of the spawn ETXTBSY flake.

willmurphyscode commented 2 months ago

Another one in sbom-action: https://github.com/anchore/sbom-action/actions/runs/9872881542/job/27263932911?pr=475#step:6:239

willmurphyscode commented 2 months ago

Both scan-action and sbom-action have their tests running in series now. We can re-open this if the issue returns.