Agoric / agoric-sdk

monorepo for the Agoric Javascript smart contract platform
Apache License 2.0

CI sometimes fails to read swingset config file #10092

Open gibson042 opened 4 days ago

gibson042 commented 4 days ago

Describe the bug

As seen at https://github.com/Agoric/agoric-sdk/actions/runs/10873209314/job/30169121954?pr=10091#step:5:1

Run cd packages/boot && yarn test | $TEST_COLLECT
  cd packages/boot && yarn test | $TEST_COLLECT
  shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
  env:
    AGORIC_AVA_USE_TAP: true
    TEST_COLLECT: tee -a _testoutput.txt
    NODE_V8_COVERAGE: coverage
    GH_ENGINE: 18.x
    CI_NODE_INDEX: 1
    CI_NODE_TOTAL: 4
    ESM_DISABLE_CACHE: true
yarn run v1.22.22
$ ava
TAP version 13
failed to load bundles/decentral-itest-vaults-config.json
not ok 1 - net-ibc-upgrade › before hook %ava-dur=96ms
#   REJECTED from ava test.before(): (SyntaxError#1)
#   SyntaxError#1: Unexpected end of JSON input
#       at JSON.parse (<anonymous>)
#       at loadSwingsetConfigFile (/home/runner/work/agoric-sdk/agoric-sdk/packages/SwingSet/src/controller/initializeSwingset.js:239:25)
#       at async ensureSwingsetInitialized (file:///home/runner/work/agoric-sdk/agoric-sdk/packages/cosmic-swingset/src/launch-chain.js:146:18)
#       at async buildSwingset (file:///home/runner/work/agoric-sdk/agoric-sdk/packages/cosmic-swingset/src/launch-chain.js:218:32)
#       at makeSwingsetTestKit (/home/runner/work/agoric-sdk/agoric-sdk/packages/boot/tools/supports.ts:501:48)
#       at makeTestContext (/home/runner/work/agoric-sdk/agoric-sdk/packages/boot/test/bootstrapTests/net-ibc-upgrade.test.ts:37:27)
#       at <anonymous> (/home/runner/work/agoric-sdk/agoric-sdk/packages/boot/test/bootstrapTests/net-ibc-upgrade.test.ts:49:15)

I suspect this can be fixed by adding a flush: true option to the writeFile call in packages/boot/tools/supports.ts, but only after raising the minimum supported Node.js version to at least v20.10.0 (cf. https://github.com/nodejs/node/pull/50009 for background). If so, we may want to consider using it in other writeFile calls as well. A possible alternative would be reading the file back and checking for the expected size, but that seems too heavyweight.
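A minimal sketch of what that change could look like, not the actual code in supports.ts. The helper name `writeFileFlushed` is hypothetical. It assumes the `flush` option is only available from Node.js v20.10.0 onward and falls back to an explicit `filehandle.sync()` on older versions:

```javascript
const fsp = require("node:fs/promises");

// Hypothetical helper: write `data` to `path` and fsync it before resolving.
const writeFileFlushed = async (path, data) => {
  const [major, minor] = process.versions.node.split(".").map(Number);
  if (major > 20 || (major === 20 && minor >= 10)) {
    // Node.js >= 20.10.0: writeFile syncs the file itself when flush is true.
    await fsp.writeFile(path, data, { encoding: "utf-8", flush: true });
    return;
  }
  // Older Node.js: open, write, fsync, and close explicitly.
  const handle = await fsp.open(path, "w");
  try {
    await handle.writeFile(data, "utf-8");
    await handle.sync();
  } finally {
    await handle.close();
  }
};
```

Note that a flush only forces data to the backing store; it does not prevent two writers from interleaving on the same path.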

Subsequent conversation on Slack identified a more likely culprit as concurrently running tests writing and reading against the same file name, a hypothesis which was locally reproducible:

```console
node --input-type=module -e '
import fs from "node:fs";
import fsp from "node:fs/promises";
const sink = () => {};
const delay = ms => new Promise(resolve => setTimeout(resolve, ms));
const waitFor = thunk => new Promise(async (resolve, reject) => {
  try {
    while (!(await thunk())) await delay(100);
    resolve();
  } catch (err) {
    reject(err);
  }
});
const path = "./dummy";
const data = Object.fromEntries(
  Array.from({ length: 10000 }, (_, i) => [`key${i.toString().padStart(6, "0")}`, i]),
);
const fn = async worker => {
  for (let i = 0; i < 1000; i++) {
    await fsp.unlink(path).then(sink, sink);
    await waitFor(() => fs.statSync(path, { throwIfNoEntry: false }) === undefined);
    await fsp.writeFile(path, JSON.stringify(data), "utf-8");
    try {
      JSON.parse(fs.readFileSync(path, "utf-8"));
    } catch (err) {
      if (err.code === "ENOENT") continue;
      throw Error(`worker ${worker} failed on attempt ${i}`, { cause: err });
    }
  }
  return;
};
for (let i = 0; i < 2; i++) fn(i);
'
file:///tmp/[eval1]:27
      throw Error(`worker ${worker} failed on attempt ${i}`, { cause: err });
      ^

Error: worker 0 failed on attempt 43
    at fn (file:///tmp/[eval1]:27:13) {
  [cause]: SyntaxError: Unexpected end of JSON input
      at JSON.parse (<anonymous>)
      at fn (file:///tmp/[eval1]:24:12)
}

Node.js v18.18.2
```

We should instead introduce a random component in these Swingset config file names.
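One way this could look, as a sketch only: the helper name `makeUniqueConfigPath` and its parameters are hypothetical, and the real path construction in supports.ts may differ. It appends a UUID from `node:crypto` so that concurrently running tests never share a file name:

```javascript
const { randomUUID } = require("node:crypto");
const path = require("node:path");

// Hypothetical helper: derive a per-call config file path so that
// concurrent tests never write to (or read from) the same file.
const makeUniqueConfigPath = (bundleDir, baseName) =>
  path.join(bundleDir, `${baseName}-${randomUUID()}.json`);
```

For example, `makeUniqueConfigPath("bundles", "decentral-itest-vaults-config")` would yield a path of the form `bundles/decentral-itest-vaults-config-<uuid>.json`, distinct on every call.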

mhofman commented 3 days ago

I am very confused by what https://github.com/nodejs/node/pull/50009 attempts to fix. An fsync should only be necessary if you attempt to read from another OS, e.g. a network-mounted file system, or if your computer crashes. I believe the Linux kernel guarantees that the same file being read by another process will use the most recently written data, even if it wasn't committed to disk.

mhofman commented 3 days ago

From a conversation on Slack, it's a lot more likely that ava concurrency results in the same config file being written by concurrent tests, potentially causing torn reads, since writing the file may take multiple write syscalls.

The solution is likely to add a random component to testConfigPath.

siarhei-agoric commented 3 days ago

> I am very confused by what nodejs/node#50009 attempts to fix. An fsync should only be necessary if you attempt to read from another OS, e.g. a network-mounted file system, or if your computer crashes. I believe the Linux kernel guarantees that the same file being read by another process will use the most recently written data, even if it wasn't committed to disk.

There are several things to consider even when a single file is used by a single process on a single host:

  1. The Linux kernel only guarantees consistent visibility of writes that have completed via the kernel's syscall interface. However, there may be several other layers of abstraction, transformation, and caching between the fs interface in JS and the actual kernel syscall. A flush from the highest-level interface would guarantee that all of the data actually makes it to its final backing store before the top-level execution proceeds further.
  2. Some applications (such as databases) often require strict ordering of separate writes to different places in the same file, or even across groups of different files. This is typically required to guarantee atomicity of transactions (potentially across a crash/reboot).
  3. A set of different processes running on the same host may use a file as a mailbox to send messages to each other. Just like in 1 and 2 above, an explicit flush would provide a mechanism to enforce completeness, ordering, and atomicity of these messages.
  4. An fsync on a file would force all of the other outstanding writes to the same file to complete and return with appropriate success/error codes.

With that in mind, only item 1 could apply to a simple config file write/read by a single process, and only in the case of an application crash or abort.
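To illustrate the ordering use case from item 2, here is a rough sketch in which a payload must be durable before a commit marker is written. The function name `commitRecord`, the two-file layout, and the marker contents are all hypothetical and are not something the test tooling does:

```javascript
const fs = require("node:fs");

// Hypothetical two-step commit: the payload is fsynced before the marker
// is written, so a durable marker implies a durable payload.
const commitRecord = (dataPath, markerPath, payload) => {
  const dataFd = fs.openSync(dataPath, "w");
  try {
    fs.writeSync(dataFd, payload);
    fs.fsyncSync(dataFd); // barrier: payload reaches the backing store first
  } finally {
    fs.closeSync(dataFd);
  }
  const markerFd = fs.openSync(markerPath, "w");
  try {
    fs.writeSync(markerFd, "committed\n");
    fs.fsyncSync(markerFd); // marker is flushed only after the payload
  } finally {
    fs.closeSync(markerFd);
  }
};
```

A config file that is written once and read back by the same process has no such ordering requirement, which is consistent with the conclusion above.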