exercism / zig-test-runner

Testing zig solutions is slow on Exercism #63

Open ee7 opened 1 year ago

ee7 commented 1 year ago

https://github.com/exercism/zig-test-runner/commit/82b56afe82002cb6342222734b6bdeb8b665734c gives a significant speedup for a RUN zig test foo command added at the bottom of the Dockerfile, but it doesn't seem to be a speedup in production.
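For context, a warm-up step like that can be sketched roughly as follows; the cache directory, paths, and exercise name are placeholders rather than the repo's actual Dockerfile contents.

```sh
# Rough sketch of a cache warm-up step (a RUN line in the Dockerfile): compile
# and run one representative test at image build time so the compiler work
# lands in the image's zig cache. All paths here are placeholders.
zig test --global-cache-dir /zig-global-cache /tmp/warmup/test_acronym.zig
```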

Some possible short-term ways to improve:

  1. Ensure that Zig can fully use the cache. Maybe some combination of:
  2. Put the cache on a tmpfs (a rough sketch of this follows the list).
  3. Update Zig. Maybe commits like https://github.com/ziglang/zig/commit/020105d0dde614538a5839ede697e63a43bf6aa6 could help.
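A rough sketch of what the tmpfs idea could look like; the mount point, cache paths, and runner arguments are placeholders, not the test runner's actual setup:

```sh
# Give Zig a cache that lives in RAM: mount a tmpfs, copy the image's
# pre-built cache onto it, and point zig at it.
docker run --rm --tmpfs /mnt/zig-cache:rw,size=512m \
  exercism/zig-test-runner acronym /solution /output

# Inside the container, before running the tests:
#   cp -r /zig-global-cache/. /mnt/zig-cache/
#   zig test --global-cache-dir /mnt/zig-cache test_acronym.zig
```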

Longer term:

  1. Use a bigger Zig cache.
  2. Wait for https://www.github.com/ziglang/zig/issues/16270.
  3. Use the x86_64 backend (currently marked as experimental).
  4. Use wasm, like https://playground.zigtools.org/ and https://github.com/zigtools/playground. Exercism doesn't support this for now.
  5. Wait for perf improvements in https://www.github.com/ziglang/zig/projects/6.
  6. Consider trying the C backend (no longer marked as experimental), and compiling with a fast C compiler. Note that the issue for generating tcc-compatible code (ziglang/zig#13576) is currently marked for Zig 1.1.0. Aside: the Nim test runner currently compiles Nim to C, and then uses tcc. A rough sketch of this route follows the list.
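To make the C-backend idea a bit more concrete, an untested sketch; it assumes -ofmt=c makes Zig emit C source, that the emitted code needs zig.h from Zig's lib directory, and that lib/ sits next to the zig binary (true for the official tarballs, but a guess here):

```sh
# Untested sketch of the C-backend route: emit C, then build it with tcc.
zig build-obj solution.zig -ofmt=c -femit-bin=solution.c

# The emitted C expects zig.h from Zig's lib directory (location is a guess).
ZIG_LIB_DIR="$(dirname "$(command -v zig)")/lib"
tcc -I"$ZIG_LIB_DIR" -c solution.c -o solution.o

# Running tests this way also involves Zig's test runner, so wiring it into
# the Exercism test runner would take more than these two commands.
```
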
ee7 commented 1 year ago

@ErikSchierboom is this and this the current and complete source of truth for how Exercism runs a container in production?

ErikSchierboom commented 1 year ago

Yes, that is correct.

ee7 commented 1 year ago

It looks like https://github.com/exercism/zig-test-runner/commit/b58066f952603c8d20bd61e3f61bba81a47415ec helped. See https://github.com/exercism/zig-test-runner/pull/71#issuecomment-1694737076.

Next I'll try to compile more of the zig stdlib and add it to the image's zig cache. (Edit: that doesn't seem to produce a significant speedup).

ee7 commented 1 year ago

I still don't understand why, if I locally:

  1. Copy a valid solution acronym.zig and its test_acronym.zig to the directory foo in the root dir of this repo
  2. Run bin/run-in-docker.sh acronym foo foo

it takes about 3 seconds on a slow machine (definitely using the container's zig cache), but it takes Exercism about 7-8 seconds to run the tests for acronym from the online editor. The latter time is roughly how long step 2 takes locally if I comment out the zig cache creation steps in the Dockerfile. So either the production container can't fully use the image's zig cache, or the extra time comes from something else (such as container startup).
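(The local measurement boils down to something like the following; the source of the copied files is a placeholder.)

```sh
# Time the dockerized run of a known-good solution. Source paths are placeholders.
mkdir -p foo
cp /path/to/acronym.zig /path/to/test_acronym.zig foo/
time bin/run-in-docker.sh acronym foo foo
```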

Any ideas?

ErikSchierboom commented 1 year ago

It must be the container startup time, because https://github.com/exercism/zig-test-runner/actions/runs/5992634417/job/16252228546 indicates a successful deploy, which I can verify by going to https://exercism.org/maintaining/tracks/zig.

ee7 commented 1 year ago

> indicates a successful deploy, which I can verify by going to https://exercism.org/maintaining/tracks/zig.

Yeah, I know about this. But it's possible for bin/run-in-docker.sh acronym foo foo to be fast locally, but slow in production, without it being entirely due to the production container startup time.

For example, it seemed like https://github.com/exercism/zig-test-runner/commit/b58066f952603c8d20bd61e3f61bba81a47415ec really was a consistent 2x speedup over the immediately preceding state. I believe that Zig builds for the native CPU by default (even in debug mode, with the rationale that otherwise it'd give up performance when running debug builds, which is sometimes important), like passing -march=native for C. We have to explicitly specify something like -target x86_64-linux-musl or -mcpu baseline to assume only the minimum available CPU features.
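To illustrate those flags (a sketch only; the real cache-warming command lives in the Dockerfile and may differ):

```sh
# Warm the cache and run the tests with an explicit baseline target, so cached
# artifacts produced on one machine (GitHub Actions) stay usable on another
# (the AWS machine). File names are placeholders.
zig test -target x86_64-linux-musl test_warmup.zig     # at image build time
zig test -target x86_64-linux-musl test_acronym.zig    # at run time, same target
```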

Locally, the CPU at execution time was the same as the CPU at build time, so Zig could use the image's cache. But that wasn't true for the image that was deployed in production, because the GitHub Actions CPU features in general won't match those of the AWS machine.

So I'm trying to remove other possible differences in production that stop Zig from fully using the image's cache. For example, I noticed that the cache manifest contains inodes, but I haven't checked yet whether that means the cache cannot be used if transferred to another filesystem at the same path, or whether there's an extra cost if you do that. But it's designed to avoid absolute paths, at least.
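(For anyone who wants to look at those manifests: something like the following shows them, assuming the global cache lives under ~/.cache/zig and keeps its manifests as h/*.txt; that layout is an observation from a local cache, not a documented guarantee.)

```sh
# Peek at Zig's cache manifests; the entries appear to include inode/mtime/size
# fields alongside the hash and a relative path, as noted above.
head -n 5 ~/.cache/zig/h/*.txt
```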

I'll ask some Zig people if I can't find out myself.

ErikSchierboom commented 1 year ago

> Locally, the CPU at execution time was the same as the CPU at build time, so Zig could use the image's cache. But that wasn't true for the image that was deployed in production, because the GitHub Actions CPU features in general won't match those of the AWS machine.

Is there anything I can run in production that will figure out the right architecture?

ee7 commented 1 year ago

> Is there anything I can run in production that will figure out the right architecture?

Yes, but: