JuliaCI / PkgEval.jl

Keeping tabs on the julia ecosystem

Switch from Sandbox.jl to OCI runtime (crun) #177

Closed · maleadt closed this 1 year ago

maleadt commented 1 year ago

Adding features to Sandbox.jl has been a bit of a chore because that package aims to be compatible with Docker.jl, which is a great idea for Sandbox.jl, but makes it harder to do more low-level things that don't generalize as well (e.g. https://github.com/staticfloat/Sandbox.jl/pull/106, https://github.com/staticfloat/Sandbox.jl/pull/67). In this PR, I'm switching to using a container runtime (crun) directly. That's pretty easy nowadays: just generate an appropriate config.json (not unlike Sandbox.jl's SandboxConfig) and call crun run. This should make it easier to address https://github.com/JuliaCI/PkgEval.jl/issues/158, implement resource limits, add GPU support, etc.
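For illustration, a minimal sketch of that flow, assuming JSON.jl is available and a rootfs has already been extracted. The field names follow the OCI runtime spec, but the bundle layout, container name, and command here are made up, and the config PkgEval actually generates is far more elaborate:

```julia
using JSON  # assumes JSON.jl is available

# Build a minimal OCI runtime config.json; a real config carries many more
# fields (mounts, capabilities, resource limits, environment, ...).
function write_oci_config(bundle::String, rootfs::String, args::Vector{String})
    # Linux-only: map the invoking user to root inside the container
    uid = Int(ccall(:getuid, Cuint, ()))
    gid = Int(ccall(:getgid, Cuint, ()))
    config = Dict(
        "ociVersion" => "1.0.2",
        "root" => Dict("path" => rootfs, "readonly" => true),
        "process" => Dict(
            "terminal" => false,
            "user"     => Dict("uid" => 0, "gid" => 0),
            "cwd"      => "/",
            "env"      => ["PATH=/usr/local/bin:/usr/bin:/bin"],
            "args"     => args,
        ),
        "linux" => Dict(
            # rootless containers need a user namespace
            "namespaces"  => [Dict("type" => t) for t in ("user", "mount", "pid", "ipc", "uts")],
            "uidMappings" => [Dict("containerID" => 0, "hostID" => uid, "size" => 1)],
            "gidMappings" => [Dict("containerID" => 0, "hostID" => gid, "size" => 1)],
        ),
    )
    open(io -> JSON.print(io, config, 4), joinpath(bundle, "config.json"), "w")
end

# hypothetical usage: write the bundle and hand it to crun
bundle = mktempdir()
write_oci_config(bundle, "/path/to/rootfs", ["/usr/local/bin/julia", "-e", "1+1"])
run(`crun run --bundle $bundle pkgeval-sandbox`)
```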

This PR is now feature complete and at feature parity with Sandbox.jl (the only exception being that the rootfs isn't writable anymore, only select directories). It also uses the added control to fix the package cache, mounting .julia/{packages,artifacts,compiled} as overlayfs mounts that each container can freely modify. This should fix https://github.com/JuliaCI/PkgEval.jl/issues/158.
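Such overlay mounts could be expressed as OCI mount entries along these lines (a sketch with made-up paths, not PkgEval's actual layout; the upper and work directories must live on the same writable filesystem):

```julia
# Illustrative: expose each shared depot directory as an overlayfs whose
# read-only lower layer is the host depot and whose upper/work layers are
# private scratch space for this container.
function overlay_mounts(host_depot::String, scratch::String)
    map(["packages", "artifacts", "compiled"]) do dir
        upper = joinpath(scratch, dir, "upper"); mkpath(upper)
        work  = joinpath(scratch, dir, "work");  mkpath(work)
        Dict(
            "destination" => "/root/.julia/$dir",
            "type"        => "overlay",
            "source"      => "overlay",
            "options"     => ["lowerdir=" * joinpath(host_depot, dir),
                              "upperdir=$upper",
                              "workdir=$work"],
        )
    end
end
```

This way each container sees a writable view of the shared cache, while its writes stay in a private upper layer instead of polluting the host depot.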

DilumAluthge commented 1 year ago

Just out of curiosity:

  1. Can we still use the same rootfs images?
  2. Would it be useful to eventually factor this functionality out into a separate package?
maleadt commented 1 year ago

Can we still use the same rootfs images?

Yes

Would it be useful to eventually factor this functionality out into a separate package?

Even better, it could just be added as a back-end (Executor) to Sandbox.jl. My experimentation here forms a good basis for that, although I'll likely keep using the OCI runtime directly instead of moving back to Sandbox.jl (for more control over the sandbox configuration).

maleadt commented 1 year ago

I think this is currently incompatible with amdci because it's still running kernel 5.4, so we'll have to wait until @vchuravy upgrades that system... I'll extract unrelated changes from this PR into separate ones to reduce the scope.

maleadt commented 1 year ago

Apart from https://github.com/containers/crun/issues/1088, this seems to work fine on amdci now too (even though those machines currently run Ubuntu 18.04 with kernel 5.4, Canonical seems to have back-ported unprivileged user namespaces and overlayfs support). I'll give this a spin on a full run tomorrow.
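Since back-ported support can't be inferred from the kernel version alone, one way to probe for both features at runtime is to just try them (a sketch, not the check PkgEval actually uses; it assumes util-linux's unshare is installed, and the sysctl below is Debian/Ubuntu-specific):

```julia
# Rough feature probes for unprivileged user namespaces and unprivileged
# overlayfs mounts inside a user namespace.
function unprivileged_userns_available()
    sysctl = "/proc/sys/kernel/unprivileged_userns_clone"
    isfile(sysctl) && strip(read(sysctl, String)) == "0" && return false
    # try to actually create a user namespace
    return success(`unshare --user --map-root-user true`)
end

function unprivileged_overlayfs_available()
    lower, upper, work, mnt = mktempdir(), mktempdir(), mktempdir(), mktempdir()
    cmd = `unshare --user --map-root-user --mount mount -t overlay overlay -o lowerdir=$lower,upperdir=$upper,workdir=$work $mnt`
    return success(pipeline(cmd, stdout=devnull, stderr=devnull))
end
```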

maleadt commented 1 year ago

Two similar runs, to compare:

I noticed that CPU usage was remarkably lower, ~75% instead of 99%, yet the total duration is similar. I found this fishy, but the server logs confirm it:

sandbox
Dec 05 05:47:53 amdci8 bash[114126]:       From worker 2:        Running tests: 4877 remaining
Dec 05 10:17:54 amdci8 bash[114126]:       From worker 2:        Running tests: 1 remaining (ETA: 0:02:10)
Dec 05 10:20:08 amdci8 bash[114126]:       From worker 2:        Removed 596 duplicate evaluations that resulted from retrying tests.

crun
Dec 06 03:05:58 amdci8 bash[94260]:       From worker 2:        Verifying artifacts...
Dec 06 03:08:00 amdci8 bash[94260]:       From worker 2:        Running tests: 4869 remaining
Dec 06 07:33:38 amdci8 bash[94260]:       From worker 2:        Running tests: 1 remaining (ETA: 0:42:39)
Dec 06 07:39:31 amdci8 bash[94260]:       From worker 2:        Removed 243 duplicate evaluations that resulted from retrying tests.

Also interesting is the significant increase in sequential duration, i.e., the time it takes to complete individual tests. Example:

I need to investigate what's happening here.

maleadt commented 1 year ago

Another test: https://s3.amazonaws.com/julialang-reports/nanosoldier/pkgeval/by_hash/5060aed/report.html Looks much better -- some increase in execution time is expected because we now evaluate the cache every time. I optimized that code quite a bit though, which is why the total duration improved from 11 days to a bit over 9.

maleadt commented 1 year ago

GitHub has changed how they support cgroups, and apparently removed cpuset support. We'll need to improve detection of that for CI to work properly again.
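For reference, such detection could look roughly like this (a sketch under the assumption of a standard /sys/fs/cgroup layout, not the actual check used here):

```julia
# Check whether the cpuset cgroup controller is usable before trying to
# constrain CPUs, covering both the cgroup v2 unified hierarchy and v1.
function cpuset_available()
    controllers = "/sys/fs/cgroup/cgroup.controllers"   # cgroup v2
    if isfile(controllers)
        return "cpuset" in split(read(controllers, String))
    end
    return isdir("/sys/fs/cgroup/cpuset")               # cgroup v1
end
```

When the controller is missing, the runner could skip CPU pinning instead of failing outright.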