Open meisterT opened 3 months ago
There are three TODOs that I need to think a little bit more about, but in general it is ready for review (and works both on my laptop, a desktop and on CI).
We lose some of the other integration tests, can you leave the gitlab configs for now so we can add those back in the future?
Happy to revert what needs to be reverted but I fail to see what we are losing.
You checked if the cgroupv1 also still worked?
Locally: yes.
We lose some of the other integration tests, can you leave the gitlab configs for now so we can add those back in the future?
Happy to revert what needs to be reverted but I fail to see what we are losing.
The GitLab version tested:
I think not all were copied over and would require a matrix job.
Since judging was basically disabled on gitlab we didn't get a lot of use from the matrix
Since judging was basically disabled on gitlab we didn't get a lot of use from the matrix
I agree, but I think those tests found errors in the past. But if you think they're not needed let's leave it like this.
There are three TODOs that I need to think a little bit more about, but in general it is ready for review (and works both on my laptop, a desktop and on CI).
The most important part is testing this PR to figure out where it works and where it doesn't.
@nickygerritsen already tested on Mac plus docker and we have other people who committed to testing on Debian stable, fedora and Arch.
I tried to run this on my Arch Linux laptop using docker compose up
, but I got the same error as when running on the main
branch:
domjudge-1 | Error: cgroup support missing memory features in running kernel. Unable to continue.
domjudge-1 | To fix this, please make the following changes:
domjudge-1 | 1. In /etc/default/grub, add 'cgroup_enable=memory swapaccount=1' to GRUB_CMDLINE_LINUX_DEFAULT.
domjudge-1 | On modern distros (e.g. Debian bullseye and Ubuntu Jammy Jellyfish) which have cgroup v2 enabled by default,
domjudge-1 | you need to add 'systemd.unified_cgroup_hierarchy=0' as well.
domjudge-1 | 2. Run update-grub
domjudge-1 | 3. Reboot
However, I do have cgroup_enable=memory swapaccount=1
currently in my /proc/cmdline
.
It looks like I cannot run cgroups v2 without also having support for v1, unless I'm missing some configuration setting somewhere :stuck_out_tongue: I'm guessing that judge/create_cgroups.in
should be updated as well? The error message also appears to be outdated now regarding the "modern distros" part :smile:
The create_cgroups script should not be needed. I commented out the whole thing on my docker setup and that worked. What error is the judgedaemon giving you?
In my docker compose up
log, I got:
domjudge-1 | 2024-09-26 08:09:29,624 INFO spawned: 'judgedaemon0' with pid 1956
domjudge-1 | 2024-09-26 08:09:29,628 INFO spawned: 'judgedaemon1' with pid 1957
domjudge-1 | 2024-09-26 08:09:30,217 WARN exited: judgedaemon0 (exit status 1; not expected)
domjudge-1 | 2024-09-26 08:09:30,217 WARN exited: judgedaemon1 (exit status 1; not expected)
[... repeats three times, followed by ...]
domjudge-1 | 2024-09-26 08:09:36,815 INFO gave up: judgedaemon0 entered FATAL state, too many start retries too quickly
domjudge-1 | 2024-09-26 08:09:36,816 INFO gave up: judgedaemon1 entered FATAL state, too many start retries too quickly
However, when looking in ./output/log/judge.domjudge-contributor-0.log
, it appeared that my judgehost password was incorrect, because I had a DB dump loaded from a previous contest. :innocent: So I backed up my current DB volume did a docker compose pull
and docker compose rm -v
(removing volumes) to re-initialize the database. Now, the judgehosts do start (indeed, with commenting out judge/create_cgroups.in
) :smile:
@mpsijm thanks for testing, can you add cgroup_set_default_logger(CGROUP_LOG_DEBUG);
to the main
in runguard.cc
and try again?
I did some initial testing and can reproduce the debug output in a container. As far as I understand, moving the process out of the container slice is not permitted. When mounting the complete cgroupfs, the processes are already running in a cgroup:
# grep -r $(echo $$) /sys/fs/cgroup --include 'cgroup.procs'
/sys/fs/cgroup/machine.slice/machine-libpod_pod_eb6f870aeb47c7f68c39940ee1841f7c1503844369812ca28001ba2180b90a5a.slice/libpod-4be6c5ad0c860471cc3a64fb8b37e30d67c9cc03b844508bceffeb01187ef6a5.scope/container/cgroup.procs:2474
Manually creating a domjudge
group under the container slice and moving spawned processes in it is permitted from within the container, but it won't allow moving processes outside of the slice.
Processes in the container think that they are in the cgroup:
$ cat /proc/self/cgroup
0::/
If we do not mount /sys/fs/cgroup
in the container, the correct slice is mounted on /sys/fs/cgroup
, but we are not permitted to create subgroups in it. We can see that a spawned process is in this virtual root group.
If you are running docker, you might need to pass --privileged --cgroupns=host --init
(or a subset of it) as we do in the github integration test.
If you are running docker, you might need to pass
--privileged --cgroupns=host --init
(or a subset of it) as we do in the github integration test.
The (podman in my setup) container is already running privileged, and I tried using the host cgroupns but that did not solve it.
I've added cgroup: host
and init: true
to the domjudge
service in docker-compose.yml
, and that works! :partying_face: It also works when only adding cgroup: host
. It does not work when only adding init: true
. The privileged: true
option was already there.
I've checked for some submissions that should time out that they do, and for submissions that use too much memory that they are terminated, looks like it works :smile: I haven't tested the scenario where the compiler uses too much memory or times out, because I wouldn't know how to :joy:
So we need to test if cgroup: host
also works on legacy v1 systems and on macOS and if so make that the default I'd say
This is just to test whether it works on CI, it needs a lot of cleanup.