DOMjudge / domjudge-packaging

DOMjudge packaging for (Linux) distributions and live image

Dockerised judgehost crashes on fork bomb submission #13

Closed: greg closed this issue 3 years ago

greg commented 6 years ago

I start the judgehost with

docker run -it --privileged -v /sys/fs/cgroup:/sys/fs/cgroup:ro --name judgehost-0 --link domserver:domserver --hostname judgedaemon-0 -e DAEMON_ID=0 -e DOMSERVER_BASEURL="<redacted>" -e JUDGEDAEMON_PASSWORD=<redacted> domjudge/judgehost:latest

and run (in another shell)

docker exec -it judgehost-0 /opt/domjudge/judgehost/bin/create_cgroups

as in #11. The executables are all as provided, no modifications. I then submit the C solution (to hello world in the demo contest):

#include <unistd.h>

int main( )
{
    while(1) {
        fork();
    }
    return 0;
}

and this happens:

[Mar 22 06:55:34.449] judgedaemon[29]: Judging submission s13 (endpoint default) (t5164706/p1/c), id j42...
[Mar 22 06:55:34.842] judgedaemon[29]: Working directory: /opt/domjudge/judgehost/judgings/judgedaemon-0-0/endpoint-default/c1-s13-j42
[Mar 22 06:55:35.166] judgedaemon[29]: executing chroot script: 'chroot-startstop.sh start'
[Mar 22 06:55:41.586] testcase_run.sh[803]: error: found processes still running as 'domjudge-run-0', check manually:
  863 program <defunct>
  864 program <defunct>
  865 program <defunct>
  866 program <defunct>
  867 program <defunct>
  868 program <defunct>
  869 program <defunct>
  870 program <defunct>
  871 program <defunct>
  872 program <defunct>
  873 program <defunct>
  874 program <defunct>
  875 program <defunct>
  876 program <defunct>
  877 program <defunct>
  878 program <defunct>
  879 program <defunct>
  880 program <defunct>
  881 program <defunct>
  882 program <defunct>
  883 program <defunct>
  884 program <defunct>
  885 program <defunct>
  886 program <defunct>
  887 program <defunct>
  888 program <defunct>
  889 program <defunct>
  890 program <defunct>
  891 program <defunct>
  892 program <defunct>
  893 program <defunct>
  894 program <defunct>
  895 program <defunct>
  896 program <defunct>
  897 program <defunct>
  898 program <defunct>
  899 program <defunct>
  900 program <defunct>
  901 program <defunct>
  902 program <defunct>
  903 program <defunct>
  904 program <defunct>
  905 program <defunct>
  906 program <defunct>
  907 program <defunct>
  908 program <defunct>
  909 program <defunct>
  910 program <defunct>
  911 program <defunct>
  912 program <defunct>
  913 program <defunct>
  914 program <defunct>
  915 program <defunct>
  916 program <defunct>
  917 program <defunct>
  918 program <defunct>
  919 program <defunct>
  920 program <defunct>
  921 program <defunct>
  922 program <defunct>
  923 program <defunct>
  924 program <defunct>
  925 program <defunct>
[Mar 22 06:55:41.596] judgedaemon[29]: error: Unknown exitcode from testcase_run.sh for s13, testcase 1: 127

The container exits and the judgehost is down. Shouldn't runguard protect against stuff like this?

nickygerritsen commented 6 years ago

It should use runguard indeed, so I'm not sure. @meisterT do you have any clue?

meisterT commented 6 years ago

/cc @eldering

I don't know why this happens. I just tried on my local machine (without using Docker) and runguard did limit the number of processes as expected. What value do you have configured for "Process limit"?
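For readers unfamiliar with per-process limits, here is a minimal sketch of one generic way to cap the number of processes a program may create on Linux: setting RLIMIT_NPROC in a child before exec. It is not taken from runguard's source, and run_with_process_limit is a hypothetical helper name. Note that the kernel counts processes per real user ID and does not enforce this limit for root or similarly privileged processes.

```python
import os
import resource


def run_with_process_limit(argv, max_procs):
    """Run argv in a child whose RLIMIT_NPROC is capped; return its exit code.

    Hypothetical helper for illustration only, not runguard's implementation.
    """
    pid = os.fork()
    if pid == 0:
        # Child: lower the limit, then replace ourselves with the target program.
        try:
            _soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
            if hard != resource.RLIM_INFINITY:
                # An unprivileged process cannot raise its hard limit.
                max_procs = min(max_procs, hard)
            resource.setrlimit(resource.RLIMIT_NPROC, (max_procs, max_procs))
            os.execvp(argv[0], argv)
        finally:
            # Only reached if setting the limit or the exec failed.
            os._exit(127)
    # Parent: wait for the child so that it does not linger as a zombie.
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)
```

With a limit like this in place, a fork() that would exceed the cap fails with EAGAIN instead of creating another process. The helper needs Python 3.9 or later for os.waitstatus_to_exitcode.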

greg commented 6 years ago

The default process limit is 64; I didn't realise that was an available setting. Should I set it to 1 or 0?

eldering commented 6 years ago

No, don't lower it that much. The limit includes the shell script and potentially other programs that wrap the actual solution (e.g. the JVM with its multiple threads), so lowering it to anything below 5-10 puts you at risk of random crashes from running out of threads.

Runguard should enforce that limit and should kill any remaining processes once the judging run is done, so the error "found processes still running" is definitely a sign that something is not functioning as expected. I have no experience with Docker, so it's difficult for me to say exactly what is wrong. However, those processes are all "defunct" (i.e. zombie processes, see https://en.wikipedia.org/wiki/Zombie_process), so maybe they don't get properly reaped by init (pid=1) inside the container?
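To make the "defunct" state concrete, here is a minimal Linux-only sketch that creates a zombie on purpose and inspects it through /proc. The helper names child_state and make_zombie are made up for this illustration; nothing here is DOMjudge code.

```python
import os


def child_state(pid):
    """Return the one-letter process state from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The state field comes right after the command name, which is in parentheses.
    return data.rsplit(")", 1)[1].split()[0]


def make_zombie():
    """Fork a child that exits at once; the caller deliberately does not wait yet."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: terminate immediately
    return pid       # parent: the child becomes a zombie once it has exited
```

Shortly after make_zombie() returns, child_state(pid) reports "Z", and ps would list the process as <defunct>. The entry disappears only when the parent calls os.waitpid(pid, 0), or when the parent exits and whichever process adopts the child waits on it.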

vmcj commented 3 years ago

@greg can you still reproduce this issue? I would like to fix it for you and possibly other users who have this issue.

And if you can reproduce it, do you get the same behaviour when you run the judgehost container interactively?

uint0 commented 3 years ago

We've been running into similar issues with forked processes and have a potential solution. For reference, we are submitting the following Python script, which forks fewer than 64 times and produces the same output as described above. Interestingly, the issue doesn't occur if you execute bash in the container and then run start.sh manually.

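import os

# Fork 10 children from the parent; each child leaves the loop right away.
for _ in range(10):
    if os.fork() == 0:  # os.fork() returns 0 in the child process
        break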

Here's what I think is happening:

  1. The Docker container starts and runs /scripts/start.sh as PID 1.
  2. start.sh execs judgehost, so judgehost becomes PID 1.
  3. A submission is made and runguard executes the program, which forks and creates children.
  4. The children aren't waited on appropriately (this step is guesswork, but zombie children are produced regardless).
  5. runguard exits, so the zombie children are reassigned to PID 1.
  6. judgehost/php doesn't reap them, as it isn't designed to have arbitrary children assigned to it.
  7. testcase_run.sh finds the leftover processes and exits with an error.

The solution is to use an init system such as dumb-init (see PR #83), or to figure out why the zombie processes are being spawned.
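As a rough illustration of the reaping duty that PID 1 is expected to perform, here is a minimal sketch of a non-blocking reap loop. It is not dumb-init's actual code, and reap_exited_children is a hypothetical name chosen for this example.

```python
import os


def reap_exited_children():
    """Collect every child that has already exited and return their PIDs."""
    reaped = []
    while True:
        try:
            # -1 means "any child"; WNOHANG makes the call return instead of blocking.
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped
```

An init process would typically run a loop like this whenever it receives SIGCHLD, so that orphans reassigned to it are cleaned up instead of accumulating as zombies.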

vmcj commented 3 years ago

The PR https://github.com/DOMjudge/domjudge-packaging/pull/83 indeed solved the issue.