Optimize `runpipe` - Githubissues

RagnarGrootKoerkamp commented 3 years ago

See https://github.com/DOMjudge/domjudge/blame/main/judge/runpipe.cc#L54. (CC @eldering @meisterT @mvr320 ).

When an interactive problem needs to pipe O(megabytes) of data (in one direction), this is done in chunks of 512 bytes. After each chunk, the stream in the other direction in checked with a 1ms timeout. Doing this O(1000) times results in wallclock timeouts without the submission being able to do any real compute.

Also note that it alternates between trying the pipes in the two directions.

I see a few possible solutions. Happy to implement one but not sure what is preferred:

Decrease the timeout from 1000us to 1us. Would this cause issues otherwise? (Now the script becomes a busy wait loop polling all inputs very often, consuming more CPU.)
Increase the buffer size. Currently the comment says it's the POSIX minimum value of pipes, but I'm not sure how this matters. (And I think the pipes are set to the max of 1MB anyway.) Reading/writing 1MB of data should be fine anyway right? If the team-input pipe fills up, that can happen either in 512byte chunks of 1MB chunks, but if it's filling up the chunk size doesn't matter anyway.
When the number of bytes read (nread) is 512 exactly, we can retry pumping from the same input, instead of switching to the pipe in the other direction.
Multithread the program, and do the two (or more) streams in parallel, so that blocking code can be used instead of non-blocking code with timeouts.

General remark: The whole file is hard to parse and understand because it's all pretty low level C. Maybe it could be rewritten to something newer? (Although maybe it's inevitable because of all the system calls, file descriptors, signal handling, ...)

For reference, in BAPCtools I do:

By default: create two pipes and connect the interactor and submission directly to them. No intercepting of the data through the pipe is happening.
If needed: Create a total of 4 pipes:
- interactor -> interceptor -> team
- team -> interceptor -> interactor I then launch two instances of this interceptor and pipe both their stderr to the same file to get a log of what's happening. The interceptor is just some very short python code that reads 1 char at a time and copies it to stdout and stderr.

meisterT commented 3 years ago

Just a quick comment before I dive too deep: can you try reading from the same pipe again if there was data?

RagnarGrootKoerkamp commented 3 years ago

Yeah I think this should work, see point 3. (Not sure whether the condition should 'read any data' or 'read a full buffer of data'. I don't know whether it's guaranteed the buffer will be full if there is enough data pending.)

eldering commented 3 years ago

I see a few possible solutions. Happy to implement one but not sure what is preferred:
1. Decrease the timeout from `1000us` to `1us`. Would this cause issues otherwise? (Now the script becomes a busy wait loop polling all inputs very often, consuming more CPU.)

I think it would be possible to reduce the timeout, as the runpipe program is executed outside the sandbox environment created by runguard, so it would not count towards CPU usage of the submission. However, it will contribute to wallclock time, and I don't know how very frequent polling might affect scheduling of tasks and somehow indirectly submission performance. Thus, I'd be careful and maybe try reducing it to something in the range of 10-100 usec first. A polling frequency of 10us would allow a transfer speed of 48 MB/s (excluding overhead), so that sounds like more than enough.

2. Increase the buffer size. Currently the comment says it's the POSIX minimum value of pipes, but I'm not sure how this matters. (And I think the pipes are set to the max of 1MB anyway.) Reading/writing 1MB of data should be fine anyway right? If the team-input pipe fills up, that can happen either in 512byte chunks of 1MB chunks, but if it's filling up the chunk size doesn't matter anyway.

I'm not sure why I used PIPE_BUF = 512, but I assume because man 7 pipe says that that's the maximum value that's guaranteed to be written atomically.

3. When the number of bytes read (`nread`) is `512` exactly, we can retry pumping from the same input, instead of switching to the pipe in the other direction.

That is definitely an option to try.

4. Multithread the program, and do the two (or more) streams in parallel, so that blocking code can be used instead of non-blocking code with timeouts.

I think this might be the best solution as it would make the code cleaner. It does mean overhauling the runpipe program a fair bit though and introduce more subtleties with having to open/close pipe FDs in the right processes and waiting for them. Also the way the code currently detects that the jury code has finished before the submission might become a bit more tricky.

General remark: The whole file is hard to parse and understand because it's all pretty low level C. Maybe it could be rewritten to something newer? (Although maybe it's inevitable because of all the system calls, file descriptors, signal handling, ...)

I think that doing this low-level pipe manipulation stuff properly would not be very easy to read in any language. My typical impression is that in other languages it may seem cleaner since these offer a simplified interface, while we do most likely need much of the intricate options. Also, I prefer C since the man-pages are very precise (even though also take effort to read).

For reference, in BAPCtools I do:

* By default: create two pipes and connect the interactor and submission directly to them. No intercepting of the data through the pipe is happening.

* If needed: Create a total of 4 pipes:

  * interactor -> interceptor -> team
  * team -> interceptor -> interactor
    I then [launch two instances of this interceptor](https://github.com/RagnarGrootKoerkamp/BAPCtools/blob/master/bin/interactive.py#L168-L197) and pipe both their `stderr` to the same file to get a log of what's happening. The interceptor is just some very short python code that reads 1 char at a time and copies it to stdout and stderr.

As a general remark, I think it would be good to write some more test sources for interactive problems of this form that try to exhibit all kinds of behaviours such as writing a lot of data, writing data byte-by-byte slowly, etc.

meisterT commented 2 years ago

@tuupke is working on option 4.

vmcj commented 2 years ago

@RagnarGrootKoerkamp runpipe is now replaced with a new implementation which we hope has better performance so I'll close this issue.

DOMjudge / domjudge

Optimize `runpipe` #1237