Peter Portante, May 2021
Based on the v2.0.28 (GitHub tag) code base
conmon uses non-blocking pipes for reading for the stdout & stderr pipes of a container
a. Pipes created at lines 163 [3] and 178 [4] of conmon.c using the pipe2() [5] system call, with the O_CLOEXEC flag
Pipes are 64 KB in size, the Linux default
Does Go change the size of the pipes dynamically?
b. Registers stdio_cb(), line 17 [1] of ctr_stdio.c for poll() events
In response to being invoked, stdio_cb() in turn calls read_stdio() [2], which reads from the pipe file descriptor using a STDIO_BUF_SIZE sized buffer (+2 as described at [7])
a. Where STDIO_BUF_SIZE is 8 KB [6], for a total of 8,192 bytes used in memory
b. [Flag] This means that 8 read system calls are required to drain the pipe
[Question] Does the Go integration change the pipe size to 8,192 bytes?
The read_stdio() only makes one read(2) system call
[Question] Why is only one read(2) system call made? Does that allow the polling loop to fairly round-robin between reading from stdout and stderr?
[Flag] We should consider having the read_stdio() continue reading until read(2) returns EWOULDBLOCK
[Flag] The STDIO_BUF_SIZE should be 64 KB to allow the most efficient use of system calls
We should even consider increasing the size to the maximum allowed on the system to further increase the efficiency, if we can implement https://github.com/cri-o/cri-o/pull/1605.
c. [Flag] This also has the effect that 8 KB reads increases the number of "partial" log records emitted by conmon, since the pipe can fill up while conmon is processing a read.
d. The read buffer is a stack local variable in read_stdio(), but the target address used in the read(2) system call is moved by one byte [7], which means any reads returning >= 4 KB of data will result in 2 pages being used on the stack, with a maximum of 3 pages used
[Flag] This is only used for the remote consoles case [8], which means you could avoid the extra page use having write_back_to_remote_consoles() write the extra byte only when remote consoles are present
d. When 1 or more bytes are returned by the read(2) system call in read_stdio():
write_to_logs() is invoked to write the logs to disk, which in turn calls write_journald() and write_k8s_log()
For now, we ignore write_journald() for the rest of this analysis
It is expected that the entire buffer returned by the read(2) system call is processed and written to disk
write_back_to_remote_consoles() is called to satisfy any remote consoles
For now, we ignore this case for the rest of this analysis
Analysis of write_k8s_log()
a. The approach taken by write_k8s_log() is to avoid memory copies of the data to be written to disk by using IO vectors with the writev(2) system call
b. 128 vectors [9] are allocated [10] as a maximum to be used while processing any one buffer
c. write_k8s_log() creates a time stamp once at the beginning and uses it as the time stamp value for all the logs it processes and writes
d. A "log" is defined to be all the bytes from the previous "newline" character up to, and including, the next newline character
At the end of the buffer, if no newlines are found, a special marker is used to record that a "partial" log record was processed
The start of a buffer assumes the beginning of a new log entry
f. A fourth vector is used when no "newline" characters are found in the remaining bytes of the buffer being processed
g. For every "newline" character encountered, 46 bytes are added to the output stream
h. 128 vectors can represent a maximum of 42 log entries (2 vectors unused)
For an 8,192 byte buffer, that is an average size of 195.048 bytes per line in order for one writev() system call to be used to write all bytes read to disk.
If the average line length is >= 196 bytes, fewer vectors are required.
If the average line length is < 196 bytes and >= 98 bytes, 1 additional writev() system calls will be required: 2 writev() calls per read().
If the average line length is < 98 bytes and >= 66 bytes, 2 additional writev() system calls will be required: 3 writev() calls per read().
If the average line length is < 66 bytes and >= 49 bytes, 3 additional writev() system calls will be required: 4 writev() calls per read().
Recommend 1,024 IO vectors to handle >= 25 byte average lines with 1 system call, and >= 13 byte average lines with 2 system calls
The IO Vector data structure consists of a 4 byte int and an array of 16 byte struct iovec entries, takikng up 4 pages of the stack
Really, we should be implementing an alternate format of writing what was read from the pipe directly to disk so that the data stream from either pipe does not impact the behavior of conmon, see https://github.com/cri-o/cri-o/pull/1605
Here is a table how this works out as the average line length shrinks (byte ranges, and number of writev(2) system calls):
A detailed analysis of the container logging behavior in
conmon
as of tagv2.0.28
.Summary recommendations:
pipe(7)
size (64 KB by default on Linux)pipe(7)
untilEWOULDBLOCK
writev(2)
system calls per bufferread(2)
An Analysis of Conmon Container Logging Behaviors
Peter Portante, May 2021 Based on the
v2.0.28
(GitHub tag) code baseconmon
uses non-blocking pipes for reading for thestdout
&stderr
pipes of a containera. Pipes created at lines 163 [3] and 178 [4] of
conmon.c
using thepipe2()
[5] system call, with theO_CLOEXEC
flagPipes are 64 KB in size, the Linux default
Does
Go
change the size of the pipes dynamically?b. Registers
stdio_cb()
, line 17 [1] ofctr_stdio.c
forpoll()
eventsIn response to being invoked,
stdio_cb()
in turn callsread_stdio()
[2], which reads from the pipe file descriptor using aSTDIO_BUF_SIZE
sized buffer (+2 as described at [7])a. Where
STDIO_BUF_SIZE
is 8 KB [6], for a total of 8,192 bytes used in memoryb. [Flag] This means that 8 read system calls are required to drain the pipe
[Question] Does the
Go
integration change the pipe size to 8,192 bytes?The
read_stdio()
only makes oneread(2)
system call[Question] Why is only one
read(2)
system call made? Does that allow the polling loop to fairly round-robin between reading fromstdout
andstderr
?read_stdio()
continue reading untilread(2)
returnsEWOULDBLOCK
[Flag] The
STDIO_BUF_SIZE
should be 64 KB to allow the most efficient use of system callsc. [Flag] This also has the effect that 8 KB reads increases the number of "partial" log records emitted by
conmon
, since the pipe can fill up whileconmon
is processing a read.d. The read buffer is a stack local variable in
read_stdio()
, but the target address used in theread(2)
system call is moved by one byte [7], which means any reads returning >= 4 KB of data will result in 2 pages being used on the stack, with a maximum of 3 pages usedwrite_back_to_remote_consoles()
write the extra byte only when remote consoles are presentd. When 1 or more bytes are returned by the
read(2)
system call inread_stdio()
:write_to_logs()
is invoked to write the logs to disk, which in turn callswrite_journald()
andwrite_k8s_log()
For now, we ignore
write_journald()
for the rest of this analysisIt is expected that the entire buffer returned by the
read(2)
system call is processed and written to diskwrite_back_to_remote_consoles()
is called to satisfy any remote consolesAnalysis of
write_k8s_log()
a. The approach taken by
write_k8s_log()
is to avoid memory copies of the data to be written to disk by using IO vectors with thewritev(2)
system callb. 128 vectors [9] are allocated [10] as a maximum to be used while processing any one buffer
c.
write_k8s_log()
creates a time stamp once at the beginning and uses it as the time stamp value for all the logs it processes and writesd. A "log" is defined to be all the bytes from the previous "newline" character up to, and including, the next newline character
At the end of the buffer, if no newlines are found, a special marker is used to record that a "partial" log record was processed
The start of a buffer assumes the beginning of a new log entry
e. When a new line is found, 3 vectors are used:
time stamp buffer [12], 44 bytes [11]
partial / full tag, 2 bytes [13]
actual log contents, N bytes [14]
f. A fourth vector is used when no "newline" characters are found in the remaining bytes of the buffer being processed
g. For every "newline" character encountered, 46 bytes are added to the output stream
h. 128 vectors can represent a maximum of 42 log entries (2 vectors unused)
For an 8,192 byte buffer, that is an average size of 195.048 bytes per line in order for one
writev()
system call to be used to write all bytes read to disk.If the average line length is >= 196 bytes, fewer vectors are required.
If the average line length is < 196 bytes and >= 98 bytes, 1 additional
writev()
system calls will be required: 2writev()
calls perread()
.If the average line length is < 98 bytes and >= 66 bytes, 2 additional
writev()
system calls will be required: 3writev()
calls perread()
.If the average line length is < 66 bytes and >= 49 bytes, 3 additional
writev()
system calls will be required: 4writev()
calls perread()
.Recommend 1,024 IO vectors to handle >= 25 byte average lines with 1 system call, and >= 13 byte average lines with 2 system calls
int
and an array of 16 bytestruct iovec
entries, takikng up 4 pages of the stackReally, we should be implementing an alternate format of writing what was read from the pipe directly to disk so that the data stream from either pipe does not impact the behavior of
conmon
, see https://github.com/cri-o/cri-o/pull/1605Here is a table how this works out as the average line length shrinks (byte ranges, and number of
writev(2)
system calls):[1] https://github.com/containers/conmon/blob/v2.0.28/src/ctr_stdio.c#L17 [2] https://github.com/containers/conmon/blob/v2.0.28/src/ctr_stdio.c#L101 [3] https://github.com/containers/conmon/blob/v2.0.28/src/conmon.c#L163 [4] https://github.com/containers/conmon/blob/v2.0.28/src/conmon.c#L178 [5] https://man7.org/linux/man-pages/man2/pipe.2.html [6] https://github.com/containers/conmon/blob/v2.0.28/src/config.h#L6 [7] https://github.com/containers/conmon/blob/v2.0.28/src/ctr_stdio.c#L103 [8] https://github.com/containers/conmon/blob/v2.0.28/src/ctr_stdio.c#L131 [9] https://github.com/containers/conmon/blob/v2.0.28/src/utils.h#L224 [10] https://github.com/containers/conmon/blob/v2.0.28/src/ctr_logging.c#L60 [11] https://github.com/containers/conmon/blob/v2.0.28/src/ctr_logging.c#L22 [12] https://github.com/containers/conmon/blob/v2.0.28/src/ctr_logging.c#L330 [13] https://github.com/containers/conmon/blob/v2.0.28/src/ctr_logging.c#L342 [14] https://github.com/containers/conmon/blob/v2.0.28/src/ctr_logging.c#L349