containers / conmon

An OCI container runtime monitor.
Apache License 2.0
420 stars 127 forks source link

An Analysis of Conmon Container Logging Behaviors with Recommendations #262

Open portante opened 3 years ago

portante commented 3 years ago

A detailed analysis of the container logging behavior in conmon as of tag v2.0.28.

Summary recommendations:


An Analysis of Conmon Container Logging Behaviors

Peter Portante, May 2021 Based on the v2.0.28 (GitHub tag) code base

  1. conmon uses non-blocking pipes for reading for the stdout & stderr pipes of a container

    a. Pipes created at lines 163 [3] and 178 [4] of conmon.c using the pipe2() [5] system call, with the O_CLOEXEC flag

    • Pipes are 64 KB in size, the Linux default

    • Does Go change the size of the pipes dynamically?

    b. Registers stdio_cb(), line 17 [1] of ctr_stdio.c for poll() events

  2. In response to being invoked, stdio_cb() in turn calls read_stdio() [2], which reads from the pipe file descriptor using a STDIO_BUF_SIZE sized buffer (+2 as described at [7])

    a. Where STDIO_BUF_SIZE is 8 KB [6], for a total of 8,192 bytes used in memory

    b. [Flag] This means that 8 read system calls are required to drain the pipe

    • [Question] Does the Go integration change the pipe size to 8,192 bytes?

    • The read_stdio() only makes one read(2) system call

      • [Question] Why is only one read(2) system call made? Does that allow the polling loop to fairly round-robin between reading from stdout and stderr?

        • [Flag] We should consider having the read_stdio() continue reading until read(2) returns EWOULDBLOCK
    • [Flag] The STDIO_BUF_SIZE should be 64 KB to allow the most efficient use of system calls

    c. [Flag] This also has the effect that 8 KB reads increases the number of "partial" log records emitted by conmon, since the pipe can fill up while conmon is processing a read.

    d. The read buffer is a stack local variable in read_stdio(), but the target address used in the read(2) system call is moved by one byte [7], which means any reads returning >= 4 KB of data will result in 2 pages being used on the stack, with a maximum of 3 pages used

    • [Flag] This is only used for the remote consoles case [8], which means you could avoid the extra page use having write_back_to_remote_consoles() write the extra byte only when remote consoles are present

    d. When 1 or more bytes are returned by the read(2) system call in read_stdio():

    • write_to_logs() is invoked to write the logs to disk, which in turn calls write_journald() and write_k8s_log()

      • For now, we ignore write_journald() for the rest of this analysis

      • It is expected that the entire buffer returned by the read(2) system call is processed and written to disk

    • write_back_to_remote_consoles() is called to satisfy any remote consoles

      • For now, we ignore this case for the rest of this analysis
  3. Analysis of write_k8s_log()

    a. The approach taken by write_k8s_log() is to avoid memory copies of the data to be written to disk by using IO vectors with the writev(2) system call

    b. 128 vectors [9] are allocated [10] as a maximum to be used while processing any one buffer

    c. write_k8s_log() creates a time stamp once at the beginning and uses it as the time stamp value for all the logs it processes and writes

    d. A "log" is defined to be all the bytes from the previous "newline" character up to, and including, the next newline character

    • At the end of the buffer, if no newlines are found, a special marker is used to record that a "partial" log record was processed

    • The start of a buffer assumes the beginning of a new log entry

    e. When a new line is found, 3 vectors are used:

    1. time stamp buffer [12], 44 bytes [11]

    2. partial / full tag, 2 bytes [13]

    3. actual log contents, N bytes [14]

    f. A fourth vector is used when no "newline" characters are found in the remaining bytes of the buffer being processed

    g. For every "newline" character encountered, 46 bytes are added to the output stream

    h. 128 vectors can represent a maximum of 42 log entries (2 vectors unused)

    • For an 8,192 byte buffer, that is an average size of 195.048 bytes per line in order for one writev() system call to be used to write all bytes read to disk.

    • If the average line length is >= 196 bytes, fewer vectors are required.

    • If the average line length is < 196 bytes and >= 98 bytes, 1 additional writev() system calls will be required: 2 writev() calls per read().

    • If the average line length is < 98 bytes and >= 66 bytes, 2 additional writev() system calls will be required: 3 writev() calls per read().

    • If the average line length is < 66 bytes and >= 49 bytes, 3 additional writev() system calls will be required: 4 writev() calls per read().

    • Recommend 1,024 IO vectors to handle >= 25 byte average lines with 1 system call, and >= 13 byte average lines with 2 system calls

      • The IO Vector data structure consists of a 4 byte int and an array of 16 byte struct iovec entries, takikng up 4 pages of the stack
    • Really, we should be implementing an alternate format of writing what was read from the pipe directly to disk so that the data stream from either pipe does not impact the behavior of conmon, see https://github.com/cri-o/cri-o/pull/1605

    • Here is a table how this works out as the average line length shrinks (byte ranges, and number of writev(2) system calls):

      < 49 >= 40   5
      < 40 >= 33   6
      < 33 >= 28   7
      < 28 >= 25   8
      < 25 >= 22   9
      < 22 >= 20  10
      < 20 >= 18  11
              17  12
              16  13
         15 & 14  14
              13  16
              12  17
              11  18
              10  20
               9  22
               8  25
               7  28
               6  33
               5  40
               4  49
               3  66
               2  98
               1 196

[1] https://github.com/containers/conmon/blob/v2.0.28/src/ctr_stdio.c#L17 [2] https://github.com/containers/conmon/blob/v2.0.28/src/ctr_stdio.c#L101 [3] https://github.com/containers/conmon/blob/v2.0.28/src/conmon.c#L163 [4] https://github.com/containers/conmon/blob/v2.0.28/src/conmon.c#L178 [5] https://man7.org/linux/man-pages/man2/pipe.2.html [6] https://github.com/containers/conmon/blob/v2.0.28/src/config.h#L6 [7] https://github.com/containers/conmon/blob/v2.0.28/src/ctr_stdio.c#L103 [8] https://github.com/containers/conmon/blob/v2.0.28/src/ctr_stdio.c#L131 [9] https://github.com/containers/conmon/blob/v2.0.28/src/utils.h#L224 [10] https://github.com/containers/conmon/blob/v2.0.28/src/ctr_logging.c#L60 [11] https://github.com/containers/conmon/blob/v2.0.28/src/ctr_logging.c#L22 [12] https://github.com/containers/conmon/blob/v2.0.28/src/ctr_logging.c#L330 [13] https://github.com/containers/conmon/blob/v2.0.28/src/ctr_logging.c#L342 [14] https://github.com/containers/conmon/blob/v2.0.28/src/ctr_logging.c#L349

portante commented 3 years ago

I'll be updating the description to full mark down shortly.