containerd / nerdctl

contaiNERD CTL - Docker-compatible CLI for containerd, with support for Compose, Rootless, eStargz, OCIcrypt, IPFS, ...
Apache License 2.0

Messages too big for the internal logging cause the process to block on stdout/stderr #3343

Closed CerberusQc closed 2 weeks ago

CerberusQc commented 3 weeks ago

Description

When the container's process writes a message that is too big to stdout/stderr, the process hangs forever without crashing.

I created a small container to debug this, but it originally happened with a game server started with nerdctl.

Steps to reproduce the issue

  1. Start a small Python Flask server: nerdctl run -d -p 5555:5555 cerberusqc/messager:1.2
  2. Navigate to http://localhost:5555
  3. Enter a 100000-byte message 2-3 times; the process hangs completely

Describe the results you received and expected

I expect logging to be able to process a message of any size written to stdout/stderr without blocking my main process.

I think I have identified the problem; it probably comes from pkg/logging/logging.go:

    stdout := make(chan string, 10000)
    stderr := make(chan string, 10000)
    processLogFunc := func(reader io.Reader, dataChan chan string) {
        defer wg.Done()
        defer close(dataChan)
        scanner := bufio.NewScanner(reader)
        for scanner.Scan() {
            if scanner.Err() != nil {
                log.L.Errorf("failed to read log: %v", scanner.Err())
                return
            }
            dataChan <- scanner.Text()
        }
    }

I did a quick test, replacing it with a dynamic buffer, just to confirm:

    stdout := make(chan string)
    stderr := make(chan string)
    processLogFunc := func(reader io.Reader, dataChan chan string) {
        defer wg.Done()
        defer close(dataChan)

        buf := make([]byte, 0, 4096)  // A dynamic buffer
        byteBuf := make([]byte, 1024) // Read in chunks

        for {
            n, err := reader.Read(byteBuf)
            if n > 0 {
                // Append the new data to the buffer
                buf = append(buf, byteBuf[:n]...)

                // Process complete lines in the buffer
                for {
                    index := bytes.IndexByte(buf, '\n')
                    if index == -1 {
                        // No complete line found, exit the inner loop
                        break
                    }

                    // Extract the complete line (including the newline)
                    line := buf[:index+1]

                    // Send the line to the logger and channel
                    dataChan <- string(line)

                    // Remove the processed line from the buffer
                    buf = buf[index+1:]
                }
            }

            if err != nil {
                if err != io.EOF {
                    log.L.Errorf("Failed to read log: %v", err)
                }
                break // Exit the loop on error or EOF
            }
        }

        // Process any remaining data in the buffer
        if len(buf) > 0 {
            remaining := string(buf)
            dataChan <- remaining
        }
    }

It now looks like it processes logs of any size without blocking the main process, but I definitely don't have the experience in Go to know if this is the real reason.

What version of nerdctl are you using?

1.7.6

Are you using a variant of nerdctl? (e.g., Rancher Desktop)

None

Host information

Linode VM (Dedicated 16 GB): 8 vCPU, 16 GB RAM, Ubuntu 22.04 LTS

CONTAINERD_VERSION="1.7.19" RUNC_VERSION="1.1.13" CNI_VERSION="1.5.1"

apostasie commented 3 weeks ago

Notes

while true; do curl -H "Content-Type: application/x-www-form-urlencoded" -d "message_size=10000"  -X POST http://127.0.0.1:5555; done

will reproduce it rapidly ^.

Might be worth noting as well that logs are actually written successfully (to the json log file), and that doesn't hang.

 tail -f ~/.local/share/nerdctl/1935db59/containers/default/XXX/XXX-json.log

Displaying them with nerdctl logs -f will definitely hang though - but just breaking and calling nerdctl logs -f again will resume appropriately.

I know next to nothing about the logging here, so take whatever I say with a grain of salt, but it feels to me the problem would rather be somewhere in viewLogsJSONFileDirect (and I have no idea why your change would "fix" it, @CerberusQc).

Will look more on my side.

apostasie commented 3 weeks ago

Taking that back.

I cannot repro.

nerdctl logs -f will hang for some time, but given enough additional messages, it will resume.

@CerberusQc could you try on your side with my ridiculous curl loop and see what you get?

(not saying there is no bug - the "freeze" is concerning - but it does not feel like it gets hung forever)

CerberusQc commented 3 weeks ago

Hey @apostasie, thanks for the follow-up. Small messages are processed correctly; this only happens when you send bigger messages. Originally it happened on a Factorio game server, where a certain state causes a huge JSON blob to be dumped in the server logs, and that left the whole server stuck (it could be stuck for days) until I restarted the container.

while true; do curl -H "Content-Type: application/x-www-form-urlencoded" -d "message_size=100000" -X POST http://127.0.0.1:5555; done

(note the added 0 in the message size) With bigger messages it gets stuck instantly; I have waited over 10 minutes now and the process is still stuck.

I did some tests, and messages get stuck around 64 * 1024 bytes (which is exactly the default MaxScanTokenSize of the bufio library).

from bufio/scan.go

const (
    // MaxScanTokenSize is the maximum size used to buffer a token
    // unless the user provides an explicit buffer with [Scanner.Buffer].
    // The actual maximum token size may be smaller as the buffer
    // may need to include, for instance, a newline.
    MaxScanTokenSize = 64 * 1024

    startBufSize = 4096 // Size of initial allocation for buffer.
)

// NewScanner returns a new [Scanner] to read from r.
// The split function defaults to [ScanLines].
func NewScanner(r io.Reader) *Scanner {
    return &Scanner{
        r:            r,
        split:        ScanLines,
        maxTokenSize: MaxScanTokenSize,
    }
}

This would explain why my "fix" works, since I am completely bypassing the bufio.Scanner and using my own buffering.

CerberusQc commented 3 weeks ago

Just did another test just to confirm:

"message_size=65000" works perfectly for over 5 minutes.
"message_size=66000" gets stuck on the first loop.
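That empirical boundary matches the constant quoted from scan.go: with a default Scanner, a line survives only if the line plus its newline fits in 64 KiB. A standalone check (hypothetical helper name):

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// lineSurvives reports whether a single n-byte line makes it through a
// bufio.Scanner with default settings.
func lineSurvives(n int) bool {
	sc := bufio.NewScanner(strings.NewReader(strings.Repeat("x", n) + "\n"))
	return sc.Scan()
}

func main() {
	fmt.Println(lineSurvives(64*1024 - 1)) // true: line + '\n' fits in 64 KiB
	fmt.Println(lineSurvives(64 * 1024))   // false: buffer fills before a newline is seen
}
```

So 65000 bytes sits just under the 65535-byte cutoff and 66000 just over it, which is exactly the behavior observed above.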

apostasie commented 3 weeks ago

At least today I learned that 10000 is not 100000. 🇫🇷

apostasie commented 3 weeks ago

More seriously, let's fix this. @CerberusQc what about you just send a PR? Maintainers here will be able to have a look and chip in.

CerberusQc commented 3 weeks ago

I am a rookie in Go, so I have no idea if my fix is a good solution, but yes, I can do a PR tomorrow (it's almost 1 a.m. here) and I guess the maintainers will guide me. I am not familiar with the process; is there a guideline or something I need to follow for a public repo like nerdctl? Thanks!

apostasie commented 3 weeks ago

I am a rookie in Go

We all started there.

so I have no idea if my fix is a good solution, but yes, I can do a PR tomorrow (it's almost 1 a.m. here) and I guess the maintainers will guide me. I am not familiar with the process; is there a guideline or something I need to follow for a public repo like nerdctl? Thanks!

You can read https://github.com/containerd/project/blob/main/CONTRIBUTING.md if you want.

The gist of it is really simple:

CerberusQc commented 2 weeks ago

Hey @apostasie, some news here: before trying to do my own implementation, I decided to read the bufio documentation, and it appears this is not a bug but a "feature". The Scanner has a fixed maximum token size, and when a line exceeds it, scanning stops unrecoverably: the Scanner is left in an error state, produces no further tokens, and the process hangs silently.

Same problem in 2018

Blog post about this

Documentation

The documentation state the following:

Scanning stops unrecoverably at EOF, the first I/O error, or a token too large to fit in the Scanner.Buffer. When a scan stops, the reader may have advanced arbitrarily far past the last token. Programs that need more control over error handling or large tokens, or must run sequential scans on a reader, should use bufio.Reader instead.

So basically, the solution would be to switch to Reader instead of Scanner. I did a quick test with:

    processLogFunc := func(reader io.Reader, dataChan chan string) {
        defer wg.Done()
        defer close(dataChan)
        scanner := bufio.NewReader(reader)

        for {
            s, err := scanner.ReadString('\n')
            if err != nil {
                log.L.Errorf("failed to read log: %v", err)
                return
            }
            dataChan <- s
        }
    }

And it was working fine. I will do a PR with this and link this issue.

apostasie commented 2 weeks ago

@CerberusQc really nice catch. Who knew that reading documentation could actually help with this software thing. :p

I just had a very quick look at other uses of bufio.Scanner in the codebase. We might want to investigate these - especially the soci and cosign wrappers.

About your code above, you should test for EOF - right now, you are dropping the last read, right? Anyhow, let's move to PR when you feel ready so we can discuss code inline.

CerberusQc commented 2 weeks ago

@apostasie thanks again -> https://github.com/containerd/nerdctl/pull/3366 I have added handling for the EOF; I'm not sure it covers every case, but the unit test fails on the current master and passes with this fix.

apostasie commented 2 weeks ago

Fixed by #3366

@CerberusQc you are a contributor now 🎉

Can you close this ticket as completed?

CerberusQc commented 2 weeks ago

Hey! insert proud face here Big thanks again for guiding me through this process @apostasie .

This was my first time, and since contributing can be intimidating, I was always frightened to contribute to a project. I will try to be more active in the open source community.

Can't wait to update my version of nerdctl with the official release to fix my factorio server :D

apostasie commented 2 weeks ago

Hey! insert proud face here Big thanks again for guiding me through this process @apostasie .

You are welcome. ;)

This was my first time, and since contributing can be intimidating, I was always frightened to contribute to a project. I will try to be more active in the open source community.

Pick the communities that are friendly AND active (if I am going to take shit, or crickets for months, I am usually forking for my own use instead). In most cases: opening good, detailed bug reports with reproducers, and sending decently structured patches with tests / intelligible commits, will be highly welcomed by most projects.

Can't wait to update my version of nerdctl with the official release to fix my factorio server :D

I am into "Satisfactory" :-).