Closed CerberusQc closed 2 weeks ago
Notes
while true; do curl -H "Content-Type: application/x-www-form-urlencoded" -d "message_size=10000" -X POST http://127.0.0.1:5555; done
will reproduce rapidly ^.
Might be worth noting as well that logs are actually written successfully (to the json log file), and that doesn't hang.
tail -f ~/.local/share/nerdctl/1935db59/containers/default/XXX/XXX-json.log
Displaying them with nerdctl logs -f
will definitely hang though - but just breaking and calling nerdctl logs -f
again will resume appropriately.
I know next to nothing about logs here, so, take whatever I say with a grain of salt, but it feels to me the problem would rather be somewhere in viewLogsJSONFileDirect
(and I have no idea why your change would "fix" it @CerberusQc )
Will look more on my side.
Taking that back.
I cannot repro.
nerdctl logs -f
will hang for some time, but given enough more messages, it will resume.
@CerberusQc could you try on your side with my ridiculous curl loop and see what you get?
(not saying there is no bug - the "freeze" is concerning - but it does not feel like it gets hung forever)
Hey @apostasie, thanks for the follow-up. Small messages are processed correctly; this only happens when you send bigger messages. Originally it happened on a Factorio game server where a certain state causes a huge JSON blob to be dumped in the server logs, and that left the whole server stuck (sometimes for days) until I restarted the container.
while true; do curl -H "Content-Type: application/x-www-form-urlencoded" -d "message_size=100000" -X POST http://127.0.0.1:5555; done
(note the added 0 in the message size). With the bigger message it gets stuck instantly; I have waited over 10 minutes now and the process is still stuck.
I did some tests and messages get stuck around 64 * 1024 bytes, which is exactly the default MaxScanTokenSize of the bufio library.
from bufio/scan.go
const (
	// MaxScanTokenSize is the maximum size used to buffer a token
	// unless the user provides an explicit buffer with [Scanner.Buffer].
	// The actual maximum token size may be smaller as the buffer
	// may need to include, for instance, a newline.
	MaxScanTokenSize = 64 * 1024

	startBufSize = 4096 // Size of initial allocation for buffer.
)

// NewScanner returns a new [Scanner] to read from r.
// The split function defaults to [ScanLines].
func NewScanner(r io.Reader) *Scanner {
	return &Scanner{
		r:            r,
		split:        ScanLines,
		maxTokenSize: MaxScanTokenSize,
	}
}
This would explain why my "fix" is working, since I am completely bypassing the bufio.Scanner and using my own buffer.
Just did another test just to confirm:
"message_size=65000" is working perfectly for over 5 minutes "message_size=66000" get stuck in the first loop
At least today I learned that 10000 is not 100000. 🇫🇷
More seriously, let's fix this. @CerberusQc what about you just send a PR? Maintainers here will be able to have a look and chip in.
I am a rookie in Go, so I have no idea if my fix is a good solution, but yeah, I can do a PR tomorrow (almost 1 am here) and I guess the maintainers will guide me. I am not familiar with the process; is there a guideline or something I need to follow for a public repo like nerdctl? Thanks!
I am a rookie in Go
We all started there.
so I have no idea if my fix is a good solution, but yeah, I can do a PR tomorrow (almost 1 am here) and I guess the maintainers will guide me. I am not familiar with the process; is there a guideline or something I need to follow for a public repo like nerdctl? Thanks!
You can read https://github.com/containerd/project/blob/main/CONTRIBUTING.md if you want.
The gist of it is really simple:
git commit -s
(better: sign off and also sign with your GPG key, git commit -S -s)

Hey @apostasie, some news here: before trying to do my own implementation, I decided to read the bufio documentation, and it appears this is not a bug but a "feature". The Scanner has a fixed maximum token size, and when a token exceeds it, the Scanner stops unrecoverably and is left in a bad state, so the process hangs silently.
The documentation states the following:
Scanning stops unrecoverably at EOF, the first I/O error, or a token too large to fit in the Scanner.Buffer. When a scan stops, the reader may have advanced arbitrarily far past the last token. Programs that need more control over error handling or large tokens, or must run sequential scans on a reader, should use bufio.Reader instead.
So basically, the solution would be to switch to a Reader instead of a Scanner. I did a quick test with:
processLogFunc := func(reader io.Reader, dataChan chan string) {
	defer wg.Done()
	defer close(dataChan)
	scanner := bufio.NewReader(reader)
	for {
		s, err := scanner.ReadString('\n')
		if err != nil {
			log.L.Errorf("failed to read log: %v", err)
			return
		}
		dataChan <- s
	}
}
And it was working fine. I will do a PR with this and link this issue.
@CerberusQc really nice catch. Who knew that reading documentation could actually help with this software thing. :p
I just had a very quick look at other uses of bufio.Scanner
in the codebase.
We might want to investigate these - especially the soci and cosign wrappers.
About your code above, you should test for EOF - right now, you are dropping the last read, right? Anyhow, let's move to PR when you feel ready so we can discuss code inline.
@apostasie thanks again -> https://github.com/containerd/nerdctl/pull/3366 I have added the handling of EOF. Not sure it covers all the cases, but the unit test fails on the current master and passes with this fix.
Fixed by #3366
@CerberusQc you are a contributor now 🎉
Can you close this ticket as completed?
Hey! insert proud face here Big thanks again for guiding me through this process @apostasie .
It was my first time, and since it can be intimidating, I was always frightened to contribute to a project. I will try to be more active in the open source community.
Can't wait to update my version of nerdctl with the official release to fix my factorio server :D
Hey! insert proud face here Big thanks again for guiding me through this process @apostasie .
You are welcome. ;)
It was my first time, and since it can be intimidating, I was always frightened to contribute to a project. I will try to be more active in the open source community.
Pick the communities that are friendly AND active (if I am going to take shit, or get crickets for months, I usually fork for my own use instead). In most cases, opening good, detailed bug reports with reproducers, and sending decently structured patches with tests and intelligible commits, will be highly welcomed by most projects.
Can't wait to update my version of nerdctl with the official release to fix my factorio server :D
I am into "Satisfactory" :-).
Description
When the container's process uses stdout/stderr to log a message that is too big, the process hangs forever without crashing.
I have created a small container to debug this, but originally it was happening on a game server started with nerdctl.
Steps to reproduce the issue
nerdctl run -d -p 5555:5555 cerberusqc/messager:1.2
Describe the results you received and expected
I expect the logs to be able to process any size of message passed to the stdout/stderr without blocking my main process.
I think I have identified the problem as probably coming from
pkg/logging/logging.go
I did a quick test replacing the buffer with a dynamic one, just to confirm.
It looks like it now processes any size of logs without blocking the main process, but I definitely don't have the experience in Go to know if this is the real reason.
What version of nerdctl are you using?
1.7.6
Are you using a variant of nerdctl? (e.g., Rancher Desktop)
None
Host information
Linode VM Dedicated 16 GB 8 vCPU 16 GB Ubuntu 22.04 LTS
CONTAINERD_VERSION="1.7.19" RUNC_VERSION="1.1.13" CNI_VERSION="1.5.1"