fsync() contention when container logs get rotated can block containers writing to stdout/stderr

rmt commented 1 year ago

What happened?

When kubelet rotates logs and asks the remote runtime to reopen logs, a blocking fsync() occurs in conmon. The pipe buffers can fill in the meantime while waiting for fsync() to complete, blocking writes to stdout/stderr.

conmon has an option called --no-sync-log which can alleviate this problem (and should arguably be the default) However, podman does not support setting this. And neither does cri-o, which uses podman.

The required changes should probably be driven by CRI-O.

This will usually not be noticed on a local disk or with low-volume logs, and very intermittently if the max-log-file-size is sufficiently large. However, we have logs on an AWS EBS (GP3) volume and there is latency involved, and the max log file size was quite low given the amount of logs produced, which made the issue very obvious and got us to dig into it.

I suspect that this is that this is behind a significant number of (usually minor) latency spikes that people see.

What did you expect to happen?

Writing to stdout from a container should not block under normal circumstances

How can we reproduce it (as minimally and precisely as possible)?

Do lots writes and fsync()s on the container log volume while producing lots logs in a tight loop from one or more k8s containers, and set the max-log-file-size lower. During log rotation / when the pipe is blocked, fewer log lines will be produced. In a real world scenario, this might be HTTP requests that are delayed.

Anything else we need to know?

Calling fsync on the critical path when writing log files is probably a bad thing

CRI-O and Kubernetes version

```console $ crio --version crio version 1.23.5 Version: 1.23.5 GitCommit: d12c2842e9e4d051a2496e1740f846ffa22b3f1d GitTreeState: clean BuildDate: 2023-01-10T21:30:25Z GoVersion: go1.17.5 Compiler: gc Platform: linux/amd64 Linkmode: dynamic BuildTags: exclude_graphdriver_devicemapper, seccomp SeccompEnabled: true AppArmorEnabled: false ``` ```console $ kubectl version Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.14", GitCommit:"3321ffc07d2f046afdf613796f9032f4460de093", GitTreeState:"clean", BuildDate:"2022-11-09T13:40:19Z", GoVersion:"go1.17.13", Compiler:"gc", Platform:"linux/amd64"} Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.14", GitCommit:"3321ffc07d2f046afdf613796f9032f4460de093", GitTreeState:"clean", BuildDate:"2022-11-09T13:32:47Z", GoVersion:"go1.17.13", Compiler:"gc", Platform:"linux/arm64"} ```

OS version

```console $ cat /etc/os-release NAME="Rocky Linux" VERSION="8.7 (Green Obsidian)" $ uname -a Linux ip-10-118-82-234.ec2.internal 4.18.0-425.10.1.el8_7.x86_64 #1 SMP Thu Jan 12 16:32:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux ```

Additional environment details (AWS, VirtualBox, physical, etc.)

AWS GP3 with shared volume group for /var/log/pods and /var/log/containers, which doesn't do anything good for I/O contention or consistency, of course.

rmt commented 1 year ago

Ping? I can also open bugs against the podman and/or conmon projects, if that is desired.

haircommander commented 1 year ago

sorry for the delay @rmt! I think having an option in cri-o to not sync the log makes sense to me. I don't think we can make it the default because I would expect admins to assume their logs will persist and sudden reboots would cause changes to be lost without the fsyncs. That said, I definitely think it could be a configurable option.

long term, a better way to handle this is a double buffer like in https://github.com/containers/conmon/issues/38, though at this point that's more likely to happen in conmon-rs.

@arcolife do you have any interest in taking this on?

arcolife commented 1 year ago

thanks @haircommander I'll pick this up very soon. Let me assign myself for now.

github-actions[bot] commented 1 year ago

A friendly reminder that this issue had no activity for 30 days.

PaulFurtado commented 11 months ago

I don't think we can make it the default because I would expect admins to assume their logs will persist and sudden reboots would cause changes to be lost without the fsyncs.

Fwiw, I'm really not so sure that this is an expectation that admins have of application logs during a system crash. Most application logging frameworks I know of do not fsync log files, and often write asynchronously via a queue to avoid blocking on IO. Additionally, if the system goes down then all the logs that were sitting in the pipe buffers or conmon buffers would also be lost along with any logs written since the last fsync, so it's not like the logs are lossless already.

On the other hand, blocking the stdout/stderr pipes of a container can cause serious performance and reliability problems that can be challenging to debug, so I'd imagine many more people would be harmed by the current blocking behavior without even knowing it unless they do a serious deep-dive and trace their application during stalls.

I'm not actually making a strong push for disabling fsync by default, but I do think both options could be considered a sane default in the context of a reliable container runtime.

Regardless of the default choice, something I'd really love to see though is an example config intended for high-performance with the justification for settings documented. Before cri-o, we had years of hard-earned experience under docker, and making the move to cri-o has been painful. Such a config would contain settings like this, async stats, etc.

haircommander commented 11 months ago

I see your point, and I agree! I think long term we can consider having fsync off by default. I think it should begin configurable, and start off as on by default to allow us to try it out and make sure there are no regressions. If we find it's better, we can move it to the default.

rmt commented 11 months ago

I'd honestly put up a PR to disable fsync on log rotation completely with a clear explanation as to why, and see if anyone argues otherwise.

I suspect that this fsync-on-rotate was for an imaginary use-case - the implementer probably thought it made sense, but it doesn't.

Important logs are, in almost all cases, shipped off to dedicated log storage systems and the current fsync-on-rotate behaviour can get in the way of that by introducing delays to writing logs, plus of course in the extreme scenario of blocking container's stdout/stderr and further impacting filesystem performance.

haircommander commented 11 months ago

I did a bit more digging. I haven't found details about the exact reason we made the behavior configurable in conmon, but I did find the release notes of https://github.com/containers/conmon/releases/tag/v2.0.20 . In situations where a node suddenly shuts off, I think this causes the kubelet to fail to read the logs which has some bad effect I can't remember exactly.

I am still open to making it configurable, but I would like to see more resilliency tested in the kubelet before we make it the default.

arcolife commented 11 months ago

I'll start by introducing the flag, a few tests and we can have a discussion about its permanency thereafter.

Thanks for the pointers so far and please continue to link any relevant issues/discussions here until then.

github-actions[bot] commented 10 months ago

A friendly reminder that this issue had no activity for 30 days.

saschagrunert commented 8 months ago

Is this still in progress?

arcolife commented 8 months ago

just haven't found the time yet. I'll make time this weekend.

github-actions[bot] commented 7 months ago

A friendly reminder that this issue had no activity for 30 days.

github-actions[bot] commented 4 months ago

Closing this issue since it had no activity in the past 90 days.

rmt commented 4 months ago

I would recommend keeping this open until fixed. It is definitely a performance bottleneck for many workloads, even if it's unlikely to be noticed / pinpointed as a problem by most engineers.

kwilczynski commented 4 months ago

/reopen

openshift-ci[bot] commented 4 months ago

@kwilczynski: Reopened this issue.

In response to [this](https://github.com/cri-o/cri-o/issues/6743#issuecomment-1880815988): >/reopen Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

kwilczynski commented 4 months ago

/remove-lifecycle stale /remove-lifecycle rotten

github-actions[bot] commented 3 months ago

A friendly reminder that this issue had no activity for 30 days.

kwilczynski commented 2 weeks ago

/remove-lifecycle stale

kwilczynski commented 2 weeks ago

/assign haircommander /assign kwilczynski

cri-o / cri-o