flatcar / Flatcar

Flatcar project repository for issue tracking, project documentation, etc.
https://www.flatcar.org/
Apache License 2.0

Performance degradation in 3033.2.0? #597

Open dee-kryvenko opened 2 years ago

dee-kryvenko commented 2 years ago

Description

We are upgrading from 2905.2.4 to 3033.2.0 on AWS managed with Kops using the following AMI:

data "aws_ami" "flatcar" {
  owners      = ["075585003325"]
  most_recent = true

  filter {
    name   = "architecture"
    values = ["x86_64"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }

  filter {
    name   = "name"
    values = ["Flatcar-stable-${var.flatcar_version}*"]
  }
}

And we are seeing what appears to be a performance hit. Our workloads have tight resource limits:

        resources:
          requests:
            memory: 128Mi
            cpu: 50m
          limits:
            memory: 128Mi
            cpu: 500m

Some of them (specifically, the Java SpringBoot based ones) are simply unable to start after the upgrade. They take ages to initialize the Java code until the probe backs off and restarts the container. We have ruled out everything else (kops version, K8s version, etc.); swapping the node group AMI from 2905.2.4 to 3033.2.0 alone is what triggers this behavior, under the same resource constraints and probe configuration.
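For illustration, the affected deployments use liveness probes roughly along these lines (endpoint, port and thresholds are simplified placeholders, not our exact config):

        livenessProbe:
          httpGet:
            path: /actuator/health   # placeholder endpoint
            port: 8080               # placeholder port
          initialDelaySeconds: 60
          periodSeconds: 10
          failureThreshold: 3        # roughly 90s after start, a still-initializing container gets restarted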

Impact

We have detected this in our test clusters, and we are not able to upgrade our prod clusters. If a bunch of workloads are simply unable to start after the rolling upgrade in prod, we will have a major outage on our hands.

Environment and steps to reproduce

K8s 1.20.14, kops 1.20.3, AWS.

Expected behavior

I'd expect containers to start with the same probes and resource constraints as they did on previous versions.

Additional information

N/A

jepio commented 2 years ago

The first thing that comes to mind is the switch to cgroupv2 - https://www.flatcar.org/docs/latest/container-runtimes/switching-to-unified-cgroups/. If I were you I would check if switching back allows the workloads to start with the newer Flatcar version.
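For reference, the "starting new nodes with legacy cgroups" section of that doc boils down to a small provisioning config. A minimal sketch along these lines (Butane variant/version assumed, and the flag file plus containerd drop-in are reproduced from that doc from memory, so verify the exact snippet there):

# Butane config; transpile to Ignition before passing it as userdata.
variant: flatcar
version: 1.0.0
storage:
  files:
    # Flag file telling Flatcar to boot with legacy (v1) cgroups.
    - path: /etc/flatcar-cgroupv1
      mode: 0444
systemd:
  units:
    # Point containerd at the cgroupfs variant of its config shipped with Flatcar.
    - name: containerd.service
      dropins:
        - name: 10-use-cgroupfs.conf
          contents: |
            [Service]
            Environment=CONTAINERD_CONFIG=/usr/share/containerd/config-cgroupfs.toml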

Are you seeing the pods getting OOM killed? Cgroup v2 and legacy cgroups perform memory accounting differently (legacy cgroups didn't account for all of it), so there is no guarantee that you will be able to use the same value for the memory limit. CPU accounting should be similar enough for there not to be a difference.

dee-kryvenko commented 2 years ago

It is not getting OOM killed - it is just veeeery slow to start. Java is known to be slow to start, but my SpringBoot applications produce something like two lines of init logs and then there is no activity at all until the container gets killed by the probe timeout. That makes me think it is either IO or CPU throttled, but I guess it might be due to a lack of memory too.

Is there any human-readable explanation of what exactly changed in cgroups v2 with regard to memory usage?

jepio commented 2 years ago

I don't think you'll find a human-readable explanation; it's spread out over many blog posts and conference talks.

The best resource is probably this section, https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#memory, and the memory.stat list. The biggest changes to the memory controller in cgroup v2 are that kernel memory allocations, TCP socket buffers and block IO writeback buffers are now included in the limit.

If you increase the memory limit, does the application start correctly? Then you could determine new limits by looking at the memory.current file.
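As a concrete (hypothetical) test, bumping just the limit in the resources block from the issue would tell you whether memory accounting is the problem; once the pod runs, memory.current in its cgroup shows how much the new accounting really needs:

        resources:
          requests:
            memory: 128Mi
            cpu: 50m
          limits:
            memory: 256Mi   # arbitrary test value, raised from 128Mi only to rule the limit in or out
            cpu: 500m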

t-lo commented 2 years ago

Hi @dee-kryvenko , we had other folks reporting Java slowness with old Java versions (in the case of the report I am referring to, Java 8) in combination with cgroups v2. The issues were resolved by updating the Java runtime and/or switching the affected nodes back to cgroups v1.

Would you mind giving this a go and getting back to us with the results?

jepio commented 2 years ago

There are cases where old Java runtimes don't know how to parse cgroup v2 data and don't configure heap size and thread-pool sizes optimally for the cgroup limits. That could be it, and the only solution would be to update the Java runtime or switch to cgroup v1 like @t-lo mentioned.
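If you want to check which of your runtimes actually detect the cgroup limits, a throwaway pod along these lines (pod name and image are placeholders; on reasonably recent JDKs, -XshowSettings:system prints the memory and CPU limits the JVM picked up from the container) should make it obvious:

apiVersion: v1
kind: Pod
metadata:
  name: jvm-cgroup-check          # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: check
      image: amazoncorretto:11    # substitute the affected app's base image
      command: ["java", "-XshowSettings:system", "-version"]
      resources:
        requests:
          memory: 128Mi
          cpu: 50m
        limits:
          memory: 128Mi
          cpu: 500m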

dee-kryvenko commented 2 years ago

Hmmm thank you @t-lo and @jepio - I think all applications we experienced issues with were java, but I am pretty sure some of them were running Amazon Corretto 11. We have rolled back to the older flatcar for the time being, but for the next upgrade attempt this is definitely something we'll look at.

t-lo commented 2 years ago

Thanks for getting back to us @dee-kryvenko .

To ensure your issue is actually caused by cgroups v2, it would also be very helpful if you could run 3033.2.0 in cgroups v1 mode (see https://www.flatcar.org/docs/latest/container-runtimes/switching-to-unified-cgroups/#starting-new-nodes-with-legacy-cgroups) and validate if you're still hitting performance issues.

On a more general note, the maintainers team is currently investigating options to make it easier to keep using cgroups v1 by default in future releases. Stay tuned!

sayanchowdhury commented 1 year ago

@dee-kryvenko Are you still facing the reported issue? If not, can we close this issue?

dee-kryvenko commented 1 year ago

We have since moved away from kops and Flatcar, so no, we are not having this issue anymore.