aws / amazon-ecs-service-connect-agent

Amazon ECS Service Connect Agent
Apache License 2.0

Bug: nofile soft limit on EKS Fargate causes connection limits and crashes #71

Closed visit1985 closed 2 months ago

visit1985 commented 6 months ago

Summary

We have workloads running on EKS Fargate with an aws-appmesh-envoy sidecar injected by the AWS App Mesh Controller. The appnet agent process (PID 1) has a nofile soft limit of 65535, while the forked envoy process has a nofile soft limit of only 1024.

kubectl exec -i -t -n default example-5ff7dbfc5d-strcr -c envoy -- sh
sh-4.2$ cat /proc/1/cmdline; echo
/usr/bin/agent
sh-4.2$ grep open /proc/1/limits
Max open files            65535                65535                files
sh-4.2$ cat /proc/31/cmdline; echo
/usr/bin/envoy-c/tmp/envoy-config-459706937.yaml-linfo--drain-time-s20
sh-4.2$ grep open /proc/31/limits
Max open files            1024                 65535                files

This imposes a limit of roughly 480 concurrent TCP connections, since a file handle is created for each ingress and each egress connection. Reaching the limit causes the envoy process to crash and be restarted by the appnet agent (#181), which causes an outage.
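For anyone wanting to check their own pods, here is a minimal sketch of the arithmetic. It parses the "Max open files" line from /proc/<pid>/limits and derives a rough connection ceiling. The function names are made up for illustration, and the two-descriptors-per-connection figure is an assumption based on the ingress/egress description in this report. The result (512 for a soft limit of 1024) is an upper bound; the gap to the observed ~480 is presumably descriptors envoy holds for listeners, logs and similar, which this sketch does not model.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the agent or envoy images.

# Print the soft "Max open files" limit from /proc/<pid>/limits content on stdin.
nofile_soft() {
    awk '/^Max open files/ { print $4 }'
}

# Print the hard "Max open files" limit from the same input.
nofile_hard() {
    awk '/^Max open files/ { print $5 }'
}

# Rough upper bound on proxied connections for a given nofile limit.
# Assumption: each connection costs two descriptors (one ingress, one egress).
max_proxied_connections() {
    echo $(( $1 / 2 ))
}
```

Inside the envoy container this could be used as `max_proxied_connections "$(nofile_soft < /proc/31/limits)"`, where 31 is the envoy PID from the session above and will differ per pod.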

Steps to Reproduce

Please refer to support case 170713370901828 for this.

Are you currently working around this issue?

We are unable to work around this issue, because the appnet agent seems to be closed source.

karanvasnani commented 6 months ago

Thanks for your patience. We are continuing to track this investigation as part of https://github.com/aws/aws-app-mesh-roadmap/issues/489

karanvasnani commented 3 months ago

Re-opening this issue since the fix hasn't been released yet. As an update, we experienced delays in our release and are currently working on a new release which will include this fix. Will share an update as soon as we have one.

liubnu commented 2 months ago

Closing in favor of https://github.com/aws/aws-app-mesh-roadmap/issues/492