aws / amazon-vpc-cni-k8s

Networking plugin repository for pod networking in Kubernetes using Elastic Network Interfaces on AWS
Apache License 2.0

IPv6 containers experience connectivity issues with large simultaneous file downloads #2817

Open chen-anders opened 5 months ago

chen-anders commented 5 months ago

What happened:

Large simultaneous downloads stall and eventually fail with a "connection reset by peer" error. Sometimes we also see TLS connection errors and DNS resolution errors, which cause some downloads to fail immediately.

These errors only affect downloads from IPv6 servers/endpoints. IPv4 works perfectly fine.
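For anyone who wants to check the IPv4/IPv6 split on their own cluster, here is a minimal sketch that fetches the same URL once per address family. It assumes GNU wget (its `-4` and `-6` options restrict which address family is used); `fetch_family` is a hypothetical helper name and the URL in the usage comment is a placeholder, not one taken from our Procfile.

```shell
# Fetch a URL over a single address family and print the outcome.
# family: 4 or 6, passed to GNU wget as -4 (IPv4 only) or -6 (IPv6 only).
# fetch_family is a hypothetical helper, not part of any tool in this thread.
fetch_family() {
    family="$1"
    url="$2"
    if wget "-$family" -q -O /dev/null "$url"; then
        echo "IPv$family: ok"
    else
        # $? here is the exit status of the wget call in the condition.
        echo "IPv$family: failed (wget exit status $?)"
    fi
}

# Usage (placeholder URL, replace with a dual-stack endpoint you control):
# fetch_family 4 "https://example.com/large-file"
# fetch_family 6 "https://example.com/large-file"
```

A single fetch per family will probably not trigger the stall by itself, since we only saw it under parallel load, but it is a quick way to confirm that both families resolve and connect at all.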

Example error output

Sometimes we see errors around establishing connections over HTTPS:

test9 | Connecting to embed-ssl.wistia.com (embed-ssl.wistia.com)|2600:9000:244d:7800:1e:c86:4140:93a1|:443... connected.
test9 | Unable to establish SSL connection.
test9 | exit status 4
test3 | Resolving embed-ssl.wistia.com (embed-ssl.wistia.com)... failed: Try again.
test3 | wget: unable to resolve host address 'embed-ssl.wistia.com'
test3 | exit status 4

We host-mounted the CNI logs on the hosts where we ran the tests, but didn't see any related log entries during testing.

What you expected to happen:

Downloads complete without connection errors

How to reproduce it (as minimally and precisely as possible):

We have a Procfile that runs 9 downloads of a 700MB file in parallel.
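If you would rather not use hivemind, the same kind of load can be approximated in plain shell. This is a rough sketch under stated assumptions, not the contents of our Procfile: `run_parallel` is a hypothetical helper that starts N copies of a command in the background and counts how many exit non-zero, and `DOWNLOAD_URL` is a placeholder.

```shell
# Run the same command COUNT times in parallel and report how many copies failed.
# Usage: run_parallel COUNT COMMAND [ARGS...]
# run_parallel is a hypothetical helper written for illustration.
run_parallel() {
    count="$1"
    shift
    pids=""
    i=1
    while [ "$i" -le "$count" ]; do
        "$@" &                  # start one copy in the background
        pids="$pids $!"         # remember its PID so we can wait on it
        i=$((i + 1))
    done
    failed=0
    for pid in $pids; do
        # wait returns the exit status of that background job
        wait "$pid" || failed=$((failed + 1))
    done
    echo "failed=$failed of $count"
    [ "$failed" -eq 0 ]
}

# Usage (DOWNLOAD_URL is a placeholder for a large file on an IPv6 endpoint):
# run_parallel 9 wget -q -O /dev/null "$DOWNLOAD_URL"
```

Unlike hivemind, this does not prefix output per process, so drop `-q` and redirect each job to its own log file if you need to see which download stalled.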

Debian Slim Container

Launch a container: kubectl run -it --rm ipv6-reset-test-debian --image public.ecr.aws/debian/debian:bullseye-slim --command -- bash

apt-get update && apt-get install -y wget
ARCH="$(arch | sed s/aarch64/arm64/ | sed s/x86_64/amd64/)"
wget https://github.com/wistia/hivemind/releases/download/v1.1.1/hivemind-v1.1.1-wistia-linux-$ARCH.gz
gunzip hivemind-v1.1.1-wistia-linux-$ARCH.gz
mv hivemind-v1.1.1-wistia-linux-$ARCH hivemind
chmod +x hivemind
wget https://raw.githubusercontent.com/wistia/eks-ipv6-reset-example/main/Procfile
./hivemind -W Procfile

Alpine Container

Launch a container: kubectl run -it --rm ipv6-reset-test-debian --image public.ecr.aws/docker/library/alpine:3.19.1 --command -- ash

apk add wget # use non-busybox wget
ARCH="$(arch | sed s/aarch64/arm64/ | sed s/x86_64/amd64/)"
wget https://github.com/wistia/hivemind/releases/download/v1.1.1/hivemind-v1.1.1-wistia-linux-$ARCH.gz
gunzip hivemind-v1.1.1-wistia-linux-$ARCH.gz
mv hivemind-v1.1.1-wistia-linux-$ARCH hivemind
chmod +x hivemind
wget https://raw.githubusercontent.com/wistia/eks-ipv6-reset-example/main/Procfile
./hivemind -W Procfile

Anything else we need to know?:

The environment is a dual-stack IPv4/IPv6 VPC. We've been able to reproduce this on nodes in both public and private subnets.

Environment: Kubernetes Versions:

Reproduced on AL2, Ubuntu, and Bottlerocket nodes (EKS managed node groups) with the following kernel versions:

- AL2: 5.10.209-198.858.amzn2.aarch64 / 5.10.209-198.858.amzn2.x86_64

Reproduced on AWS VPC CNI versions:

Instance types used:

jdn5126 commented 4 months ago

@chen-anders I suggest filing an AWS support case, as the complexity of this issue will likely require debug sessions and cluster access.

In the meantime, I recommend collecting the node logs from the AL2 reproduction by executing the following bash script: https://github.com/awslabs/amazon-eks-ami/blob/main/log-collector-script/linux/eks-log-collector.sh

chen-anders commented 4 months ago

Hi @jdn5126,

We don't have an AWS support plan that would allow us to file a technical support case. In the meantime, I'm going to be working with the team to try to get the requisite logs you're asking for. The production workload currently runs on Bottlerocket OS - is there a similar log collector script we can use there?

jdn5126 commented 4 months ago

I see that Bottlerocket has a section on logs: https://github.com/bottlerocket-os/bottlerocket#logs, but it does not look like it collects everything we would need. I wonder if we can use the same strategy laid out there to execute the EKS AMI bash script.

github-actions[bot] commented 2 months ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

acj commented 2 months ago

Sorry for the delay on our end. We're still planning to collect and share logs.

acj commented 2 months ago

We've repeated our tests over the past few days and are not able to repro the download stall anymore. We haven't made any related changes to our infrastructure and are still puzzled by the behavior.

A few notes for anyone who might run into the same problem:

Hopefully this is resolved. Thanks for your help!