aws-samples / aws-gateway-load-balancer-tunnel-handler

AWS Gateway Load Balancer Tunnel Handler. See the blog post linked below for more information.

Race Condition on setting net.ipv4.conf.gwi-<eni>.rp_filter #3

Closed jak-sdk closed 1 year ago

jak-sdk commented 1 year ago

Environment Details:

I am currently running /opt/gwlbtun -c /opt/gwlbtun-nat-setup.sh, with /opt/gwlbtun-nat-setup.sh containing:

#!/bin/bash
echo "### Setting up two-armed NAT ###"
echo Mode is $1, In Int is $2, Out Int is $3, ENI is $4

echo 0 > /proc/sys/net/ipv4/conf/$2/rp_filter

Sourced from https://aws.amazon.com/blogs/networking-and-content-delivery/how-to-integrate-linux-instances-with-aws-gateway-load-balancer/

System details:

uname -a: Linux ip-10-20-140-27 5.19.0-1025-aws #26~22.04.1-Ubuntu SMP Mon Apr 24 01:58:15 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

AMI: ami-01dd271720c1ba44f

os-release: VERSION="22.04.2 LTS (Jammy Jellyfish)"

Issue Description

I typically see the first packet get through the newly opened interface, and rp_filter is set to 0 by the script. But after ~40 ms, rp_filter is set to 2.

Pinging every 10ms on a freshly started gwlbtun:

[root@ip-10-20-139-18 bin]# ping 159.79.82.2 -c10 -i0.01
PING 159.79.82.2 (159.79.82.2) 56(84) bytes of data.
64 bytes from 159.79.82.2: icmp_seq=1 ttl=247 time=92.9 ms
64 bytes from 159.79.82.2: icmp_seq=2 ttl=247 time=78.4 ms
64 bytes from 159.79.82.2: icmp_seq=3 ttl=247 time=63.9 ms
64 bytes from 159.79.82.2: icmp_seq=4 ttl=247 time=48.0 ms

--- 159.79.82.2 ping statistics ---
10 packets transmitted, 4 received, 60% packet loss, time 141ms
rtt min/avg/max/mdev = 48.044/70.846/92.960/16.694 ms, pipe 6

Here we see that the interface goes from not existing, to being created with rp_filter=0, to then being set to 2:

for i in {1..1000}; do date +"%T.%6N" && cat /proc/sys/net/ipv4/conf/gwi-khyhu33zjx6/rp_filter && echo && sleep 0.01; done
...
09:45:24.320715 cat: /proc/sys/net/ipv4/conf/gwi-khyhu33zjx6/rp_filter: No such file or directory
09:45:24.329948   0
09:45:24.424472   2
09:45:24.490558   2
09:45:24.515402   2
...

The workaround for this issue seems to be to delay the rp_filter update, e.g.

#!/bin/bash
echo "### Setting up two-armed NAT ###"
echo Mode is $1, In Int is $2, Out Int is $3, ENI is $4

sleep 0.1
echo 0 > /proc/sys/net/ipv4/conf/$2/rp_filter

However, this causes initial latency and isn't guaranteed to fix it: if the delayed update to 2 took longer than 100 ms, we would be broken again (and under load that could be possible, if not likely).
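A slightly more defensive variant (just a sketch, still racy in principle; it re-asserts the value in the background for a few seconds rather than fixing the root cause):

#!/bin/bash
echo "### Setting up two-armed NAT ###"
echo Mode is $1, In Int is $2, Out Int is $3, ENI is $4

# Re-assert rp_filter=0 every 100ms for ~5s in a background subshell, so a
# late write of 2 by whatever else touches the interface still gets corrected.
(
  for _ in $(seq 1 50); do
    echo 0 > "/proc/sys/net/ipv4/conf/$2/rp_filter"
    sleep 0.1
  done
) &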

Note also that the default net.ipv4.conf.default.rp_filter is set to 0 and this is not respected on new gwi-<eni> interfaces:

root@ip-10-20-140-27:~# sysctl -a | grep "net.ipv4.conf.default.rp_filter"
net.ipv4.conf.default.rp_filter = 0

Can someone help me understand what might be happening and how best to ensure rp_filter is set to 0? This feels like a bug but I'm not sure what is setting it to 2 yet.

Thanks! Jak

grayaw commented 1 year ago

Interesting. I double-checked the code (both posted here and in dev) and there is a race condition: the tunnel interface gets created by each tunnel thread as it launches (so the first one needs to get going before the entries exist). However, the code launches those threads asynchronously and then calls the create script, so if any of the threads hasn't gotten that far yet, the create script could be called before the device is present.

I'm poking at this as part of a small code rework that's underway to fix this issue.

Edit: Looking more at it, allocateHandle() is called as part of the parameter build for launching the threads, so the device should be there before the create script callback. I'm trying to reproduce this on a test instance so I can poke at this more.

jak-sdk commented 1 year ago

Hi, thanks for the response.

the create script could get called without the device present yet

I can collect diagnostics within the create script if that helps? e.g. ip a show tun
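For example, something along these lines (a rough sketch of what I could add; the log path and exact commands are placeholders):

#!/bin/bash
# Hypothetical diagnostic wrapper for the create hook: record the arguments,
# the interface state, and rp_filter at the moment the script runs.
{
  date +"%T.%6N"
  echo "args: $*"
  ip -br addr show "$2" 2>&1
  cat "/proc/sys/net/ipv4/conf/$2/rp_filter" 2>&1
} >> /var/log/gwlbtun-create-debug.log

echo 0 > "/proc/sys/net/ipv4/conf/$2/rp_filter"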

I suspect the create script is working correctly, as I briefly see the interface with rp_filter = 0; it's just that after ~40 ms it's set to 2. Do the async threads set this on the interface after creating it? I'm wondering if the script is running before they finish and they are setting it back to 2.

grayaw commented 1 year ago

Not explicitly. The sequence goes:

So not sure yet how this is happening. I'm poking at this and doing the same sort of script you're running:

while true; do date +"%X.%N" && cat /proc/sys/net/ipv4/conf/gwi-bWMfA7Uym7G/rp_filter; done 

I get:

17:10:59.588477537 cat: /proc/sys/net/ipv4/conf/gwi-bWMfA7Uym7G/rp_filter: No such file or directory
17:10:59.592640324 2
(quite a few lines)
17:10:59.747930861 2
17:10:59.750452571 0
17:10:59.752512843 0

Which is correct for what I have (the all/rp_filter is 2, and the CREATE script sets it to 0).

Does this happen on every try for you, or is it occasional?

I'm using Amazon Linux 2023 and wondering if Ubuntu is doing something different (maybe NetworkManager is doing something?). Any chance you can see if you can reproduce on AL2023?

jak-sdk commented 1 year ago

Thanks, I'll try to reproduce it on AL2023 tomorrow. If I can't then I'll see if I can find out what is changing the value with eBPF.
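One candidate for that, assuming the bcc tools are installed (bpfcc-tools on Ubuntu), would be to watch which processes open the rp_filter files; not necessarily what I'll end up using:

# Assumes bcc's opensnoop is available; any tracer that logs openat() would do.
sudo opensnoop-bpfcc -T | grep --line-buffered rp_filter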

Does this happen on every try for you, or is it occasional?

It was quite consistent, but I wasn't changing much between tests other than trimming the script down to identify the issue, and adding delay etc.

Thanks for looking into this

grayaw commented 1 year ago

I also just uploaded a dev branch that has a lot of code cleanups and some other optimizations - if you can try with that and see if the issue still occurs, that would be helpful.

jak-sdk commented 1 year ago

Hi, apologies but I haven't been able to reproduce yet on AL2023. I've had some trouble compiling the dev branch, specifically to do with the extra Boost dependency.
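In case it's useful, the usual dev packages would be something like the following; the exact Boost components the dev branch needs aren't stated here, so treat this as a guess:

# Guessing at the required packages; adjust to whatever Boost components the
# dev branch actually uses.
sudo apt-get install -y libboost-all-dev        # Ubuntu 22.04
sudo dnf install -y boost-devel gcc-c++ cmake   # Amazon Linux 2023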

jak-sdk commented 1 year ago

Still having trouble; when running on AL2023 I get:

/opt/gwlbtun: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.30' not found
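A generic check for this class of error (not specific to gwlbtun): list which GLIBCXX symbol versions the host's libstdc++ actually exports. If 3.4.30 isn't there, the binary was built against a newer toolchain than AL2023 ships and would need rebuilding on AL2023.

# Show the newest GLIBCXX symbol versions provided by the system libstdc++.
strings /lib64/libstdc++.so.6 | grep '^GLIBCXX_' | sort -V | tail -n 5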

I have been running the dev branch on Ubuntu, though, and unfortunately I still see the same issue.

jak-sdk commented 1 year ago

Hi,

I tracked this down to an issue with Ubuntu defaults.

Here we see the race between gwlbtun and systemd-sysctl:

# using eBPF to trace what processes are modifying rp_filter
1.102612000   15855  gwlbtun-nat-set    11   0 /proc/sys/net/ipv4/conf/gwi-khyhu33zjx6/rp_filter
1.196998000   15858  systemd-sysctl      3   0 /proc/sys/net/ipv4/conf/gwo-khyhu33zjx6/rp_filter
1.200674000   15857  systemd-sysctl      3   0 /proc/sys/net/ipv4/conf/gwi-khyhu33zjx6/rp_filter

# a different run where gwlbtun ran after systemd
0.713088000   15956  systemd-sysctl      3   0 /proc/sys/net/ipv4/conf/gwi-khyhu33zjx6/rp_filter
0.684564000   15955  gwlbtun-nat-set    11   0 /proc/sys/net/ipv4/conf/gwi-khyhu33zjx6/rp_filter
0.727283000   15957  systemd-sysctl      3   0 /proc/sys/net/ipv4/conf/gwo-khyhu33zjx6/rp_filter

Which led me to look a little closer at the defaults:

root@ip-10-20-131-201:~# grep "\*.*rp_filter" /usr/lib/sysctl.d/50-default.conf
net.ipv4.conf.*.rp_filter = 2

So my apologies but this globbed declaration caught me off guard!

After removing the line and restarting systemd-sysctl, we see:

0.892057000   17533  gwlbtun-nat-set    11   0 /proc/sys/net/ipv4/conf/gwi-khyhu33zjx6/rp_filter

With no interference from systemd :)
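A less invasive alternative to editing the vendor file would be a later-sorting override in /etc/sysctl.d, so package upgrades don't reintroduce the glob; this is just a sketch (the file name is arbitrary) and worth double-checking the ordering behaviour:

# Sketch: override the vendor glob instead of editing 50-default.conf.
cat <<'EOF' | sudo tee /etc/sysctl.d/99-gwlbtun-rp-filter.conf
net.ipv4.conf.*.rp_filter = 0
EOF
sudo systemctl restart systemd-sysctl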

Thank you for your time spent looking at this.

Regards, Jak

P.S. As a final comment, I don't quite understand the reason for the glob, since I thought that's what net.ipv4.conf.default.rp_filter exists for.