bottlerocket-os / bottlerocket

An operating system designed for hosting containers
https://bottlerocket.dev

NodePort services inaccessible/blocked by iptables #3732

Open nike21oct opened 10 months ago

nike21oct commented 10 months ago

Hi, I have an EKS cluster that uses the Bottlerocket AMI with nginx as the ingress controller. After I applied the iptables rules below via a bootstrap container, my application stopped being reachable from outside the cluster: the ingress no longer works and the nginx ingress controller pod went into CrashLoopBackOff. In the AWS target group behind the nginx controller's load balancer, the Protocol : Port is TLS: 32443 and the health check uses HTTP on port 32002. What should I do? Please help me here.

#!/usr/bin/env bash

# Flush iptables rules
iptables -F

# 3.4.1.1 Ensure IPv4 default deny firewall policy (Automated)
iptables -P INPUT DROP
iptables -P OUTPUT DROP
iptables -P FORWARD DROP

# Allow inbound traffic for kubelet (so kubectl logs/exec works)
iptables -I INPUT -p tcp -m tcp --dport 10250 -j ACCEPT

# Adding nodeport of nginx ingress controller
iptables -I INPUT -p tcp -m tcp --dport 32443 -j ACCEPT # For TLS traffic
iptables -I INPUT -p tcp -m tcp --dport 32002 -j ACCEPT # For health checks
iptables -I INPUT -p tcp -m tcp --dport 32080 -j ACCEPT

# 3.4.1.2 Ensure IPv4 loopback traffic is configured (Automated)
iptables -A INPUT -i lo -j ACCEPT
iptables -A OUTPUT -o lo -j ACCEPT
iptables -A INPUT -s 127.0.0.0/8 -j DROP

# 3.4.1.3 Ensure IPv4 outbound and established connections are configured (Manual)
iptables -A OUTPUT -p tcp -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A OUTPUT -p udp -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A OUTPUT -p icmp -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -p tcp -m state --state ESTABLISHED -j ACCEPT
iptables -A INPUT -p udp -m state --state ESTABLISHED -j ACCEPT
iptables -A INPUT -p icmp -m state --state ESTABLISHED -j ACCEPT

# Flush ip6tables rules
ip6tables -F

# 3.4.2.1 Ensure IPv6 default deny firewall policy (Automated)
ip6tables -P INPUT DROP
ip6tables -P OUTPUT DROP
ip6tables -P FORWARD DROP

# Allow inbound traffic for kubelet on ipv6 if needed (so kubectl logs/exec works)
ip6tables -A INPUT -p tcp --destination-port 10250 -j ACCEPT

# 3.4.2.2 Ensure IPv6 loopback traffic is configured (Automated)
ip6tables -A INPUT -i lo -j ACCEPT
ip6tables -A OUTPUT -o lo -j ACCEPT
ip6tables -A INPUT -s ::1 -j DROP

# 3.4.2.3 Ensure IPv6 outbound and established connections are configured (Manual)
ip6tables -A OUTPUT -p tcp -m state --state NEW,ESTABLISHED -j ACCEPT
ip6tables -A OUTPUT -p udp -m state --state NEW,ESTABLISHED -j ACCEPT
ip6tables -A OUTPUT -p icmp -m state --state NEW,ESTABLISHED -j ACCEPT
ip6tables -A INPUT -p tcp -m state --state ESTABLISHED -j ACCEPT
ip6tables -A INPUT -p udp -m state --state ESTABLISHED -j ACCEPT
ip6tables -A INPUT -p icmp -m state --state ESTABLISHED -j ACCEPT

After applying these rules, my nginx ingress controller pod goes into CrashLoopBackOff.

These are the NodePort rules I added for the nginx ingress controller:

iptables -I INPUT -p tcp -m tcp --dport 32443 -j ACCEPT # For TLS traffic
iptables -I INPUT -p tcp -m tcp --dport 32002 -j ACCEPT # For health checks
iptables -I INPUT -p tcp -m tcp --dport 32080 -j ACCEPT

Are the rules I added above correct? Please help me with this.

I am following this documentation to apply the CIS benchmark to my EKS cluster: https://aws.amazon.com/blogs/containers/validating-amazon-eks-optimized-bottlerocket-ami-against-the-cis-benchmark/

gthao313 commented 10 months ago

@nike21oct Thanks for opening this issue! Would you mind sharing more details with us so we can try to reproduce the issue?

Which Bottlerocket AMI are you using? What is your instance type? What is your EKS cluster version?

Thanks!

dimitrisgiannopoulos commented 10 months ago

Hello @gthao313,

Thanks for looking into this! I can answer the questions on behalf of @nike21oct .

Bottlerocket AMI: Bottlerocket OS 1.18.0 (aws-k8s-1.24)
Instance type: m6i.2xlarge
EKS cluster version: 1.24

nike21oct commented 10 months ago

Hi @gthao313, did you get a chance to simulate the issue?

gthao313 commented 10 months ago

@nike21oct Sorry for the late reply. I'm still trying to reproduce this issue. I'll let you know if I need more information from you or once I've finished reproducing it.

nike21oct commented 10 months ago

Hi @gthao313, did you get a chance to simulate the issue?

nike21oct commented 10 months ago

Hi @gthao313, did you get a chance to simulate the issue?

gthao313 commented 10 months ago

@nike21oct I wasn't able to reproduce this issue last week. I am working on another priority item now, and I'll work on this issue as my next priority. Thank you! Please let me know if you have any concerns.

nike21oct commented 10 months ago

@gthao313 Thanks for the response. I am waiting for your input.

bcressey commented 10 months ago

@nike21oct - I haven't had a chance to dig into this and try to repro, but the behavior you're seeing is sufficiently unexpected that there may be something deeper going on beyond just a missing input rule.

What distro base image are you using for your bootstrap container to run the iptables commands? If it's a newer distro like AL2023 that defaults to the "nftables" backend for iptables, that could account for the behavior you're seeing. Bottlerocket uses the "legacy" backend for iptables, and the two backends don't mix. kube-proxy has logic to detect which backend is in use, but if the system is in a state where both backends were used, it wouldn't be able to correct that.

AL2 defaults to the "legacy" backend, so you could try that. Otherwise, some distros offer a way to switch between backends - Debian's iptables wiki shows some steps for that distro.
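One quick way to see which backend a candidate base image ships is to check iptables -V inside it; it prints either "(legacy)" or "(nf_tables)". A rough sketch (image tags here are only examples):

# Check which iptables backend a base image ships (tags are examples only)
docker run --rm amazonlinux:2 sh -c "yum -q -y install iptables >/dev/null && iptables -V"
# expect "(legacy)" - matches what the Bottlerocket host uses
docker run --rm alpine:3.19 sh -c "apk add -q iptables && iptables -V"
# expect "(nf_tables)" - will not mix with the host's legacy rules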

nike21oct commented 10 months ago

Hi @bcressey, I followed the GitHub link below to create my bootstrap container; per that documentation, they use Alpine as the base image in the Dockerfile.

bcressey commented 10 months ago

I don't see a link, but if I check alpine:latest they are using iptables with the "nftables" backend:

/ # apk add iptables
fetch https://dl-cdn.alpinelinux.org/alpine/v3.19/main/aarch64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.19/community/aarch64/APKINDEX.tar.gz
(1/4) Installing libmnl (1.0.5-r2)
(2/4) Installing libnftnl (1.2.6-r0)
(3/4) Installing libxtables (1.8.10-r3)
(4/4) Installing iptables (1.8.10-r3)
Executing busybox-1.36.1-r15.trigger
OK: 16 MiB in 19 packages
/ # iptables -V
iptables v1.8.10 (nf_tables)

If you use a Debian base image instead, you can switch iptables to the "legacy" backend:

❯ docker run -it --rm -u 0 debian:bookworm-slim
root@01a8f51b742c:/# apt-get update
...
Reading package lists... Done

root@01a8f51b742c:/# apt-get install iptables
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  libbsd0 libedit2 libip4tc2 libip6tc2 libjansson4 libmnl0 libnetfilter-conntrack3 libnfnetlink0 libnftables1 libnftnl11 libxtables12 netbase nftables
Suggested packages:
  firewalld kmod
The following NEW packages will be installed:
  iptables libbsd0 libedit2 libip4tc2 libip6tc2 libjansson4 libmnl0 libnetfilter-conntrack3 libnfnetlink0 libnftables1 libnftnl11 libxtables12 netbase nftables
0 upgraded, 14 newly installed, 0 to remove and 0 not upgraded.
Need to get 1144 kB of archives.
After this operation, 11.3 MB of additional disk space will be used.
Do you want to continue? [Y/n] y
...

root@01a8f51b742c:/# iptables -V
iptables v1.8.9 (nf_tables)

root@01a8f51b742c:/# update-alternatives --set iptables /usr/sbin/iptables-legacy
update-alternatives: using /usr/sbin/iptables-legacy to provide /usr/sbin/iptables (iptables) in manual mode

root@01a8f51b742c:/# update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy
update-alternatives: using /usr/sbin/ip6tables-legacy to provide /usr/sbin/ip6tables (ip6tables) in manual mode

root@01a8f51b742c:/# iptables -V
iptables v1.8.9 (legacy)

root@01a8f51b742c:/# ip6tables -V
ip6tables v1.8.9 (legacy)

This could be done at build time for your bootstrap container.
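For reference, a minimal Dockerfile sketch of that build-time approach (the script name iptables-rules.sh is just a placeholder for whatever your bootstrap container actually runs):

# Sketch: Debian-based bootstrap container pinned to the iptables "legacy" backend
FROM debian:bookworm-slim
RUN apt-get update \
    && apt-get install -y --no-install-recommends iptables \
    && update-alternatives --set iptables /usr/sbin/iptables-legacy \
    && update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy \
    && rm -rf /var/lib/apt/lists/*
# Placeholder name - copy in whatever rules script your bootstrap container runs
COPY iptables-rules.sh /usr/local/bin/iptables-rules.sh
RUN chmod +x /usr/local/bin/iptables-rules.sh
ENTRYPOINT ["/usr/local/bin/iptables-rules.sh"]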

nike21oct commented 10 months ago

Hi @bcressey, I forgot to paste the link. Below are the documentation link and the GitHub link for the Dockerfile and bootstrap container script.

AWS documentation for implementing the CIS benchmark: https://aws.amazon.com/blogs/containers/validating-amazon-eks-optimized-bottlerocket-ami-against-the-cis-benchmark/

GitHub link for the bootstrap container: https://github.com/aws-samples/containers-blog-maelstrom/tree/main/cis-bottlerocket-benchmark-eks/bottlerocket-cis-bootstrap-image

But I have a question: what difference does it make whether I use Alpine or Debian as the base image in the Dockerfile, apart from the "legacy" backend? My issue is that the iptables rules from the bootstrap script are applied, and they block all traffic by default as the script specifies. I am using nginx as the ingress controller in my EKS cluster, which creates an NLB in AWS and uses a NodePort to reach the target group; the targets are my worker nodes (EC2 instances). Even when I allow the NodePort in my iptables rules, it does not work as it should.

Block all traffic by default, as specified in the bootstrap script:

iptables -P INPUT DROP
iptables -P OUTPUT DROP
iptables -P FORWARD DROP

Allowing the NodePort of the nginx ingress controller:

iptables -I INPUT -p tcp -m tcp --dport 32443 -j ACCEPT # For TLS traffic
iptables -I INPUT -p tcp -m tcp --dport 32002 -j ACCEPT # For health checks
iptables -I INPUT -p tcp -m tcp --dport 32080 -j ACCEPT

So will it make a difference to use Alpine or Debian as the base image in the bootstrap container, and how?

bcressey commented 10 months ago

I am using nginx as the ingress controller in my EKS cluster, which creates an NLB in AWS and uses a NodePort to reach the target group; the targets are my worker nodes (EC2 instances). Even when I allow the NodePort in my iptables rules, it does not work as it should.

The iptables -P <X> DROP commands set a default drop behavior for the chain rather than a default allow. The important point is that this is just a default; it only applies if no rule in the chain matches.

kube-proxy will populate the chain with rules for node port services. If the system is working properly, these rules will take precedence over the default behavior - whether that's allow or deny.

iptables -L KUBE-NODEPORTS -n -v -t nat should show these node port rules.
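Something along these lines on the node shows both the kube-proxy chains and the default policy you set (a sketch, not an exhaustive check):

# kube-proxy's NodePort DNAT rules live in the nat table
iptables -t nat -L KUBE-NODEPORTS -n -v
# the filter table shows the INPUT policy (ACCEPT vs DROP) and which KUBE-* chains are hooked in
iptables -L INPUT -n -v --line-numbers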

Will it make a difference to use Alpine or Debian as the base image in the bootstrap container, and how?

Yes, for the reasons I mentioned above. The iptables command in the bootstrap container image needs to use the "legacy" backend. Both Alpine and Debian now default to the "nftables" backend. Debian can be set to use the "legacy" backend by running additional commands - the update-alternatives commands I showed.

If you use the "wrong" backend to configure the default behavior, then the system ends up in a confused state. The "nftables" backend in the kernel will only know about the default drop rule, and the "legacy" backend will only know about the node port rules.

You don't need to add node port rules to the bootstrap container, and I don't recommend that. The only rules you need are to allow access that nothing else would enable automatically - like SSH on 22/TCP, or kubelet on 10250/TCP.

If you fix the iptables command you're running in the bootstrap container to use the legacy backend, that should clear up the majority of problems you're seeing. Beyond that, you can allow traffic to kubelet if you want kubectl exec and kubectl logs to work, or to kube-proxy on 10249/TCP if you want to scrape its metrics.
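One way to confirm whether a node is already in that mixed state is to compare what the two backends each report, for example from a debug container that ships both variants (iptables-legacy-save and iptables-nft-save are part of the standard iptables 1.8.x packages):

# Rules written through the "legacy" backend
iptables-legacy-save | head -n 20
# Rules written through the "nftables" backend
iptables-nft-save | head -n 20
# If the default-drop policy shows up in one output and the KUBE-* chains in the other,
# the two backends have been mixed as described above.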

nike21oct commented 10 months ago

Hi @bcressey, thanks for your continued help. Your suggestion to use Debian as the base image with the update-alternatives commands in the Dockerfile got the nginx ingress controller pod into a Running state. But there is another issue: the targets (worker nodes) are not healthy. I have three targets in the target group and their health status is unhealthy. I also observed that for a while one or two of the target instances became healthy and then went unhealthy again after some time, so the target health is not stable; it flips between healthy and unhealthy.

To troubleshoot this I opened port 10254 in the iptables rules (this port is used by ingress-nginx-controller-metrics, exposed on NodePort 32002) and port 10249, which you suggested. Below is the iptables script used by my bootstrap container.

#!/usr/bin/env bash

# Flush iptables rules
iptables -F

# 3.4.1.1 Ensure IPv4 default deny firewall policy (Automated)
iptables -P INPUT DROP
iptables -P OUTPUT DROP
iptables -P FORWARD DROP

# Allow inbound traffic for kubelet (so kubectl logs/exec works)
iptables -I INPUT -p tcp -m tcp --dport 10250 -j ACCEPT

# These two rules I added myself
iptables -I INPUT -p tcp -m tcp --dport 10254 -j ACCEPT
iptables -I INPUT -p tcp -m tcp --dport 10249 -j ACCEPT

# 3.4.1.2 Ensure IPv4 loopback traffic is configured (Automated)
iptables -A INPUT -i lo -j ACCEPT
iptables -A OUTPUT -o lo -j ACCEPT
iptables -A INPUT -s 127.0.0.0/8 -j DROP

# 3.4.1.3 Ensure IPv4 outbound and established connections are configured (Manual)
iptables -A OUTPUT -p tcp -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A OUTPUT -p udp -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A OUTPUT -p icmp -m state --state NEW,ESTABLISHED -j ACCEPT
iptables -A INPUT -p tcp -m state --state ESTABLISHED -j ACCEPT
iptables -A INPUT -p udp -m state --state ESTABLISHED -j ACCEPT
iptables -A INPUT -p icmp -m state --state ESTABLISHED -j ACCEPT

# Flush ip6tables rules 
ip6tables -F

# 3.4.2.1 Ensure IPv6 default deny firewall policy (Automated)
ip6tables -P INPUT DROP
ip6tables -P OUTPUT DROP
ip6tables -P FORWARD DROP

# Allow inbound traffic for kubelet on ipv6 if needed (so kubectl logs/exec works)
ip6tables -A INPUT -p tcp --destination-port 10250 -j ACCEPT
ip6tables -A INPUT -p tcp -m tcp --destination-port 10254 -j ACCEPT
ip6tables -A INPUT -p tcp -m tcp --destination-port 10249 -j ACCEPT

# 3.4.2.2 Ensure IPv6 loopback traffic is configured (Automated)
ip6tables -A INPUT -i lo -j ACCEPT
ip6tables -A OUTPUT -o lo -j ACCEPT
ip6tables -A INPUT -s ::1 -j DROP

# 3.4.2.3 Ensure IPv6 outbound and established connections are configured (Manual)
ip6tables -A OUTPUT -p tcp -m state --state NEW,ESTABLISHED -j ACCEPT
ip6tables -A OUTPUT -p udp -m state --state NEW,ESTABLISHED -j ACCEPT
ip6tables -A OUTPUT -p icmp -m state --state NEW,ESTABLISHED -j ACCEPT
ip6tables -A INPUT -p tcp -m state --state ESTABLISHED -j ACCEPT
ip6tables -A INPUT -p udp -m state --state ESTABLISHED -j ACCEPT
ip6tables -A INPUT -p icmp -m state --state ESTABLISHED -j ACCEPT

I need your guidance on this as well, please.

ajpaws commented 10 months ago

@bcressey @nike21oct

I changed the iptables backend from nftables to legacy for both the bootstrap and validating containers in this PR: https://github.com/aws-samples/containers-blog-maelstrom/pull/116

nike21oct commented 9 months ago

Hi @bcressey, any idea how I can debug this? I need your help on this part as well.

bcressey commented 9 months ago

"How do I resolve a failed health check for a load balancer in Amazon EKS?" might have some useful steps to try.

Beyond that - if you comment out the "default drop" iptables commands in your bootstrap script and the health checks start passing, then that indicates another port needs to be opened. Possibly 80 or 443 if those are the target ports for your service.
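If that turns out to be the case, the extra rules would have the same shape as the kubelet one; for example (these port numbers are only a guess, substitute whatever the target group health check actually probes):

# Example only - open the port the target group health check actually probes
iptables -I INPUT -p tcp -m tcp --dport 80 -j ACCEPT
iptables -I INPUT -p tcp -m tcp --dport 443 -j ACCEPT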

nike21oct commented 9 months ago

Hi @bcressey, yes, after commenting out the "default drop" rules the health checks start passing. As you know, the Network Load Balancer runs its health check against port 32002, which maps to port 10254, so I opened port 10254 (the target port, not 80 or 443), uncommented the "default drop" rules again to check, and observed that the health checks start failing again. Below is the port definition in the service YAML:

ports:
  - name: metrics
    nodePort: 32002
    port: 10254
    protocol: TCP
    targetPort: metrics

iptables -I INPUT -p tcp -m tcp --dport 10254 -j ACCEPT

Another question: why do the health checks pass for some time and then fail again for the targets? The healthy state is not persistent; why does it behave this way?

nike21oct commented 9 months ago

Hi @bcressey, any ideas or input on this?

bcressey commented 9 months ago

According to https://github.com/aws-samples/containers-blog-maelstrom/issues/73, you are also using the Cilium CNI, which I am less familiar with in an AWS context.

"Migrating Cilium from Legacy iptables Routing to Native eBPF Routing in Production" has this quote from the Cilium release notes:

We introduced eBPF-based host-routing in Cilium 1.9 to fully bypass iptables and the upper host stack, and to achieve a faster network namespace switch compared to regular veth device operation.

"A note on Cilium's iptables usage" says:

To my surprise, cilium doesn’t periodically synchronize those rules like kube-proxy. If you somehow remove a rule in its custom chain, you have to add it back manually or restart cilium-agent.

So there are two avenues you should explore. If you're using the native eBPF based routing with Cilium, then you may not have any iptables rules related to Cilium at all. However, this might mean that the kernel will apply default-drop to any packets even if Cilium knows about them. In that case my conclusion would be that setting iptables to default-drop isn't compatible with using Cilium in this mode, and you just shouldn't combine them in this way. My advice would be to document it as an exception for compliance purposes and move on.

If you're using the legacy iptables based routing with Cilium, then you should have all the necessary iptables rules. However, if it's the case that they aren't reapplied periodically, and if it's also the case that you're changing iptables settings on existing nodes, then that may be erasing the rules that Cilium installed. This could also be a factor in the native eBPF based routing mode, if Cilium is meant to put fallback iptables rules in place.
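Two things worth checking from the Cilium side (assuming the standard cilium DaemonSet in kube-system; the exact status output varies by Cilium version):

# Which host-routing mode is the agent actually using? Look for the "Routing" / "Host Routing" line.
kubectl -n kube-system exec ds/cilium -- cilium status --verbose | grep -i routing
# If a later iptables change wiped Cilium's own rules, restarting the agents reinstalls them
kubectl -n kube-system rollout restart daemonset/cilium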

nike21oct commented 9 months ago

Hi @bcressey, I checked and found that we are using legacy host routing in the Cilium CNI, and yes, I am changing the iptables rules on my existing nodes and it is failing. I am stuck and do not understand how to solve the problem of the target health checks failing.