chancez opened this issue 2 years ago
Related: https://github.com/cilium/hubble/issues/713
> This leads me to believe that perhaps the clusterIP is used during initial connections, and some of the future flows are using the podIP. Or something of that sort.
Yes, Cilium translates the clusterIP to a podIP as early as possible (even on the socket level if SockLB is enabled). Therefore, the actual traffic on the wire will always contain the podIP.
While it's easy to map clusterIP to service, it's less obvious for podIPs. One problem is that a pod can have multiple services selecting it. Since we process each flow individually, the node where the second flow arrives might not know what (if any) service clusterIP was used to access the pod.
We could just add all matching services to the second flow, but that might also be confusing to users, since even if you connected directly to a podIP, Hubble would still tell you the flow event is associated with a service (even though no service was involved at all). On the other hand, the current behavior is also confusing, as the number of reports we get indicates.
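To make the ambiguity concrete, here is a minimal Go sketch (not Cilium code; the service names, labels, and selectors are made up) of a pod that is selected by two services at once:

```go
package main

import "fmt"

// matches reports whether every selector label is present on the pod.
func matches(podLabels, selector map[string]string) bool {
	for k, v := range selector {
		if podLabels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	podLabels := map[string]string{"app": "deathstar", "tier": "backend"}

	// Hypothetical services and their label selectors.
	serviceSelectors := map[string]map[string]string{
		"deathstar":         {"app": "deathstar"},
		"deathstar-metrics": {"app": "deathstar", "tier": "backend"},
	}

	for svc, selector := range serviceSelectors {
		if matches(podLabels, selector) {
			fmt.Printf("pod is selected by service %q\n", svc)
		}
	}
	// Both services match, so a flow that only carries the podIP cannot be
	// attributed to a single clusterIP after the fact.
}
```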
@gandro Right, I expect the mapping to always be clusterIP -> service (rather than the podIP -> service), but that would require doing the lookup for every packet I assume, which is probably why the flows only show the clusterIP early in the connection, and pod later.
I assume we also have no way to inform cilium the traffic's original destination was the clusterIP after it's been translated to the podIP? That feels a bit like connection tracking, which I believe cilium does already, so I'm curious if this is a performance trade off, or a complexity trade off, or just not possible with how we implement it.
@chancez
> Right, I expect the mapping to always be clusterIP -> service (rather than the podIP -> service), but that would require doing the lookup for every packet I assume, which is probably why the flows only show the clusterIP early in the connection, and pod later.
Ignoring Hubble for a moment:
With SockLB, yes, we perform the clusterIP-to-podIP translation as early as possible, so we don't have to do it for every packet. But if SockLB is not available (it's an optional feature), Cilium can also do the translation at the packet level, which indeed means it happens for every packet. However, that only happens in bpf_lxc, so once the packet leaves the container, it is already rewritten to carry the podIP as its destination address.
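Purely to illustrate the difference (this is a conceptual Go sketch with made-up names, not the datapath code):

```go
package main

import "fmt"

// With SockLB the clusterIP -> podIP translation happens once per
// connection, at connect() time; without it, the translation is applied to
// every packet leaving the container.

type packet struct{ dst string }

// socket-level LB: pick a backend once; all packets already carry the podIP.
func connectSockLB(clusterIP string, backends []string) string {
	backend := backends[0] // real code would load-balance; kept trivial here
	fmt.Printf("connect(%s) rewritten to %s once\n", clusterIP, backend)
	return backend
}

// per-packet LB: every packet destined to the clusterIP gets rewritten.
func translatePerPacket(p packet, clusterIP string, backends []string) packet {
	if p.dst == clusterIP {
		p.dst = backends[0] // lookup + rewrite for each packet
	}
	return p
}

func main() {
	clusterIP := "10.96.200.135" // example clusterIP from this issue
	backends := []string{"10.0.0.141"}

	dst := connectSockLB(clusterIP, backends)
	fmt.Println("all subsequent packets use", dst)

	p := translatePerPacket(packet{dst: clusterIP}, clusterIP, backends)
	fmt.Println("per-packet path rewrote destination to", p.dst)
}
```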
> I assume we also have no way to inform cilium the traffic's original destination was the clusterIP after it's been translated to the podIP? That feels a bit like connection tracking, which I believe cilium does already, so I'm curious if this is a performance trade off, or a complexity trade off, or just not possible with how we implement it.
So yes, to be able to perform reverse NAT for reply packets, we do maintain a NAT table (with SockLB) or a CT table (without SockLB) that records whether a connection was NATed. While in theory we could perform a lookup in that table at every trace point, we currently don't, because it's not necessary for the core tasks of the datapath (i.e. policy enforcement, load balancing, encryption), and every additional map lookup incurs a per-packet overhead.
But there is also a more fundamental limitation when it comes to cross-node traffic: the above tables are local to the node where the flow originated. Once a packet is NATed (i.e. the destination clusterIP has been replaced with a destination podIP) and sent to another node, all the remote node sees is the podIP. The remote node does not have access to the NAT tables used to rewrite the packet, and thus cannot check whether that particular packet was ever NATed. The remote node cannot know the original destination IP (unless we introduce some form of packet encapsulation or something similar).
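As a rough illustration of what such a per-trace-point reverse lookup would involve, and why a remote node has nothing to look up, here is a conceptual Go sketch (not BPF code; the tuple layout and table contents are invented):

```go
package main

import "fmt"

// A connection-tracking style table keyed by the connection tuple, whose
// entry remembers the original service address. Consulting it at every
// trace point would mean one extra map lookup per packet.

type tuple struct {
	srcIP, dstIP     string
	srcPort, dstPort uint16
	proto            string
}

type natEntry struct{ origDstIP string } // the clusterIP before translation

var revNAT = map[tuple]natEntry{
	// Entry created when the connection was first load-balanced on this node.
	{"10.0.0.7", "10.0.0.141", 34567, 9200, "TCP"}: {origDstIP: "10.96.200.135"},
}

// annotate shows what a hypothetical per-trace-point lookup could report:
// if the tuple is known locally, the original clusterIP can be recovered.
func annotate(t tuple) string {
	if e, ok := revNAT[t]; ok {
		return fmt.Sprintf("%s (was %s)", t.dstIP, e.origDstIP)
	}
	return t.dstIP // a remote node never has the entry, so nothing to recover
}

func main() {
	t := tuple{"10.0.0.7", "10.0.0.141", 34567, 9200, "TCP"}
	fmt.Println("destination:", annotate(t))
}
```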
@gandro I think when it comes to cross-node, that's primarily going to affect the `source_service`, which conceptually never made sense to me anyway, so I think that's relatively acceptable. Based on what you said, I think `destination_service` would work as expected without SockLB, and could work with it as well, but currently the "works sometimes" behavior makes it pretty unusable.

Perhaps short-term we should just document this limitation (if it isn't already documented), and in the future we can revisit with SockLB so the `destination_service` metadata is always set correctly.
> @gandro I think when it comes to cross-node, that's primarily going to affect the `source_service`, which conceptually never made sense to me anyway, so I think that's relatively acceptable.
I'm not sure I follow. Imagine the following chain of events, where xwing-pod-1 is running on node1 and deathstar-pod-2 is running on node2.
```
[k8s-node1] xwing-pod-1 -> deathstar-service (pre-translation)  // destination_service is set
[k8s-node1] xwing-pod-1 -> deathstar-pod-2 (post-translation)   // destination_service is empty, but could technically be recovered
[k8s-node1] xwing-pod-1 -> deathstar-pod-2 (from-endpoint)      // destination_service is empty, but could technically be recovered
[k8s-node1] xwing-pod-1 -> deathstar-pod-2 (to-stack)           // routed to node2, otherwise same as above
[k8s-node2] xwing-pod-1 -> deathstar-pod-2 (from-stack)         // arriving at node2, destination_service is empty, and _not_ recoverable
[k8s-node2] xwing-pod-1 -> deathstar-pod-2 (to-endpoint)        // destination_service is empty, and _not_ recoverable
[k8s-node2] deathstar-pod-2 -> xwing-pod-1 (from-endpoint)      // reply packet, now the IPs are swapped and no service IP is involved yet
[k8s-node2] deathstar-pod-2 -> xwing-pod-1 (to-stack)           // reply packet, same as above
[k8s-node1] deathstar-pod-2 -> xwing-pod-1 (from-stack)         // source_service is empty, but could technically be recovered
[k8s-node1] deathstar-pod-2 -> xwing-pod-1 (to-endpoint)        // source_service is empty, but could technically be recovered
[k8s-node1] deathstar-pod-2 -> xwing-pod-1 (pre-translation)    // source_service is empty, but could technically be recovered
[k8s-node1] deathstar-service -> xwing-pod-1 (post-translation) // source_service is set
```
I think this should demonstrate that k8s-node1 could probably recover the original service field (albeit at a high performance cost), but k8s-node2 cannot, because it never saw the NAT happening and thus can only guess whether deathstar-pod-2 was accessed via podIP, clusterIP (it could have multiple), or maybe even NodePort.
It also demonstrates that `source_service` is used for regular traffic. It is technically the destination service of the connection, but since Hubble has a per-packet view for trace events, it is the `source_service` of the event: the clusterIP is the source IP of the reply packet when it is delivered to the xwing application.
Ah right, I was thinking only about egress on the source node, not ingress on the destination, when it came to `destination_service`.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Is there an existing issue for this?
What happened?
When looking at Hubble flows, I only see `source_service` and `destination_service` set on flows destined to a service sometimes. It seems that the source/destination IP is sometimes the clusterIP and sometimes the podIP, even though all traffic to the pod goes through the clusterIP service. Given that Hubble uses the IP and port to look up the underlying service, and the source/destination IP is only sometimes the clusterIP, the missing service names make sense.
The behavior is relatively predictable as well. The clusterIP shows up in the flows right after everything is started/created, or whenever I restart the backend pod. This leads me to believe that perhaps the clusterIP is used during initial connections, and some of the future flows are using the podIP. Or something of that sort.
Here are two flows from the same source to the same destination to illustrate the problem.

Flow with the clusterIP as the destination IP (`10.96.200.135`) and `destination_service` (`elasticsearch-master`) correctly set:

Flow with the elasticsearch podIP (`10.0.0.141`) instead of the clusterIP, and thus a missing `destination_service`:
Cilium Version
```
Client: 1.12.1 4c9a630 2022-08-15T16:29:39-07:00 go version go1.18.5 linux/arm64
Daemon: 1.12.1 4c9a630 2022-08-15T16:29:39-07:00 go version go1.18.5 linux/arm64
```
Kernel Version
```
Linux lima-docker 5.15.0-47-generic #51-Ubuntu SMP Fri Aug 12 08:18:32 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux
```
Kubernetes Version
```
Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.1", GitCommit:"3ddd0f45aa91e2f30c70734b175631bec5b5825a", GitTreeState:"clean", BuildDate:"2022-05-24T12:26:19Z", GoVersion:"go1.18.2", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.0", GitCommit:"4ce5a8954017644c5420bae81d72b09b735c21f0", GitTreeState:"clean", BuildDate:"2022-05-19T15:42:59Z", GoVersion:"go1.18.1", Compiler:"gc", Platform:"linux/arm64"}
```
Sysdump
cilium-sysdump-20220908-145348.zip
Relevant log output
No response
Anything else?
This also happens when using kube-proxy replacement (KPR). I retested with KPR disabled and it still happened.
Code of Conduct