Closed Cellebyte closed 2 years ago
@ysksuzuki are you able to reproduce it?
@Cellebyte I'm sorry for keeping you waiting. I will check this next week.
@ysksuzuki not dramatic ^^ I‘am currently running with the QuickFix from the PR #209
@ysksuzuki are you now able to reproduce it? :D
@Cellebyte Actually, I can't. The dual-stack address pool is working fine in my kind cluster where ipFamily is dual.
apiVersion: kind.x-k8s.io/v1alpha4
kind: Cluster
networking:
ipFamily: dual
disableDefaultCNI: true
nodes:
- role: control-plane
- role: worker
- role: worker
- role: worker
Could you elaborate on why https://github.com/vishvananda/netlink/issues/576 could be related to this dual-stack problem, and the PR you sent us can be a solution?
@ysksuzuki as the issue creator elaborates the not specifically set cidr mask could lead to issues when assigning dual-stack ips via netlink as it wrongly guesses the subnetmask when you change the order of ip addresses to ipv6 first. It tries to squeeze v6 into v4 because we only pass the ip without a cidrmask to netlink. And netlink at least in el8 based distros will fail with the error messages above because the attribute size could get to big.
@Cellebyte
the not specifically set cidr mask could lead to issues when assigning dual-stack ips via netlink as it wrongly guesses the subnetmask when you change the order of ip addresses to ipv6 first.
The author of the https://github.com/vishvananda/netlink/issues/576 reports that the mask passed into an address in 16-byte format derived from ParseIP causes the problem, and users can work around it by Mask: net.IPMask(net.ParseIP("255.255.255.0").To4())
. He doesn't mention what you are saying if I understand it correctly.
Even if that causes the problem, netlink.NewIPNet which coil calls when it adds an IP to a link device sets a cidr mask accordingly. So I don't think that can cause the problem.
https://github.com/cybozu-go/coil/blob/4766966b5fbdd603e4be607c7db94047e05eac69/v2/pkg/nodenet/pod.go#L331 https://github.com/vishvananda/netlink/blob/5e915e0149386ce3d02379ff93f4c0a5601779d5/netlink.go#L35
As for the PR you sent, I think your patch doesn't have any effects on IPs coil allocates. netutil.IPAdd returns an IP in 4-byte representation for IPv4 and in 16-byte representation for IPv6. Therefore ipv4.Mask(net.CIDRMask(32, 32))
and ipv6.Mask(net.CIDRMask(128, 128))
won't modify those IPs.
https://github.com/cybozu-go/netutil/blob/7b3ee6fd8aa95d6611bb03d0ef175cac2287f15f/calc.go#L11
It is not modifying the IPs but setting the correct mask on older netlink implementations. Try a DualStack pool on an AlmaLinux System and you will see what I mean.
I mean ipv4.Mask(net.CIDRMask(32, 32))
and ipv6.Mask(net.CIDRMask(128, 128))
won't have any effects on the IPs after netutil.IPAdd
. That always returns the same slice of bytes. Please elaborate on why/how this patch setting the correct mask on older netlink implementations
, e.g. show us example code.
@ysksuzuki the byte array length is not the issue. The mask itself is the problem because as the error message points out netlink things it is a smaller bytearray then it actually is. So it breaks in the C API. Will check if I find the corresponding code line in netlink.
Just to be clear, you could confirm that your patch fixed this problem in your environment, right? As I mentioned above, ipv4.Mask(net.CIDRMask(32, 32))
and ipv6.Mask(net.CIDRMask(128, 128))
won't have any effects on the IPs. Those always return the exact same slices of bytes. How can those affect the masks?
@ysksuzuki Yep I can confirm that it works now. I do not know. I think it is something outside the Go Code itself.
Does the problem occur without the patch? I can't imagine something outside the Go would fix a problem.
@ysksuzuki without the patch I am not able to create any veth interfaces for my Pod.
netlink: 'coild': attribute type 2 has an invalid length.
is the error message.
That's weird...
@Cellebyte Are you running on a non x86 system such as ARM64?
@ymmt2005 nope x86_64 vm AlmaLinux8
Hmm. I have no clue. The proposed patch seems a no-op, as Yusuke is saying.
I guessed that you could fix the problem because you rebuilt the Coil image.
@ymmt2005 as the issue from the netlink library shows it could happen that it treats the IPAddress without a mask as /0 which could implement wrong routes.
@Cellebyte That issue does not apply to Coil.
I'm aware of it, and Coil gives a proper netmask to each invocation of netlink.AddrAdd
or netlink.RouteAdd
.
@ymmt2005 the question still remains. Why I was not able to use the current upstream version of coil with my AlmaLinux 8 Servers ^^
@Cellebyte All I can say for now is that there is nothing to do with https://github.com/vishvananda/netlink/issues/576 and your patch itself doesn't fix this problem. You might be able to confirm it, e.g. try a rebuilt coil without the patch.
Can you investigate what is actually happening in your environment? It would be helpful if you could do that since we don't normally use Alma Linux.
@ysksuzuki yeah I could try rebuilding the image without the patch.
@ysksuzuki you were right. I had now time to further debug the bug and it seems like after I disabled CGO I am able to successfully deploy pods without any issue.
@ysksuzuki I found the issue in coil, can we merge the Pull Request?
@Cellebyte Thank you for investigating it. Could you recreate the PR or squash and tidy up the commits? And I would appreciate it if you could share why disabling CGO fixed this problem.
@ysksuzuki I added a little comment on the PR.
This bug could be related to the following issue. https://github.com/vishvananda/netlink/issues/576
coil: ghcr.io/cybozu-go/coil:2.0.14 Kubernetes Version: 1.21.11 Container Runtime: cri-o 1.21.6 Linux OS: AlmaLinux 4.18.0-348.20.1.el8_5.x86_64
Namespace with requirement for DualStack Pods
The code lines which could be the issue are here.
https://github.com/cybozu-go/coil/blob/1db9b2d01220d2d86ee43a8ce7e0f9cf35e0b313/v2/pkg/ipam/node.go#L289-L312