cilium / image-tools

Dockerfiles for cilium-runtime and cilium-builder dependencies
Other
18 stars 29 forks source link

ubuntu: Bump to latest LTS 24.04 #281

Closed sayboras closed 4 months ago

sayboras commented 5 months ago

Testing was done in https://github.com/cilium/cilium/pull/33264

lmb commented 4 months ago

Do you have output for the test failure?

sayboras commented 4 months ago

Do you have output for the test failure?

Yes, it can be found here https://github.com/cilium/cilium/actions/runs/9589413585

lmb commented 4 months ago

What is the ubuntu base image used for? It's pretty bad if the verifier test fails with a complexity error, so I'm hesitant to just keep the old version to "fix" it. cc @ti-mo seems like maybe we're using a different clang now? i was under the impression that we bake in our own clang?

sayboras commented 4 months ago

What is the ubuntu base image used for? It's pretty bad if the verifier test fails with a complexity error, so I'm hesitant to just keep the old version to "fix" it. cc @ti-mo seems like maybe we're using a different clang now? i was under the impression that we bake in our own clang?

The failure is actually due to https://github.com/cilium/image-tools/pull/279. I think the best way is to revert llvm version bump, and proceed with ubuntu upgrade.

dylandreimerink commented 4 months ago

The failure is actually due to https://github.com/cilium/image-tools/pull/279. I think the best way is to revert llvm version bump, and proceed with ubuntu upgrade.

That sucks, I don't like the idea of blindly revering the compiler version, since quite a lot of time has gone into that upgrade. I would like to involve @gentoo-root first before we do that revert.

sayboras commented 4 months ago

The failure is actually due to #279. I think the best way is to revert llvm version bump, and proceed with ubuntu upgrade.

That sucks, I don't like the idea of blindly revering the compiler version, since quite a lot of time has gone into that upgrade. I would like to involve @gentoo-root first before we do that revert.

Yeah, after I found out about clang version, knowing Max's good work, I would have expected that Max was trying to upgrade in cilium/cilium as per below (which is having same failures as my testing). IMO, it's pretty safe to revert LLVM 1.18 commit, and tackle the verifer issue later.

https://github.com/cilium/cilium/pull/32816

auriaave commented 4 months ago

Since a previous LLVM version seems to work, I'm not sure about this report: it suggests that MAX_USED_MAPS and MAX_USED_BTFS limits are now set to 64 (e.g. on kernel 5.15).

link to change in the bpf verifier

sayboras commented 4 months ago

Since a previous LLVM version seems to work, I'm not sure about this report: it suggests that MAX_USED_MAPS and MAX_USED_BTFS limits are now set to 64 (e.g. on kernel 5.15).

link to change in the bpf verifier

We actually have the error log >1mil statements. Additionally, net-next is actually passed, only 5.10, 5.15 and 6.1 failed.

I did a little bit investigation, it seems to be related to ENABLE_IPV6 flag in cilium bpf code, which triggers some complexity issue.

Full error log can be found in the below GHA. https://github.com/cilium/cilium/actions/runs/9544509806

auriaave commented 4 months ago

I looked for similar issues in the tracker and this is what I found:

Interestingly, TestVerifier/bpf_overlay/1 also has ENABLE_IPV6=1 (and the masquerade flag) but it changes with respect to bpf_host/1 in:

joestringer commented 4 months ago

I created https://github.com/cilium/image-tools/tree/ubuntu-22.04 to start the fork from before we merge this, so future fixes for Cilium v1.16 or older can go into that branch. From v1.17 onwards we can start adopting this.

sayboras commented 4 months ago

I looked for similar issues in the tracker and this is what I found:

* [IPv6 datapath fails to load with Maglev and host firewall due to BPF program size limit cilium#14047](https://github.com/cilium/cilium/issues/14047)

* [[arm64] BGP Control Plane + Native Routing + LB-IPAM results in BPF program too large error (1.15-dev) cilium#28651](https://github.com/cilium/cilium/issues/28651)

Interestingly, TestVerifier/bpf_overlay/1 also has ENABLE_IPV6=1 (and the masquerade flag) but it changes with respect to bpf_host/1 in:

* `MAX_HOST_OPTIONS` vs `MAX_OVERLAY_OPTIONS`

* in the overlay config, `LB_SELECTION=1` and `LB_SELECTION_MAGLEV=1`

Thanks for your input, I would defer this investigation to datapath experts (e.g. @gentoo-root or other datapath team member), as it might take more time for newbie like myself :sweat_smile:.

PS: Kudos to LVH team with awesome steps, local feedback loop is superb.