cilium / cilium-service-mesh-beta

Instructions and issue tracking for Service Mesh capabilities of Cilium
Apache License 2.0

Installation steps using helm #14

Open · rverma-dev opened this issue 2 years ago

rverma-dev commented 2 years ago

Is there an existing issue for this?

What happened?

  1. On DigitalOcean, tried to install with
    cilium install --version -service-mesh:v1.11.0-beta.1 --config enable-envoy-config=true --kube-proxy-replacement=probe 

    but got errors like the following (a quick diagnostic sketch is included after this list):

    controller endpoint-769-regeneration-recovery is failing since 37s (24x): regeneration recovery failed
  2. Also tried cilium uninstall followed by a plain cilium install --kube-proxy-replacement=probe, but that gave the same error.
  3. Then tried simply
    helm install cilium cilium/cilium \
    --version 1.11.0 \
    --namespace kube-system

    and this went fine.
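A minimal sketch of how the failing state can be inspected (assuming the cilium-cli used above and the agent DaemonSet running in kube-system; the DaemonSet name ds/cilium is the chart default and an assumption here):

# Overall install/agent health as seen by the cilium-cli
cilium status --wait

# Recent agent logs, where the regeneration errors show up
kubectl -n kube-system logs ds/cilium --tail=100

# Endpoint state from inside one agent pod
kubectl -n kube-system exec ds/cilium -- cilium endpoint list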

Cilium Version

1.11.0

Kernel Version

NA

Kubernetes Version

1.21.5

Sysdump

[Uploading cilium-sysdump-20220110-102642.zip…]()

Relevant log output

No response

Anything else?

No response

Code of Conduct

pchaigno commented 2 years ago

Thanks for the report!

Could you try to upload the Cilium sysdump again? It seems you submitted the issue before uploading was finished.

rverma-dev commented 2 years ago

Trial 2: EKS

Tried installing with Helm:

helm upgrade --install cilium cilium/cilium --version=1.11.0 \
             --namespace kube-system --set eni.enabled=true \
             --set ipam.mode=eni --set egressMasqueradeInterfaces=eth0 \
             --set loadBalancer.algorithm=maglev --set hubble.enabled=true  \
             --set hubble.relay.enabled=true --set hubble.ui.enabled=false \
             --set hubble.metrics.enabled="{dns,drop,tcp,flow,port-distribution,icmp,http}" \
             --set kubeProxyReplacement="strict" \
             --set k8sServiceHost=$API_SERVER_IP --set k8sServicePort=443 \
             --set-string extraConfig.enable-envoy-config="true" \
             --set image.repository=quay.io/cilium/cilium-service-mesh \
             --set image.tag=v1.11.0-beta.1 \
             --set image.useDigest=false \
             --set operator.image.suffix=-service-mesh \
             --set operator.image.useDigest=false \
             --set operator.replicas=1 \
             --set operator.image.tag=v1.11.0-beta.1

Got the errors below. The Helm chart seems to need an RBAC update, and there appear to be other BPF issues too (a sketch for confirming the RBAC gap follows the log excerpt).

level=error msg="Command execution failed" cmd="[tc filter replace dev cilium_host ingress prio 1 handle 1 bpf da obj 1979_next/bpf_host.o sec to-host]" error="exit status 1" subsys=datapath-loader
level=warning msg="libbpf: couldn't reuse pinned map at '/sys/fs/bpf/tc//globals/cilium_calls_hostns_01979': parameter mismatch" subsys=datapath-loader
level=warning msg="libbpf: map 'cilium_calls_hostns_01979': error reusing pinned map" subsys=datapath-loader
level=warning msg="libbpf: map 'cilium_calls_hostns_01979': failed to create: Invalid argument(-22)" subsys=datapath-loader
level=warning msg="libbpf: failed to load object '1979_next/bpf_host.o'" subsys=datapath-loader
level=warning msg="Unable to load program" subsys=datapath-loader
level=warning msg="JoinEP: Failed to load program for host endpoint (to-host)" containerID= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=1979 error="Failed to load prog with tc: exit status 1" file-path=1979_next/bpf_host.o identity=1 ipv4= ipv6= k8sPodName=/ subsys=datapath-loader veth=cilium_host
level=error msg="Error while rewriting endpoint BPF program" containerID= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=1979 error="Failed to load prog with tc: exit status 1" identity=1 ipv4= ipv6= k8sPodName=/ subsys=endpoint
level=warning msg="generating BPF for endpoint failed, keeping stale directory." containerID= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=1979 file-path=1979_next_fail identity=1 ipv4= ipv6= k8sPodName=/ subsys=endpoint
level=warning msg="Regeneration of endpoint failed" bpfCompilation=0s bpfLoadProg=40.842791ms bpfWaitForELF="3.806µs" bpfWriteELF="697.761µs" containerID= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=1979 error="Failed to load prog with tc: exit status 1" identity=1 ipv4= ipv6= k8sPodName=/ mapSync="2.285µs" policyCalculation="3.206µs" prepareBuild="623.979µs" proxyConfiguration="7.414µs" proxyPolicyCalculation="2.816µs" proxyWaitForAck=0s reason="retrying regeneration" subsys=endpoint total=43.733597ms waitingForCTClean=201ns waitingForLock=773ns
level=error msg="endpoint regeneration failed" containerID= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=1979 error="Failed to load prog with tc: exit status 1" identity=1 ipv4= ipv6= k8sPodName=/ subsys=endpoint
level=warning msg="github.com/cilium/cilium/pkg/k8s/watchers/cilium_clusterwide_network_policy.go:93: failed to list *v2.CiliumClusterwideNetworkPolicy: ciliumclusterwidenetworkpolicies.cilium.io is forbidden: User \"system:serviceaccount:kube-system:cilium\" cannot list resource \"ciliumclusterwidenetworkpolicies\" in API group \"cilium.io\" at the cluster scope" subsys=klog
level=error msg=k8sError error="github.com/cilium/cilium/pkg/k8s/watchers/cilium_clusterwide_network_policy.go:93: Failed to watch *v2.CiliumClusterwideNetworkPolicy: failed to list *v2.CiliumClusterwideNetworkPolicy: ciliumclusterwidenetworkpolicies.cilium.io is forbidden: User \"system:serviceaccount:kube-system:cilium\" cannot list resource \"ciliumclusterwidenetworkpolicies\" in API group \"cilium.io\" at the cluster scope" subsys=k8s
level=warning msg="Unable to update CiliumNode custom resource" error="ciliumnodes.cilium.io \"ip-192-168-113-75.ec2.internal\" is forbidden: User \"system:serviceaccount:kube-system:cilium\" cannot update resource \"ciliumnodes/status\" in API group \"cilium.io\" at the cluster scope" subsys=ipam
level=warning msg="github.com/cilium/cilium/pkg/k8s/watchers/endpoint_slice.go:143: failed to list *v1.EndpointSlice: endpointslices.discovery.k8s.io is forbidden: User \"system:serviceaccount:kube-system:cilium\" cannot list resource \"endpointslices\" in API group \"discovery.k8s.io\" at the cluster scope" subsys=klog
level=error msg=k8sError error="github.com/cilium/cilium/pkg/k8s/watchers/endpoint_slice.go:143: Failed to watch *v1.EndpointSlice: failed to list *v1.EndpointSlice: endpointslices.discovery.k8s.io is forbidden: User \"system:serviceaccount:kube-system:cilium\" cannot list resource \"endpointslices\" in API group \"discovery.k8s.io\" at the cluster scope" subsys=k8s
level=error msg="Command execution failed" cmd="[tc filter replace dev cilium_host ingress prio 1 handle 1 bpf da obj 1979_next/bpf_host.o sec to-host]" error="exit status 1" subsys=datapath-loader
level=warning msg="libbpf: couldn't reuse pinned map at '/sys/fs/bpf/tc//globals/cilium_calls_hostns_01979': parameter mismatch" subsys=datapath-loader
level=warning msg="libbpf: map 'cilium_calls_hostns_01979': error reusing pinned map" subsys=datapath-loader
level=warning msg="libbpf: map 'cilium_calls_hostns_01979': failed to create: Invalid argument(-22)" subsys=datapath-loader
level=warning msg="libbpf: failed to load object '1979_next/bpf_host.o'" subsys=datapath-loader
level=warning msg="Unable to load program" subsys=datapath-loader
level=warning msg="JoinEP: Failed to load program for host endpoint (to-host)" containerID= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=1979 error="Failed to load prog with tc: exit status 1" file-path=1979_next/bpf_host.o identity=1 ipv4= ipv6= k8sPodName=/ subsys=datapath-loader veth=cilium_host
level=error msg="Error while rewriting endpoint BPF program" containerID= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=1979 error="Failed to load prog with tc: exit status 1" identity=1 ipv4= ipv6= k8sPodName=/ subsys=endpoint
level=warning msg="generating BPF for endpoint failed, keeping stale directory." containerID= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=1979 file-path=1979_next_fail identity=1 ipv4= ipv6= k8sPodName=/ subsys=endpoint
level=warning msg="Regeneration of endpoint failed" bpfCompilation=0s bpfLoadProg=55.841988ms bpfWaitForELF="4.595µs" bpfWriteELF="745.463µs" containerID= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=1979 error="Failed to load prog with tc: exit status 1" identity=1 ipv4= ipv6= k8sPodName=/ mapSync="2.409µs" policyCalculation="6.682µs" prepareBuild="598.265µs" proxyConfiguration="7.357µs" proxyPolicyCalculation="2.824µs" proxyWaitForAck=0s reason="retrying regeneration" subsys=endpoint total=60.28447ms waitingForCTClean=197ns waitingForLock=836ns
level=error msg="endpoint regeneration failed" containerID= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=1979 error="Failed to load prog with tc: exit status 1" identity=1 ipv4= ipv6= k8sPodName=/ subsys=endpoint
level=error msg="Command execution failed" cmd="[tc filter replace dev cilium_host ingress prio 1 handle 1 bpf da obj 1979_next/bpf_host.o sec to-host]" error="exit status 1" subsys=datapath-loader
level=warning msg="libbpf: couldn't reuse pinned map at '/sys/fs/bpf/tc//globals/cilium_calls_hostns_01979': parameter mismatch" subsys=datapath-loader
level=warning msg="libbpf: map 'cilium_calls_hostns_01979': error reusing pinned map" subsys=datapath-loader
level=warning msg="libbpf: map 'cilium_calls_hostns_01979': failed to create: Invalid argument(-22)" subsys=datapath-loader
level=warning msg="libbpf: failed to load object '1979_next/bpf_host.o'" subsys=datapath-loader
level=warning msg="Unable to load program" subsys=datapath-loader
level=warning msg="JoinEP: Failed to load program for host endpoint (to-host)" containerID= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=1979 error="Failed to load prog with tc: exit status 1" file-path=1979_next/bpf_host.o identity=1 ipv4= ipv6= k8sPodName=/ subsys=datapath-loader veth=cilium_host
level=error msg="Error while rewriting endpoint BPF program" containerID= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=1979 error="Failed to load prog with tc: exit status 1" identity=1 ipv4= ipv6= k8sPodName=/ subsys=endpoint
level=warning msg="generating BPF for endpoint failed, keeping stale directory." containerID= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=1979 file-path=1979_next_fail identity=1 ipv4= ipv6= k8sPodName=/ subsys=endpoint
level=warning msg="Regeneration of endpoint failed" bpfCompilation=0s bpfLoadProg=62.441743ms bpfWaitForELF="4.154µs" bpfWriteELF="840.205µs" containerID= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=1979 error="Failed to load prog with tc: exit status 1" identity=1 ipv4= ipv6= k8sPodName=/ mapSync="2.732µs" policyCalculation="3.232µs" prepareBuild="795.148µs" proxyConfiguration="7.916µs" proxyPolicyCalculation="3.243µs" proxyWaitForAck=0s reason="retrying regeneration" subsys=endpoint total=66.318589ms waitingForCTClean=208ns waitingForLock="1.053µs"
level=error msg="endpoint regeneration failed" containerID= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=1979 error="Failed to load prog with tc: exit status 1" identity=1 ipv4= ipv6= k8sPodName=/ subsys=endpoint
sealneaward commented 2 years ago

I received the same errors in both installation methods on a 1.21 EKS cluster.

ghouscht commented 2 years ago

Same issue here on a bare-metal cluster. @pchaigno did you get a sysdump? If not I can share one privately with you.

pchaigno commented 2 years ago

@ghouscht I haven't received a sysdump yet. If you could share one, that would help: it would allow us to confirm this is a complexity issue caused by the lack of kernel support for kube-proxy replacement (KPR). I'm pchaigno on Slack as well.
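
For reference, a minimal sketch of how the kube-proxy-replacement probe results can be checked directly on an agent (this only surfaces information the sysdump would also contain; ds/cilium in kube-system is assumed):

# The KubeProxyReplacement section lists which features the kernel probe enabled
kubectl -n kube-system exec ds/cilium -- cilium status --verbose

# Kernel version on the node, to compare against Cilium's requirements
kubectl -n kube-system exec ds/cilium -- uname -r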

gkjsa commented 2 years ago

Same for me on Azure:

cilium install \
    --context xxxxxx \
    --cluster-name xxxxxx \
    --cluster-id 1 \
    --azure-resource-group xxxxxx \
    --azure-subscription-id xxxxx \
    --azure-client-id xxxxx \
    --azure-client-secret xxxxxx \
    --azure-tenant-id xxxxxx \
    --version -service-mesh:v1.11.0-beta.1 \
    --config enable-envoy-config=true \
    --kube-proxy-replacement=probe

results in

cilium-qj48r cilium-agent level=error msg="Command execution failed" cmd="[tc filter replace dev cilium_host ingress prio 1 handle 1 bpf da obj 3826_next/bpf_host.o sec to-host]" error="exit status 1" subsys=datapath-loader
cilium-qj48r cilium-agent level=warning msg="libbpf: couldn't reuse pinned map at '/sys/fs/bpf/tc//globals/cilium_calls_hostns_03826': parameter mismatch" subsys=datapath-loader
cilium-qj48r cilium-agent level=warning msg="libbpf: map 'cilium_calls_hostns_03826': error reusing pinned map" subsys=datapath-loader
cilium-qj48r cilium-agent level=warning msg="libbpf: map 'cilium_calls_hostns_03826': failed to create: Invalid argument(-22)" subsys=datapath-loader
cilium-qj48r cilium-agent level=warning msg="libbpf: failed to load object '3826_next/bpf_host.o'" subsys=datapath-loader
cilium-qj48r cilium-agent level=warning msg="Unable to load program" subsys=datapath-loader
cilium-qj48r cilium-agent level=warning msg="JoinEP: Failed to load program for host endpoint (to-host)" containerID= datapathPolicyRevision=0 desiredPolicyRevision=2 endpointID=3826 error="Failed to load prog with tc: exit status 1" file-path=3826_next/bpf_host.o identity=1 ipv4= ipv6= k8sPodName=/ subsys=datapath-loader veth=cilium_host
cilium-qj48r cilium-agent level=error msg="Error while rewriting endpoint BPF program" containerID= datapathPolicyRevision=0 desiredPolicyRevision=2 endpointID=3826 error="Failed to load prog with tc: exit status 1" identity=1 ipv4= ipv6= k8sPodName=/ subsys=endpoint
cilium-qj48r cilium-agent level=warning msg="generating BPF for endpoint failed, keeping stale directory." containerID= datapathPolicyRevision=0 desiredPolicyRevision=2 endpointID=3826 file-path=3826_next_fail identity=1 ipv4= ipv6= k8sPodName=/ subsys=endpoint
cilium-qj48r cilium-agent level=warning msg="Regeneration of endpoint failed" bpfCompilation=0s bpfLoadProg=35.925896ms bpfWaitForELF="3.8µs" bpfWriteELF="769.615µs" containerID= datapathPolicyRevision=0 desiredPolicyRevision=2 endpointID=3826 error="Failed to load prog with tc: exit status 1" identity=1 ipv4= ipv6= k8sPodName=/ mapSync="2.4µs" policyCalculation="3.8µs" prepareBuild="563.111µs" proxyConfiguration="9.301µs" proxyPolicyCalculation="4µs" proxyWaitForAck=0s reason="retrying regeneration" subsys=endpoint total=38.809751ms waitingForCTClean=300ns waitingForLock=900ns
cilium-qj48r cilium-agent level=error msg="endpoint regeneration failed" containerID= datapathPolicyRevision=0 desiredPolicyRevision=2 endpointID=3826 error="Failed to load prog with tc: exit status 1" identity=1 ipv4= ipv6= k8sPodName=/ subsys=endpoint

A previous installation with Cilium 1.11.1 went fine on the same cluster (AKS 1.21.7).
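
For what it's worth, the "parameter mismatch" on the pinned map suggests leftover BPF state from the earlier install. A minimal sketch of how the stale pin could be inspected and cleared from a root shell on the affected node (the pin path is copied from the log above; removing it and restarting the agents is an assumption, not a verified fix):

# Show the spec of the map the loader refuses to reuse
bpftool map show pinned /sys/fs/bpf/tc/globals/cilium_calls_hostns_03826

# If its parameters no longer match what the new image expects, drop the pin
# so the loader can recreate it, then restart the agent pods
rm /sys/fs/bpf/tc/globals/cilium_calls_hostns_03826
kubectl -n kube-system delete pod -l k8s-app=cilium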

pchaigno commented 2 years ago

cc @jrajahalme

gkjsa commented 2 years ago

I assume this is because I'm reusing AKS clusters that were previously in clustermesh mode. Clustermesh has since been disabled, but some settings probably still exist.

Jiang1155 commented 2 years ago

I had a similar issue. I got the following errors:

2022-06-25T00:32:12.685918484Z level=error msg="Command execution failed" cmd="[ip -force link set dev eth0 xdpgeneric obj /var/run/cilium/state/bpf_xdp.o sec from-netdev]" error="exit status 255" subsys=datapath-loader
2022-06-25T00:32:12.685965305Z level=warning msg="libbpf: couldn't reuse pinned map at '/sys/fs/bpf/xdp//globals/cilium_calls_xdp': parameter mismatch" subsys=datapath-loader
2022-06-25T00:32:12.685972334Z level=warning msg="libbpf: map 'cilium_calls_xdp': error reusing pinned map" subsys=datapath-loader
2022-06-25T00:32:12.685977404Z level=warning msg="libbpf: map 'cilium_calls_xdp': failed to create: Invalid argument(-22)" subsys=datapath-loader
2022-06-25T00:32:12.685981737Z level=warning msg="libbpf: failed to load object '/var/run/cilium/state/bpf_xdp.o'" subsys=datapath-loader
2022-06-25T00:32:12.694436571Z level=fatal msg="Failed to compile XDP program" error="Failed to load prog with ip: exit status 255" subsys=datapath-loader
2022-06-25T00:32:14.062967388Z level=info msg="regenerating all endpoints" reason="kube-apiserver identity updated" subsys=endpoint-manager

This happened when I downgraded Cilium from a newer version to the older v1.11.1, and only when XDP was enabled via bpf-lb-acceleration: testing-only.

I have two nodes. I reloaded one node and it recovered from the error. I tried to get a sysdump (I guess it's now called debuginfo?), but I could only get it from the recovered Cilium pod; for the crashing one I can't, since it keeps crashing. I uploaded the file here anyway.

para-mismatch.log
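
In case it helps, a minimal sketch of how the stale XDP state might be cleared on the stuck node without a full reload (run as root on that node; this is an assumption based on the log, not a verified workaround):

# Detach the generic-XDP program the loader keeps failing to replace
ip link set dev eth0 xdpgeneric off

# Remove the pinned map whose parameters no longer match after the downgrade
rm /sys/fs/bpf/xdp/globals/cilium_calls_xdp

# Restart the Cilium agent on this node so it re-attaches with fresh state
kubectl -n kube-system delete pod -l k8s-app=cilium \
  --field-selector spec.nodeName=<node-name>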