Closed darkhonor closed 1 month ago
Interesting. Let me test later today. It is possible that the Rancher Helm chart is borked. There is a possible workaround with a registries.yaml on each of the nodes. Let me check the chart first.
To maybe help, here are some of my configs that I'm using with the airgap registry:
/etc/rancher/rke2/registries.yaml:
mirrors:
  "*":
    endpoint:
      - https://harbor.kten.test/library
On the Hauler system, here's the Hauler config I created to get TLS working. Again, if I didn't have this, RKE2 and cert-manager would deploy with no issues, but Rancher wouldn't.
version: 0.1
log:
  fields:
    service: registry
storage:
  cache:
    blobdescriptor: inmemory
  filesystem:
    rootdirectory: /var/lib/registry
http:
  addr: 192.168.51.99:5000
  net: tcp
  host: https://192.168.51.99:5000
  secret: NotAGr3@tS3creT!
  relativeurls: false
  tls:
    certificate: /etc/registry/registry.crt
    key: /etc/registry/registry.key
  headers:
    X-Content-Type-Options: [nosniff]
health:
  storagedriver:
    enabled: true
    interval: 10s
    threshold: 3
Not looking at any kind of authentication. Just need it to be encrypted. Here is the log from my latest attempt using the HTTPS Harbor registry:
Jul 11 07:35:37 rke2-control-01 rke2[3308]: time="2024-07-11T07:35:37Z" level=info msg="Checking local image archives in /var/lib/rancher/rke2/agent/images for index.docker.io/rancher/rke2-runtime:v1.28.11-rke2r1"
Jul 11 07:35:37 rke2-control-01 rke2[3308]: time="2024-07-11T07:35:37Z" level=warning msg="Failed to load runtime image index.docker.io/rancher/rke2-runtime:v1.28.11-rke2r1 from tarball: no local image available for index.docker.io/rancher/rke2-runtime:v1.28.11-rke2>
Jul 11 07:35:37 rke2-control-01 rke2[3308]: time="2024-07-11T07:35:37Z" level=info msg="Checking local image archives in /var/lib/rancher/rke2/agent/images for index.docker.io/rancher/rke2-runtime:v1.28.11-rke2r1"
Jul 11 07:35:37 rke2-control-01 rke2[3308]: time="2024-07-11T07:35:37Z" level=warning msg="Failed to load runtime image index.docker.io/rancher/rke2-runtime:v1.28.11-rke2r1 from tarball: no local image available for index.docker.io/rancher/rke2-runtime:v1.28.11-rke2>
Jul 11 07:35:37 rke2-control-01 rke2[3308]: time="2024-07-11T07:35:37Z" level=info msg="Using private registry config file at /etc/rancher/rke2/registries.yaml"
Jul 11 07:35:37 rke2-control-01 rke2[3308]: time="2024-07-11T07:35:37Z" level=info msg="Pulling runtime image index.docker.io/rancher/rke2-runtime:v1.28.11-rke2r1"
Jul 11 07:35:37 rke2-control-01 rke2[3308]: time="2024-07-11T07:35:37Z" level=fatal msg="chmod /var/lib/rancher/rke2/data/v1.28.11-rke2r1-b8960a847dfd/bin: no such file or directory"
Jul 11 07:35:37 rke2-control-01 systemd[1]: rke2-server.service: Main process exited, code=exited, status=1/FAILURE
Jul 11 07:35:37 rke2-control-01 systemd[1]: rke2-server.service: Failed with result 'exit-code'.
Jul 11 07:35:37 rke2-control-01 systemd[1]: Failed to start Rancher Kubernetes Engine v2 (server).
Jul 11 07:35:37 rke2-control-01 systemd[1]: rke2-server.service: Consumed 3.659s CPU time.
Jul 11 07:35:42 rke2-control-01 systemd[1]: rke2-server.service: Scheduled restart job, restart counter is at 1.
Jul 11 07:35:42 rke2-control-01 systemd[1]: Stopped Rancher Kubernetes Engine v2 (server).
Jul 11 07:35:42 rke2-control-01 systemd[1]: rke2-server.service: Consumed 3.659s CPU time.
Finally, here are the commands I'm calling:
/usr/local/bin/helm upgrade -i cert-manager oci://harbor.kten.test/library/hauler/cert-manager --version ${cert-manager-version} --kubeconfig /etc/rancher/rke2/rke2.yaml --namespace cert-manager --create-namespace --set crds.enabled=true
/usr/local/bin/helm upgrade -i rancher oci://harbor.kten.test/library/hauler/rancher --namespace cattle-system --create-namespace --version ${rancher_version} --kubeconfig /etc/rancher/rke2/rke2.yaml --set bootstrapPassword=${rancher_bootstrap_password} --set replicas=1 --set auditLog.level=2 --set auditLog.destination=hostPath --set useBundledSystemChart=true --set systemDefaultRegistry=harbor.kten.test/library --set hostname=${rancher_hostname}
These are embedded in a cloud-init userdata, which is why I have to pass the KUBECONFIG variable. Thank you!
A couple of things I see.
Here is my /etc/rancher/rke2/registries.yaml. Notice the http endpoints.
mirrors:
  docker.io:
    endpoint:
      - http://192.168.1.29:5000
  192.168.1.29:5000:
    endpoint:
      - http://192.168.1.29:5000
This would be coupled with --plain-http at the end of the helm command.
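For anyone following along, here is a rough sketch of what such a helm invocation could look like. It only prints the command rather than running it. The registry address is taken from the registries.yaml above, but the chart path `hauler/rancher` is an assumption on my part, and the `--plain-http` flag is only present in fairly recent Helm 3 releases, so check `helm upgrade --help` on your version first.

```shell
#!/usr/bin/env bash
# Sketch: build a helm command for a chart served from a plain-HTTP OCI registry.
# REGISTRY and the chart path are illustrative assumptions, not confirmed values.
REGISTRY="${REGISTRY:-192.168.1.29:5000}"

helm_plain_http_cmd() {
  local release="$1" chart="$2" namespace="$3"
  # --plain-http tells helm to fetch the OCI chart over HTTP instead of HTTPS.
  echo helm upgrade -i "$release" "oci://${REGISTRY}/${chart}" \
    --namespace "$namespace" --create-namespace --plain-http
}

# Print the command for review; remove the echo above to actually run it.
helm_plain_http_cmd rancher hauler/rancher cattle-system
```

Printing first makes it easy to eyeball the final command inside a cloud-init template before it is executed on a node.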
In the morning I am going to test a newer version of the script that cleans up some of the helm commands. I found that with the registries configured, I did not need to add all the image locations with helm.
I'll retry that again. I did have the following previously with the --plain-http flag on the Helm request when I was using Hauler to seed.
mirrors:
  "*":
    endpoint:
      - http://192.168.1.29:5000
However, I did just see this issue opened on the Rancher GitHub Issues for both 2.8 and 2.9 that could be related: "air gap RKE2 downstream cluster fails to pull images if the registry mirrors endpoint does not contain a schema". I'll test again and let you know.
Thank you!
That is for downstream clusters not being able to use the embedded charts. It has hit a few customers. Are you building the "local" cluster or a downstream one?
The originating local one to seed our environment.
I made the adjustments to the template used for both the server and worker nodes for /etc/rancher/rke2/registries.yaml:
mirrors:
  docker.io:
    endpoint:
      - http://${helm_archive_server_ip}:5000
  ${helm_archive_server_ip}:5000:
    endpoint:
      - http://${helm_archive_server_ip}:5000
  "*":
    endpoint:
      - http://${helm_archive_server_ip}:5000
RKE2 deploys and the cluster initializes. The cloud-init script I've built to install cert-manager and metallb works great, and the first rancher/rancher pod starts up. The problem comes when the next helm-operation pod tries to pull its image. Here is the list of events from the ErrImagePull pod:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 55s default-scheduler Successfully assigned cattle-system/helm-operation-hgds7 to rke2-worker-01
Normal BackOff <invalid> (x5 over <invalid>) kubelet Back-off pulling image "192.168.51.99:5000/rancher/shell:v0.1.24"
Warning Failed <invalid> (x5 over <invalid>) kubelet Error: ImagePullBackOff
Normal BackOff <invalid> (x3 over <invalid>) kubelet Back-off pulling image "192.168.51.99:5000/rancher/shell:v0.1.24"
Warning Failed <invalid> (x3 over <invalid>) kubelet Error: ImagePullBackOff
Normal Pulling <invalid> (x3 over <invalid>) kubelet Pulling image "192.168.51.99:5000/rancher/shell:v0.1.24"
Warning Failed <invalid> (x3 over <invalid>) kubelet Failed to pull image "192.168.51.99:5000/rancher/shell:v0.1.24": failed to pull and unpack image "192.168.51.99:5000/rancher/shell:v0.1.24": failed to resolve reference "192.168.51.99:5000/rancher/shell:v0.1.24": failed to do request: Head "https://192.168.51.99:5000/v2/rancher/shell/manifests/v0.1.24": http: server gave HTTP response to HTTPS client
Warning Failed <invalid> (x3 over <invalid>) kubelet Error: ErrImagePull
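The "server gave HTTP response to HTTPS client" message means the client opened a TLS connection to a listener that only speaks plain HTTP. A quick way to confirm which scheme the registry actually answers on is to probe its `/v2/` endpoint from the node. The helper below is a minimal sketch, assuming curl is installed; the function name `probe_registry` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which scheme a registry endpoint answers on.
probe_registry() {
  local hostport="$1" scheme code
  for scheme in https http; do
    # -k skips certificate verification so a self-signed cert still counts as HTTPS.
    # /v2/ is the registry API base path; 200 or 401 both mean a registry answered.
    code="$(curl -k -s -o /dev/null -m 5 -w '%{http_code}' "${scheme}://${hostport}/v2/" || true)"
    case "$code" in
      200|401) echo "$scheme"; return 0 ;;
    esac
  done
  # Neither scheme produced a registry-style response.
  echo "none"
  return 1
}
```

Running `probe_registry 192.168.51.99:5000` on the worker node should print `http`, `https`, or `none`, which tells you whether the scheme in registries.yaml matches what the registry is serving.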
OK, I just updated and validated the script. Rancher deployed with no issue. There is an issue with NeuVector, but that shouldn't stop you.
Not sure if you add images from other repos, but this is what the new registries.yaml looks like:
mirrors:
  docker.io:
    endpoint:
      - http://192.168.1.198:5000
  quay.io:
    endpoint:
      - http://192.168.1.198:5000
I am checking with the engineers about the wildcard.
Can I add you to a private repo with the Terraform and cloud-init files I'm using? I used your script as the baseline for the air gap system setup, and I'm using the script to build the store on the Internet side. I'm going to go line by line again to see what I missed, but I could use another set of eyes.
Sure. Let's chat some more off of GH. My email is clemenko @ gmail.com.
Oh they said wildcard is approved.. :D
Is this still an issue?
Closing due to age. Re-open if needed.
Using the 1.28.11 RKE2 and Rancher 2.8.5 products, we've come across a situation where using Hauler to provide the airgap registry using http works great for the RKE2 server and agent setup. But it fails to install Rancher after the initial rancher/rancher pod starts. The helm-operation activities ignore the private registry setting for an insecure registry and insist on a secure registry.
After generating a TLS certificate for Hauler, rke2-server fails to download the container images from the HTTPS Hauler registry, even though the only change to registries.yaml is the switch from "http" to "https".
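One thing that may matter when switching to https with a self-generated certificate: registries.yaml also accepts a `configs` section where a CA file can be given per registry host, so containerd can trust the certificate. The sketch below just prints such a file. The CA path is a hypothetical example, and I'm assuming the certificate is signed by a CA that you can copy to each node.

```shell
#!/usr/bin/env bash
# Sketch: emit a registries.yaml that points at an HTTPS mirror and trusts a private CA.
# The default CA path is a hypothetical example; adjust it to where the CA lives on the node.
render_tls_registries() {
  local host="$1" ca_file="${2:-/etc/rancher/rke2/registry-ca.crt}"
  cat <<EOF
mirrors:
  "*":
    endpoint:
      - "https://${host}"
configs:
  "${host}":
    tls:
      ca_file: "${ca_file}"
EOF
}

# Print the rendered file for the Hauler registry address used in this thread.
render_tls_registries 192.168.51.99:5000
```

Redirect the output to /etc/rancher/rke2/registries.yaml and restart rke2-server or rke2-agent for it to take effect.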
How do you overcome and force the Rancher build to stick to the specified registry?
Out of curiosity, I set up a Harbor registry with valid trusted TLS certs, and now Rancher won't stop trying to go to index.docker.io/rancher/rke2-runtime.
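For reference, RKE2 has a `system-default-registry` option in /etc/rancher/rke2/config.yaml that controls where its own system images are pulled from. The sketch below prints a fragment that sets it. I'm using the bare Harbor hostname here as an assumption; I have not verified whether a project path such as /library is accepted, or whether this setting resolves the behavior described above.

```shell
#!/usr/bin/env bash
# Sketch: emit an RKE2 config.yaml fragment that sets the registry for system images.
# The registry value passed in below is an assumption based on the Harbor host in this thread.
render_rke2_config() {
  local registry="$1"
  cat <<EOF
system-default-registry: "${registry}"
EOF
}

# Print the fragment; merge it into /etc/rancher/rke2/config.yaml by hand.
render_rke2_config harbor.kten.test
```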