clemenko / rke_airgap_install

a script/method for air gapping the Rancher Stack with Hauler

Rancher install insists on https registry; breaks rke2-server setup #19

Closed darkhonor closed 1 month ago

darkhonor commented 4 months ago

Using the 1.28.11 RKE2 and Rancher 2.8.5 products, we've come across a situation where using Hauler to provide the airgap registry using http works great for the RKE2 server and agent setup. But it fails to install Rancher after the initial rancher/rancher pod starts. The helm-operation activities ignore the private registry setting for an insecure registry and insist on a secure registry.

After generating a TLS certificate for Hauler, rke2-server fails to download the container images from the HTTPS Hauler registry, even though the only change to registries.yaml is the switch from "http" to "https".
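
One possibility the thread does not confirm: if the Hauler certificate is self-signed, containerd on the RKE2 nodes would reject it unless registries.yaml also names the CA to trust. A minimal sketch that writes such a file follows. The function name, host, and CA path are hypothetical placeholders; the `configs.<host>.tls.ca_file` key is the one described in the RKE2 private registry documentation.

```shell
# write_tls_registries OUT HOST CA_FILE
# Writes a registries.yaml that mirrors every registry to https://HOST and
# tells containerd which CA file to trust for HOST.
# Assumption: HOST and CA_FILE are placeholders, not values confirmed in the thread.
write_tls_registries() {
  out=$1 host=$2 ca_file=$3
  cat > "$out" <<EOF
mirrors:
  "*":
    endpoint:
      - https://${host}
configs:
  "${host}":
    tls:
      ca_file: ${ca_file}
EOF
}
```

For example, `write_tls_registries /etc/rancher/rke2/registries.yaml 192.168.51.99:5000 /path/to/ca.crt` on each node, followed by a restart of the RKE2 service, since the file is read at startup.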

How do you overcome and force the Rancher build to stick to the specified registry?

Out of curiosity, I set up a Harbor registry with valid trusted TLS certs and now Rancher won't stop trying to go to index.docker.io/rancher/rke2-runtime.

clemenko commented 4 months ago

Interesting. Let me test later today. It is possible that the Rancher helm chart is borked. There is a possible workaround with a registries.yaml on each of the nodes. Let me check the chart first.

darkhonor commented 4 months ago

To maybe help, here are some of my configs that I'm using with the airgap registry:

/etc/rancher/rke2/registries.yaml:

mirrors:
  "*":
    endpoint:
      - https://harbor.kten.test/library

On the Hauler system, here's the Hauler config I created to get TLS working. Again, if I didn't have this, RKE2 and cert-manager would deploy with no issues. But Rancher wouldn't.

version: 0.1
log:
  fields:
    service: registry
storage:
  cache:
    blobdescriptor: inmemory
  filesystem:
    rootdirectory: /var/lib/registry
http:
  addr: 192.168.51.99:5000
  net: tcp
  host: https://192.168.51.99:5000
  secret: NotAGr3@tS3creT!
  relativeurls: false
  tls:
    certificate: /etc/registry/registry.crt
    key: /etc/registry/registry.key
  headers:
    X-Content-Type-Options: [nosniff]
health:
  storagedriver:
    enabled: true
    interval: 10s
    threshold: 3

Not looking at any kind of authentication. Just need it to be encrypted. Here is the log from my latest attempt using the HTTPS Harbor registry:

Jul 11 07:35:37 rke2-control-01 rke2[3308]: time="2024-07-11T07:35:37Z" level=info msg="Checking local image archives in /var/lib/rancher/rke2/agent/images for index.docker.io/rancher/rke2-runtime:v1.28.11-rke2r1"
Jul 11 07:35:37 rke2-control-01 rke2[3308]: time="2024-07-11T07:35:37Z" level=warning msg="Failed to load runtime image index.docker.io/rancher/rke2-runtime:v1.28.11-rke2r1 from tarball: no local image available for index.docker.io/rancher/rke2-runtime:v1.28.11-rke2>
Jul 11 07:35:37 rke2-control-01 rke2[3308]: time="2024-07-11T07:35:37Z" level=info msg="Checking local image archives in /var/lib/rancher/rke2/agent/images for index.docker.io/rancher/rke2-runtime:v1.28.11-rke2r1"
Jul 11 07:35:37 rke2-control-01 rke2[3308]: time="2024-07-11T07:35:37Z" level=warning msg="Failed to load runtime image index.docker.io/rancher/rke2-runtime:v1.28.11-rke2r1 from tarball: no local image available for index.docker.io/rancher/rke2-runtime:v1.28.11-rke2>
Jul 11 07:35:37 rke2-control-01 rke2[3308]: time="2024-07-11T07:35:37Z" level=info msg="Using private registry config file at /etc/rancher/rke2/registries.yaml"
Jul 11 07:35:37 rke2-control-01 rke2[3308]: time="2024-07-11T07:35:37Z" level=info msg="Pulling runtime image index.docker.io/rancher/rke2-runtime:v1.28.11-rke2r1"
Jul 11 07:35:37 rke2-control-01 rke2[3308]: time="2024-07-11T07:35:37Z" level=fatal msg="chmod /var/lib/rancher/rke2/data/v1.28.11-rke2r1-b8960a847dfd/bin: no such file or directory"
Jul 11 07:35:37 rke2-control-01 systemd[1]: rke2-server.service: Main process exited, code=exited, status=1/FAILURE
Jul 11 07:35:37 rke2-control-01 systemd[1]: rke2-server.service: Failed with result 'exit-code'.
Jul 11 07:35:37 rke2-control-01 systemd[1]: Failed to start Rancher Kubernetes Engine v2 (server).
Jul 11 07:35:37 rke2-control-01 systemd[1]: rke2-server.service: Consumed 3.659s CPU time.
Jul 11 07:35:42 rke2-control-01 systemd[1]: rke2-server.service: Scheduled restart job, restart counter is at 1.
Jul 11 07:35:42 rke2-control-01 systemd[1]: Stopped Rancher Kubernetes Engine v2 (server).
Jul 11 07:35:42 rke2-control-01 systemd[1]: rke2-server.service: Consumed 3.659s CPU time.

Finally, here are the commands I'm calling:

/usr/local/bin/helm upgrade -i cert-manager oci://harbor.kten.test/library/hauler/cert-manager --version ${cert-manager-version} --kubeconfig /etc/rancher/rke2/rke2.yaml --namespace cert-manager --create-namespace --set crds.enabled=true
/usr/local/bin/helm upgrade -i rancher oci://harbor.kten.test/library/hauler/rancher --namespace cattle-system --create-namespace --version ${rancher_version} --kubeconfig /etc/rancher/rke2/rke2.yaml --set bootstrapPassword=${rancher_bootstrap_password} --set replicas=1 --set auditLog.level=2 --set auditLog.destination=hostPath --set useBundledSystemChart=true --set systemDefaultRegistry=harbor.kten.test/library --set hostname=${rancher_hostname}

These are embedded in a cloud-init userdata, which is why I have to pass the KUBECONFIG variable. Thank you!

clemenko commented 4 months ago

A couple of things I see. Here is my /etc/rancher/rke2/registries.yaml; notice the http endpoints.

mirrors:
  docker.io:
    endpoint:
      - http://192.168.1.29:5000
  192.168.1.29:5000:
    endpoint:
      - http://192.168.1.29:5000

This would be coupled with --plain-http at the end of the helm command.
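
As a hedged sketch of what that could look like, here is the Rancher install from earlier in the thread with `--plain-http` appended (the flag is available in Helm 3.13 and later for OCI registries). The `run` wrapper, the `DRY_RUN` variable, and the function name are hypothetical helpers, and the `oci://<registry>/hauler/rancher` chart path is an assumption based on the layout used in the commands above.

```shell
# run CMD...: print the command, then execute it unless DRY_RUN is set.
run() {
  printf '+ %s\n' "$*"
  [ -n "${DRY_RUN:-}" ] || "$@"
}

# install_rancher REGISTRY VERSION RANCHER_HOSTNAME
# Assumption: REGISTRY is a plain-HTTP Hauler registry such as 192.168.1.29:5000.
install_rancher() {
  run helm upgrade -i rancher "oci://$1/hauler/rancher" \
    --namespace cattle-system --create-namespace \
    --version "$2" \
    --set hostname="$3" \
    --set useBundledSystemChart=true \
    --plain-http
}
```

Running `DRY_RUN=1 install_rancher 192.168.1.29:5000 2.8.5 rancher.example.test` prints the full command for review without touching the cluster.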

I am going to test a newer version of the script in the morning to clean up some of the helm commands. I found that with the registries I did not need to add all the image locations with helm.

darkhonor commented 4 months ago

I'll retry that again. I did have the following previously with the --plain-http flag on the Helm request when I was using Hauler to seed.

mirrors:
  "*":
    endpoint:
      - http://192.168.1.29:5000

However, I did just see this issue opened on the Rancher GitHub Issues for both 2.8 and 2.9 that could be related: air gap RKE2 downstream cluster fails to pull images if the registry mirrors endpoint does not contain a schema. I'll test again and let you know.

Thank you!

clemenko commented 4 months ago

That is for downstream clusters not being able to use the embedded charts. It has hit a few customers. Are you building the "local" cluster or downstream?

darkhonor commented 4 months ago

The originating local one to seed our environment.

darkhonor commented 4 months ago

I made the adjustments to the template used for both the server and worker nodes for /etc/rancher/rke2/registries.yaml:

mirrors:
  docker.io:
    endpoint:
      - http://${helm_archive_server_ip}:5000
  ${helm_archive_server_ip}:5000:
    endpoint:
      - http://${helm_archive_server_ip}:5000
  "*":
    endpoint:
      - http://${helm_archive_server_ip}:5000

RKE2 deploys and the cluster initiates. The cloud-init script to install cert-manager and metallb that I've built in works great and the first rancher/rancher pod starts up. The problem is when the next helm-operation pod tries to pull. Here is the list of events from the ErrImagePull pod:

Events:
  Type     Reason     Age                            From               Message
  ----     ------     ----                           ----               -------
  Normal   Scheduled  55s                            default-scheduler  Successfully assigned cattle-system/helm-operation-hgds7 to rke2-worker-01
  Normal   BackOff    <invalid> (x5 over <invalid>)  kubelet            Back-off pulling image "192.168.51.99:5000/rancher/shell:v0.1.24"
  Warning  Failed     <invalid> (x5 over <invalid>)  kubelet            Error: ImagePullBackOff
  Normal   BackOff    <invalid> (x3 over <invalid>)  kubelet            Back-off pulling image "192.168.51.99:5000/rancher/shell:v0.1.24"
  Warning  Failed     <invalid> (x3 over <invalid>)  kubelet            Error: ImagePullBackOff
  Normal   Pulling    <invalid> (x3 over <invalid>)  kubelet            Pulling image "192.168.51.99:5000/rancher/shell:v0.1.24"
  Warning  Failed     <invalid> (x3 over <invalid>)  kubelet            Failed to pull image "192.168.51.99:5000/rancher/shell:v0.1.24": failed to pull and unpack image "192.168.51.99:5000/rancher/shell:v0.1.24": failed to resolve reference "192.168.51.99:5000/rancher/shell:v0.1.24": failed to do request: Head "https://192.168.51.99:5000/v2/rancher/shell/manifests/v0.1.24": http: server gave HTTP response to HTTPS client
  Warning  Failed     <invalid> (x3 over <invalid>)  kubelet            Error: ErrImagePull
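
The last `Failed to pull image` event shows containerd sending an HTTPS `HEAD` request to a registry that answers in plain HTTP, which would be consistent with no http mirror entry being in effect for `192.168.51.99:5000` on that node. A rough sketch for checking this by hand follows. Both function names are hypothetical, and the probe assumes `curl` is available on the node.

```shell
# registry_of IMAGE: print the registry part of an image reference.
# References whose first path component is not a host fall back to docker.io.
registry_of() {
  first=${1%%/*}
  case "$1" in
    */*)
      case "$first" in
        *.*|*:*|localhost) printf '%s\n' "$first" ;;
        *) printf 'docker.io\n' ;;
      esac ;;
    *) printf 'docker.io\n' ;;
  esac
}

# probe_scheme HOST: report how the registry's /v2/ endpoint answers on each scheme.
probe_scheme() {
  for scheme in http https; do
    code=$(curl -sk -o /dev/null -m 5 -w '%{http_code}' "$scheme://$1/v2/") || code=failed
    printf '%s://%s/v2/ -> %s\n' "$scheme" "$1" "$code"
  done
}
```

Calling `probe_scheme "$(registry_of 192.168.51.99:5000/rancher/shell:v0.1.24)"` on the worker named in the events shows which scheme the registry speaks; the result can then be compared with the `mirrors` keys in that node's /etc/rancher/rke2/registries.yaml.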

clemenko commented 4 months ago

OK, I just updated and validated the script. Rancher deployed with no issues. There is an issue with NeuVector, but that shouldn't stop you.

Not sure if you add images from other repos, but this is what the new registries.yaml looks like:

mirrors:
  docker.io:
    endpoint:
      - http://192.168.1.198:5000
  quay.io:
    endpoint:
      - http://192.168.1.198:5000
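
If images come from more registries than these two, a file of this shape can be generated rather than edited by hand. This is only a sketch: `write_mirrors` is a hypothetical helper, not part of the script in this repo.

```shell
# write_mirrors OUT ENDPOINT REGISTRY...
# Writes a registries.yaml that points each REGISTRY at the same ENDPOINT.
# Keys are quoted so that a wildcard entry ("*") stays valid YAML.
write_mirrors() {
  out=$1 endpoint=$2
  shift 2
  {
    printf 'mirrors:\n'
    for registry in "$@"; do
      printf '  "%s":\n    endpoint:\n      - %s\n' "$registry" "$endpoint"
    done
  } > "$out"
}
```

For example, `write_mirrors /etc/rancher/rke2/registries.yaml http://192.168.1.198:5000 docker.io quay.io` produces the same two mirrors shown above, and further registry names can be appended as extra arguments.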

I am checking with the engineers about the wild card.

darkhonor commented 4 months ago

Can I add you to a private repo with my Terraform and cloud-init files I'm using? I used your script as the baseline for the air gap system setup and I'm using the script to build the store on the Internet side. Going to be going line by line again to see what I missed. But could use another set of eyes?

clemenko commented 4 months ago

Sure. Let's chat some more off of GH. My email is clemenko @ gmail.com.

Oh, they said the wildcard is approved. :D

clemenko commented 2 months ago

Is this still an issue?

clemenko commented 1 month ago

Closing due to age. Re-open if needed.