clemenko / rke_airgap_install

a script/method for air gapping the Rancher Stack with Hauler

Enabling Monitoring in airgapped environment #25

Open valentin-nasta opened 1 month ago

valentin-nasta commented 1 month ago

I was checking the possibility to enable the Monitoring Application in an airgapped environment according to this documentation: https://ranchermanager.docs.rancher.com/how-to-guides/advanced-user-guides/monitoring-alerting-guides/enable-monitoring

Is this a separate Helm chart, or is it part of the existing stack? How would this setup look when integrated inside the rke_airgap_install script? Any tips or guidance on configuring this in an airgapped environment would be greatly appreciated.

Thank you!

clemenko commented 1 month ago

By default or script? I think the images may already be there. I need to test a cluster tonight/tomorrow. Actually, it should be fairly easy to add to the script.

clemenko commented 1 month ago

So good news. All the images are included already. I was able to go into rancher and use the catalog for Monitoring and everything worked.

From https://github.com/clemenko/rke_airgap_install/blob/main/hauler_all_the_things.sh#L489 --set useBundledSystemChart=true tells rancher to use the charts locally. And since all the images are already stored in hauler everything works.

Is there something more that you are looking for?

valentin-nasta commented 1 month ago

Thank you for the quick reply.

By default or script?

By default would be nice, if there is some kind of Rancher activation of the monitoring similar to the govmessage. Otherwise, adding it to the script would also work fine. The scenario is to have the system already prepared and delivered to the customer without needing to fiddle with the setup afterward.

I also discovered which Helm chart is actually being used by inspecting the UI (rancher-monitoring-103.1.1-up45.31.1.tgz). Initially, I thought it was this one: kube-prometheus-stack.

I tried installing it "manually," but it fails. Do you have any idea why this might happen?

helm upgrade --install=true --namespace=cattle-monitoring-system --timeout=10m0s --values=/home/shell/helm/values-rancher-monitoring-103.1.1-up45.31.1.yaml --version=103.1.1+up45.31.1 --wait=true rancher-monitoring /home/shell/helm/rancher-monitoring-103.1.1-up45.31.1.tgz
Release "rancher-monitoring" does not exist. Installing it now.
Starting delete for "rancher-monitoring-admission" ServiceAccount
Ignoring delete failure for "rancher-monitoring-admission" /v1, Kind=ServiceAccount: serviceaccounts "rancher-monitoring-admission" not found
creating 1 resource(s)
Starting delete for "rancher-monitoring-admission" ClusterRole
Ignoring delete failure for "rancher-monitoring-admission" rbac.authorization.k8s.io/v1, Kind=ClusterRole: clusterroles.rbac.authorization.k8s.io "rancher-monitoring-admission" not found
creating 1 resource(s)
Starting delete for "rancher-monitoring-admission" ClusterRoleBinding
Ignoring delete failure for "rancher-monitoring-admission" rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding: clusterrolebindings.rbac.authorization.k8s.io "rancher-monitoring-admission" not found
creating 1 resource(s)
Starting delete for "rancher-monitoring-admission" Role
Ignoring delete failure for "rancher-monitoring-admission" rbac.authorization.k8s.io/v1, Kind=Role: roles.rbac.authorization.k8s.io "rancher-monitoring-admission" not found
creating 1 resource(s)
Starting delete for "rancher-monitoring-admission" RoleBinding
Ignoring delete failure for "rancher-monitoring-admission" rbac.authorization.k8s.io/v1, Kind=RoleBinding: rolebindings.rbac.authorization.k8s.io "rancher-monitoring-admission" not found
creating 1 resource(s)
Starting delete for "rancher-monitoring-admission-create" Job
Ignoring delete failure for "rancher-monitoring-admission-create" batch/v1, Kind=Job: jobs.batch "rancher-monitoring-admission-create" not found
creating 1 resource(s)
Watching for changes to Job rancher-monitoring-admission-create with timeout of 10m0s
Add/Modify event for rancher-monitoring-admission-create: ADDED
rancher-monitoring-admission-create: Jobs active: 0, jobs failed: 0, jobs succeeded: 0
Add/Modify event for rancher-monitoring-admission-create: MODIFIED
rancher-monitoring-admission-create: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
Error: failed pre-install: 1 error occurred:
        * timed out waiting for the condition

full log: helm-operation-v68w4_undefined.log

valentin-nasta commented 1 month ago

I think I found the root cause of the error:

kubectl -n cattle-monitoring-system get job rancher-monitoring-admission-create
NAME                                  COMPLETIONS   DURATION   AGE
rancher-monitoring-admission-create   0/1           93m        93m

kubectl -n cattle-monitoring-system get pod --selector=job-name=rancher-monitoring-admission-create
NAME                                        READY   STATUS             RESTARTS   AGE
rancher-monitoring-admission-create-snvlv   0/1     ImagePullBackOff   0          91m

kubectl -n cattle-monitoring-system get pod --selector=job-name=rancher-monitoring-admission-create -oyaml | grep image
      image: 192.168.100.107:5000/rancher/mirrored-ingress-nginx-kube-webhook-certgen:v20221220-controller-v1.5.1-58-g787ea74b6
      imagePullPolicy: IfNotPresent
    - image: 192.168.100.107:5000/rancher/mirrored-ingress-nginx-kube-webhook-certgen:v20221220-controller-v1.5.1-58-g787ea74b6
      imageID: ""
          message: Back-off pulling image "192.168.100.107:5000/rancher/mirrored-ingress-nginx-kube-webhook-certgen:v20221220-controller-v1.5.1-58-g787ea74b6"
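To pull just the image references out of pod YAML like the output above, a small filter along these lines may help. This is a minimal sketch: `pod_images` is a hypothetical helper name, not part of the rke_airgap_install script, and it assumes the YAML arrives on stdin in the layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: print the unique image references found in pod YAML
# read from stdin. Only "image:" keys are matched, so imagePullPolicy,
# imageID and the back-off message line are ignored.
pod_images() {
  grep -E '^[[:space:]-]*image:' \
    | sed -E 's/^[[:space:]-]*image:[[:space:]]*//' \
    | sort -u
}

# Usage (assumes kubectl access to the cluster):
#   kubectl -n cattle-monitoring-system get pod \
#     --selector=job-name=rancher-monitoring-admission-create -oyaml | pod_images
```

Each distinct image is printed once, which makes it easier to compare against what the local registry actually holds.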

clemenko commented 1 month ago

If you want https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack, you will have to add the images yourself. Right now, out of the box, all the images you need are there for the Rancher Monitoring App. You can deploy Rancher and install from the catalog. I will look into adding it from a curl shortly.

clemenko commented 1 month ago

After looking into this, there is no easy way to do it. The charts they are using are baked in. The chart versions are also hard-coded. The simplest way is to use the GUI to deploy it.

valentin-nasta commented 1 month ago

Thank you for taking a look at it! Even using the GUI, it fell short with the error from the previous comment. I need to troubleshoot it and make sure to load the images beforehand.

clemenko commented 1 month ago

I was not able to reproduce the error. Did you deploy rancher with the script?

valentin-nasta commented 1 month ago

Yes, I deployed rancher with the script, with these versions:

export RKE_VERSION=1.28.12
export CERT_VERSION=v1.15.3
export RANCHER_VERSION=v2.8.5
export LONGHORN_VERSION=v1.7.0
export NEU_VERSION=2.7.7

I think I am getting closer; there is a version mismatch somewhere:

hauler store info | grep mirrored-ingress-nginx-kube

| rancher/mirrored-ingress-nginx-kube-webhook-certgen:v20230312-helm-chart-4.5.2-28-g66a760794 | image | linux/amd64 |        2 | 20.1 MB  |

vs

message: Back-off pulling image "192.168.100.107:5000/rancher/mirrored-ingress-nginx-kube-webhook-certgen:v20221220-controller-v1.5.1-58-g787ea74b6"
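
The comparison above can be scripted. The following is a rough sketch, not something taken from the rke_airgap_install script: `image_in_store` is a hypothetical helper, and it assumes the store listing is piped in on stdin in the table layout shown above, with the pod's image reference carrying a `host:port/` registry prefix.

```shell
#!/bin/sh
# Hypothetical helper: check whether the image a pod is trying to pull
# appears in a hauler store listing supplied on stdin.
# $1 = full image reference from the pod (registry/repo/name:tag)
image_in_store() {
  wanted=${1#*/}        # drop the registry host (assumes a host:port/ prefix)
  repo=${wanted%:*}     # repository name without the tag
  listing=$(cat)
  # Assumes each reference in the listing is followed by a space, as in the table.
  if printf '%s\n' "$listing" | grep -qF -- "$wanted "; then
    echo "found: $wanted"
    return 0
  fi
  echo "missing: $wanted"
  # Show which tags of the same repository the store does contain.
  printf '%s\n' "$listing" | grep -oE -- "$repo:[^ |]+" | sed 's/^/in store: /'
  return 1
}

# Usage (assumes the hauler CLI and store are available):
#   hauler store info | image_in_store "192.168.100.107:5000/rancher/mirrored-ingress-nginx-kube-webhook-certgen:v20221220-controller-v1.5.1-58-g787ea74b6"
```

A non-zero exit status together with an "in store:" line carrying a different tag points to the kind of version mismatch described here.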

clemenko commented 1 month ago

I think I know what is going on. Updating the script now for it.

clemenko commented 1 month ago

Take a look at https://github.com/clemenko/rke_airgap_install/commit/64054ec774bc913e7524b277525412aa6e494d9f