Icinga / helm-charts

Kubernetes Helm charts to deploy a ready-to-use Icinga monitoring stack.
https://icinga.com
Apache License 2.0

[Bug]: can't retry: can't perform "INSERT INTO `pod_owner` #56

Open ngoeddel-openi opened 4 months ago

ngoeddel-openi commented 4 months ago

Affected Chart

icinga-stack

Which version of the app contains the bug?

0.3.0

Please describe your problem

Actually I am using my fork here: https://github.com/open-i-gmbh/icinga-helm-charts

At the moment it only has minor local changes on my machine, because I am trying to get HA, parent zones, satellites and all that good stuff working.

Anyway, the bug I encountered comes from the icinga-kubernetes subchart. It deploys fine, but the Pod never becomes healthy. This is what the Pod log shows:

E0625 10:57:32.777860       1 runtime.go:79] Observed a panic: "send on closed channel" (send on closed channel)
goroutine 6316 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x18aeca0, 0x1db96e0})
    /go/pkg/mod/k8s.io/apimachinery@v0.30.1/pkg/util/runtime/runtime.go:75 +0x85
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc0008c8e00?})
    /go/pkg/mod/k8s.io/apimachinery@v0.30.1/pkg/util/runtime/runtime.go:49 +0x6b
panic({0x18aeca0?, 0x1db96e0?})
    /usr/local/go/src/runtime/panic.go:770 +0x132
github.com/icinga/icinga-kubernetes/pkg/database.(*hasMany[...]).StreamInto(0x1dbc300?, {0x1dd7a30, 0xc0038a3c70}, 0xc003d38960)
    /build/pkg/database/relations.go:76 +0x125
github.com/icinga/icinga-kubernetes/pkg/database.(*Database).UpsertStreamed.func3.1()
    /build/pkg/database/database.go:572 +0xa5
golang.org/x/sync/errgroup.(*Group).Go.func1()
    /go/pkg/mod/golang.org/x/sync@v0.7.0/errgroup/errgroup.go:78 +0x56
created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 3647
    /go/pkg/mod/golang.org/x/sync@v0.7.0/errgroup/errgroup.go:75 +0x96
panic: send on closed channel [recovered]
    panic: send on closed channel

goroutine 6316 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc0008c8e00?})
    /go/pkg/mod/k8s.io/apimachinery@v0.30.1/pkg/util/runtime/runtime.go:56 +0xcd
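
For readers unfamiliar with this panic: in Go, sending on a channel that has already been closed always panics with exactly this message. The following is a minimal standalone sketch of that behaviour, not the icinga-kubernetes code; `trySend` is a hypothetical helper introduced only for illustration.

```go
package main

import "fmt"

// trySend is a hypothetical helper: it sends v on ch and reports whether
// the send panicked, together with the recovered panic message.
func trySend(ch chan int, v int) (panicked bool, msg string) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
			msg = fmt.Sprint(r)
		}
	}()
	ch <- v
	return false, ""
}

func main() {
	ch := make(chan int, 1)

	// Sending on an open, buffered channel succeeds.
	fmt.Println(trySend(ch, 1))

	// After close, any further send panics, even though a receiver
	// could still drain the value that is already buffered.
	close(ch)
	fmt.Println(trySend(ch, 2))
}
```

A common way to avoid this in concurrent pipelines is to let only the sending side close the channel, and only after every sender has finished.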

It also shows this many times:

F0625 10:40:08.672499       1 main.go:204] can't retry: can't perform "INSERT INTO `pod_owner` (`controller`, `name`, `pod_uuid`, `owner_uuid`, `uid`, `block_owner_deletion`, `kind`) VALUES (:controller, :name, :pod_uuid, :owner_uuid, :uid, :block_owner_deletion, :kind) ON DUPLICATE KEY UPDATE `controller` = VALUES(`controller`), `name` = VALUES(`name`), `pod_uuid` = VALUES(`pod_uuid`), `owner_uuid` = VALUES(`owner_uuid`), `uid` = VALUES(`uid`), `block_owner_deletion` = VALUES(`block_owner_deletion`), `kind` = VALUES(`kind`)": Error 1265 (01000): Data truncated for column 'kind' at row 11

And the database pod shows this:

<...>
2024-06-25 10:57:32 1219 [Warning] Aborted connection 1219 to db: 'kubernetes' user: 'icinga' host: '10.42.22.220' (Got an error reading communication packets)
2024-06-25 10:57:32 1231 [Warning] Aborted connection 1231 to db: 'kubernetes' user: 'icinga' host: '10.42.22.220' (Got an error reading communication packets)
2024-06-25 10:57:32 1232 [Warning] Aborted connection 1232 to db: 'kubernetes' user: 'icinga' host: '10.42.22.220' (Got an error reading communication packets)
2024-06-25 10:57:32 1223 [Warning] Aborted connection 1223 to db: 'kubernetes' user: 'icinga' host: '10.42.22.220' (Got an error reading communication packets)
2024-06-25 11:00:29 1282 [Warning] Aborted connection 1282 to db: 'kubernetes' user: 'icinga' host: '10.42.22.220' (Got an error reading communication packets)
2024-06-25 11:00:29 1288 [Warning] Aborted connection 1288 to db: 'kubernetes' user: 'icinga' host: '10.42.22.220' (Got an error reading communication packets)
2024-06-25 11:00:29 1275 [Warning] Aborted connection 1275 to db: 'kubernetes' user: 'icinga' host: '10.42.22.220' (Got an error reading communication packets)
2024-06-25 11:00:29 1287 [Warning] Aborted connection 1287 to db: 'kubernetes' user: 'icinga' host: '10.42.22.220' (Got an error reading communication packets)
2024-06-25 11:00:29 1252 [Warning] Aborted connection 1252 to db: 'kubernetes' user: 'icinga' host: '10.42.22.220' (Got an error reading communication packets)

On the other hand, it seems to have created all the necessary tables in the database, and pod_owner even contains a lot of entries. But the Pod still does not become healthy and keeps restarting.

This is my values.yaml:

icinga2:
  replicas: 1
  features:
    icingadb:
      enabled: false
  config:
    is_master: true
    zone_name: o-mgmt-zone
    create_endpoints: true
    ticket_salt:
      value: abcdefghijklmnopqrstuvwxyz
      credSecret: # Or use existing secret
      secretKey:
    disable_confd: true
    endpoints:
      - name: o-dev-icinga-1
        host: o-dev-icinga-1
      - name: o-dev-icinga-2
        host: o-dev-icinga-2
    zones:
      - name: o-dev-zone
        parent: o-mgmt-zone
        endpoints:
          - o-dev-icinga-1
          - o-dev-icinga-2

  persistence:
    enabled: true

icingadb:
  enabled: false

icingaweb2:
  enabled: false

global:
  api:
    users:
      director:
        enabled: false
      icingaweb:
        enabled: false
  databases:
    director:
      enabled: false
    icingadb:
      enabled: false
    icingaweb2:
      enabled: false
    kubernetes:
      password:
        value: icinga
      username:
        value: icinga
      persistence:
        enabled: true
    redis:
      enabled: false

Just ignore the config for icinga2 because I changed a lot there.

lippserd commented 4 months ago

Hi @ngoeddel-openi,

Could you please run the following command and share its output?

kubectl get pods -o 'custom-columns=OWNER:.metadata.ownerReferences[0].kind' -A

This will list all pod owner types. I'm quite sure our database schema is too strict and missing a possible type.

Best regards, Eric

ngoeddel-openi commented 4 months ago

Sure, I also added a | sort -u to make the list shorter:

$ kubectl get pods -o 'custom-columns=OWNER:.metadata.ownerReferences[0].kind' --no-headers -A | sort -u
Cluster
DaemonSet
InstanceManager
Job
Node
<none>
ReplicaSet
ShareManager
StatefulSet

lippserd commented 4 months ago

Nice, thanks for the quick reply.

lippserd commented 4 months ago

Cluster, InstanceManager and ShareManager look like custom resource definitions to me. Can you confirm that? You may also run and share kubectl get crds.

lippserd commented 4 months ago

Anyway, you may fix that by executing the following statement in the Icinga for Kubernetes database:

ALTER TABLE pod_owner MODIFY kind varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL;

ngoeddel-openi commented 4 months ago

> Cluster, InstanceManager and ShareManager look like custom resource definitions to me. Can you confirm that? You may also run and share kubectl get crds.

Exactly.

And we definitely have more custom resources in other Kubernetes clusters. Currently I am only testing against our DEV cluster.

Here are all the CRDs we have here right now:

$ kubectl get crds
NAME                                                  CREATED AT
addons.k3s.cattle.io                                  2024-03-04T15:26:22Z
alertmanagerconfigs.monitoring.coreos.com             2024-03-08T10:08:41Z
alertmanagers.monitoring.coreos.com                   2024-03-08T10:08:42Z
alerts.notification.toolkit.fluxcd.io                 2024-03-06T10:24:09Z
apiservers.operator.tigera.io                         2024-03-04T15:26:57Z
backingimagedatasources.longhorn.io                   2024-03-08T10:25:13Z
backingimagemanagers.longhorn.io                      2024-03-08T10:25:13Z
backingimages.longhorn.io                             2024-03-08T10:25:13Z
backupbackingimages.longhorn.io                       2024-03-20T11:53:50Z
backups.longhorn.io                                   2024-03-08T10:25:13Z
backups.postgresql.cnpg.io                            2024-03-08T09:10:15Z
backuptargets.longhorn.io                             2024-03-08T10:25:13Z
backupvolumes.longhorn.io                             2024-03-08T10:25:13Z
bgpconfigurations.crd.projectcalico.org               2024-03-04T15:26:57Z
bgpfilters.crd.projectcalico.org                      2024-03-07T14:04:57Z
bgppeers.crd.projectcalico.org                        2024-03-04T15:26:57Z
blockaffinities.crd.projectcalico.org                 2024-03-04T15:26:57Z
buckets.source.toolkit.fluxcd.io                      2024-03-06T10:24:09Z
caliconodestatuses.crd.projectcalico.org              2024-03-04T15:26:57Z
certificaterequests.cert-manager.io                   2024-03-08T09:10:21Z
certificates.cert-manager.io                          2024-03-08T09:10:21Z
challenges.acme.cert-manager.io                       2024-03-08T09:10:21Z
clusterinformations.crd.projectcalico.org             2024-03-04T15:26:57Z
clusterissuers.cert-manager.io                        2024-03-08T09:10:21Z
clusters.postgresql.cnpg.io                           2024-03-08T09:10:15Z
engineimages.longhorn.io                              2024-03-08T10:25:13Z
engines.longhorn.io                                   2024-03-08T10:25:13Z
etcdsnapshotfiles.k3s.cattle.io                       2024-03-07T14:02:08Z
felixconfigurations.crd.projectcalico.org             2024-03-04T15:26:57Z
gitrepositories.source.toolkit.fluxcd.io              2024-03-06T10:24:09Z
globalnetworkpolicies.crd.projectcalico.org           2024-03-04T15:26:57Z
globalnetworksets.crd.projectcalico.org               2024-03-04T15:26:57Z
helmchartconfigs.helm.cattle.io                       2024-03-04T15:26:22Z
helmcharts.helm.cattle.io                             2024-03-04T15:26:22Z
helmcharts.source.toolkit.fluxcd.io                   2024-03-06T10:24:09Z
helmreleases.helm.toolkit.fluxcd.io                   2024-03-06T10:24:09Z
helmrepositories.source.toolkit.fluxcd.io             2024-03-06T10:24:09Z
hostendpoints.crd.projectcalico.org                   2024-03-04T15:26:57Z
imagepolicies.image.toolkit.fluxcd.io                 2024-03-06T10:24:09Z
imagerepositories.image.toolkit.fluxcd.io             2024-03-06T10:24:09Z
imagesets.operator.tigera.io                          2024-03-04T15:26:57Z
imageupdateautomations.image.toolkit.fluxcd.io        2024-03-06T10:24:09Z
installations.operator.tigera.io                      2024-03-04T15:26:58Z
instancemanagers.longhorn.io                          2024-03-08T10:25:13Z
ipamblocks.crd.projectcalico.org                      2024-03-04T15:26:57Z
ipamconfigs.crd.projectcalico.org                     2024-03-04T15:26:57Z
ipamhandles.crd.projectcalico.org                     2024-03-04T15:26:57Z
ippools.crd.projectcalico.org                         2024-03-04T15:26:57Z
ipreservations.crd.projectcalico.org                  2024-03-04T15:26:57Z
issuers.cert-manager.io                               2024-03-08T09:10:21Z
kubecontrollersconfigurations.crd.projectcalico.org   2024-03-04T15:26:57Z
kustomizations.kustomize.toolkit.fluxcd.io            2024-03-06T10:24:09Z
networkpolicies.crd.projectcalico.org                 2024-03-04T15:26:57Z
networksets.crd.projectcalico.org                     2024-03-04T15:26:57Z
nodes.longhorn.io                                     2024-03-08T10:25:13Z
nvadmissioncontrolsecurityrules.neuvector.com         2024-05-07T08:06:37Z
nvclustersecurityrules.neuvector.com                  2024-05-07T08:06:37Z
nvcomplianceprofiles.neuvector.com                    2024-05-07T08:06:37Z
nvdlpsecurityrules.neuvector.com                      2024-05-07T08:06:37Z
nvsecurityrules.neuvector.com                         2024-05-07T08:06:37Z
nvvulnerabilityprofiles.neuvector.com                 2024-05-07T08:06:37Z
nvwafsecurityrules.neuvector.com                      2024-05-07T08:06:37Z
ocirepositories.source.toolkit.fluxcd.io              2024-03-06T10:24:09Z
opensearchclusters.opensearch.opster.io               2024-03-08T09:10:14Z
opensearchroles.opensearch.opster.io                  2024-03-08T09:10:14Z
opensearchuserrolebindings.opensearch.opster.io       2024-03-08T09:10:14Z
opensearchusers.opensearch.opster.io                  2024-03-08T09:10:14Z
orders.acme.cert-manager.io                           2024-03-08T09:10:21Z
orphans.longhorn.io                                   2024-03-08T10:25:13Z
podmonitors.monitoring.coreos.com                     2024-03-08T10:08:42Z
poolers.postgresql.cnpg.io                            2024-03-08T09:10:15Z
probes.monitoring.coreos.com                          2024-03-08T10:08:42Z
prometheuses.monitoring.coreos.com                    2024-03-08T10:08:42Z
prometheusrules.monitoring.coreos.com                 2024-03-08T10:08:42Z
providers.notification.toolkit.fluxcd.io              2024-03-06T10:24:09Z
receivers.notification.toolkit.fluxcd.io              2024-03-06T10:24:09Z
recurringjobs.longhorn.io                             2024-03-08T10:25:13Z
replicas.longhorn.io                                  2024-03-08T10:25:13Z
scheduledbackups.postgresql.cnpg.io                   2024-03-08T09:10:15Z
servicemonitors.monitoring.coreos.com                 2024-03-08T10:08:42Z
settings.longhorn.io                                  2024-03-08T10:25:13Z
sharemanagers.longhorn.io                             2024-03-08T10:25:13Z
snapshots.longhorn.io                                 2024-03-08T10:25:13Z
supportbundles.longhorn.io                            2024-03-08T10:25:13Z
systembackups.longhorn.io                             2024-03-08T10:25:13Z
systemrestores.longhorn.io                            2024-03-08T10:25:13Z
thanosrulers.monitoring.coreos.com                    2024-03-08T10:08:43Z
tigerastatuses.operator.tigera.io                     2024-03-04T15:26:57Z
volumeattachments.longhorn.io                         2024-03-08T10:25:13Z
volumes.longhorn.io                                   2024-03-08T10:25:13Z
volumesnapshotclasses.snapshot.storage.k8s.io         2024-03-07T14:05:11Z
volumesnapshotcontents.snapshot.storage.k8s.io        2024-03-07T14:05:11Z
volumesnapshots.snapshot.storage.k8s.io               2024-03-07T14:05:11Z

I will run the ALTER TABLE statement soon and report back.

ngoeddel-openi commented 4 months ago

After running the ALTER TABLE statement, the Pod seemed to work fine for a while. But after a few minutes I got this:

I0626 09:06:05.647119       1 database.go:285] "Connecting to database" logger="database"
I0626 09:06:07.883124       1 request.go:697] Waited for 1.126928059s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/api/v1/namespaces/cattle-monitoring-system/pods/pushprox-kube-controller-manager-client-qtg8v/log?container=pushprox-client
I0626 09:06:17.883325       1 request.go:697] Waited for 10.135795173s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/api/v1/namespaces/kube-system/pods/etcd-elefant-d-kubm02p/log?container=etcd
E0626 09:07:15.600518       1 runtime.go:79] Observed a panic: &runtime.TypeAssertionError{_interface:(*abi.Type)(0x18681e0), concrete:(*abi.Type)(0x1785860), asserted:(*abi.Type)(0x1a59d40), missingMethod:""} (interface conversion: interface {} is []uint8, not types.UUID)
goroutine 756 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x18cb900, 0xc005851b60})
    /go/pkg/mod/k8s.io/apimachinery@v0.30.1/pkg/util/runtime/runtime.go:75 +0x85
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x1c35998?})
    /go/pkg/mod/k8s.io/apimachinery@v0.30.1/pkg/util/runtime/runtime.go:49 +0x6b
panic({0x18cb900?, 0xc005851b60?})
    /usr/local/go/src/runtime/panic.go:770 +0x132
github.com/icinga/icinga-kubernetes/pkg/schema/v1.SyncContainers.func2()
    /build/pkg/schema/v1/container.go:432 +0x796
golang.org/x/sync/errgroup.(*Group).Go.func1()
    /go/pkg/mod/golang.org/x/sync@v0.7.0/errgroup/errgroup.go:78 +0x56
created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 68
    /go/pkg/mod/golang.org/x/sync@v0.7.0/errgroup/errgroup.go:75 +0x96
panic: interface conversion: interface {} is []uint8, not types.UUID [recovered]
    panic: interface conversion: interface {} is []uint8, not types.UUID

goroutine 756 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x1c35998?})
    /go/pkg/mod/k8s.io/apimachinery@v0.30.1/pkg/util/runtime/runtime.go:56 +0xcd
panic({0x18cb900?, 0xc005851b60?})
    /usr/local/go/src/runtime/panic.go:770 +0x132
github.com/icinga/icinga-kubernetes/pkg/schema/v1.SyncContainers.func2()
    /build/pkg/schema/v1/container.go:432 +0x796
golang.org/x/sync/errgroup.(*Group).Go.func1()
    /go/pkg/mod/golang.org/x/sync@v0.7.0/errgroup/errgroup.go:78 +0x56
created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 68
    /go/pkg/mod/golang.org/x/sync@v0.7.0/errgroup/errgroup.go:75 +0x96

It looks like a completely different problem though.
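
The panic message itself describes the failure mode: a value stored in an `interface{}` held a `[]uint8` (that is, `[]byte`), while the code asserted a UUID type without checking. The sketch below shows that mechanism and a defensive alternative. The `UUID` type is a hypothetical stand-in, since the definition of `types.UUID` is not shown in this thread, and `asUUID` is an illustrative helper rather than the fix that was later pushed.

```go
package main

import "fmt"

// UUID is a hypothetical stand-in for the types.UUID named in the panic.
type UUID [16]byte

// asUUID converts v to a UUID with a type switch, so an unexpected
// dynamic type yields an error instead of a panic.
func asUUID(v interface{}) (UUID, error) {
	switch t := v.(type) {
	case UUID:
		return t, nil
	case []byte:
		if len(t) != len(UUID{}) {
			return UUID{}, fmt.Errorf("expected %d bytes, got %d", len(UUID{}), len(t))
		}
		var u UUID
		copy(u[:], t)
		return u, nil
	default:
		return UUID{}, fmt.Errorf("unsupported type %T", v)
	}
}

func main() {
	// A raw byte slice, presumably similar to what was read back for a
	// binary UUID column in the failing code path.
	var v interface{} = make([]byte, 16)

	// A plain assertion such as v.(UUID) would panic here with
	// "interface conversion: interface {} is []uint8, not main.UUID".
	u, err := asUUID(v)
	fmt.Println(u, err)

	_, err = asUUID("not-a-uuid")
	fmt.Println(err)
}
```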

lippserd commented 4 months ago

> It looks like a completely different problem though.

Yes, I'm working on it. Thanks for testing!

lippserd commented 4 months ago

> It looks like a completely different problem though.
>
> Yes, I'm working on it. Thanks for testing!

I pushed some fixes. Could you please pull the image and try again?

ngoeddel-openi commented 4 months ago

I finally found the time to work on this again. After deleting the existing database and its persistent volume and restarting the icinga-kubernetes deployment, it seems to be working. I can see this in the Pod log:

I0712 12:38:20.322918       1 database.go:285] "Connecting to database" logger="database"
I0712 12:38:20.328481       1 driver.go:43] "Can't connect to database. Retrying" logger="database" error="dial tcp 10.43.111.110:3306: connect: connection refused"
I0712 12:39:55.569544       1 driver.go:48] "Reconnected to database" logger="database"
I0712 12:39:55.572964       1 main.go:75] "Importing schema" logger="database"
I0712 12:40:03.681443       1 request.go:697] Waited for 1.005133945s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/api/v1/namespaces/cattle-monitoring-system/pods/pushprox-kube-controller-manager-client-qtg8v/log?container=pushprox-client
I0712 12:40:13.877462       1 request.go:697] Waited for 8.859478226s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/api/v1/namespaces/longhorn-system/pods/engine-image-ei-5cefaf2b-j57fs/log?container=engine-image-ei-5cefaf2b
I0712 12:40:23.877508       1 request.go:697] Waited for 10.741546478s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/api/v1/namespaces/cattle-monitoring-system/pods/pushprox-kube-proxy-client-rf565/log?container=pushprox-client
<...>

From time to time a new log line like the last few is appended to the log and that's it.

However, in IcingaWeb I get this error when I try to use the Kubernetes module:

SQLSTATE[42S22]: Column not found: 1054 Unknown column 'node.id' in 'field list'

#0 /usr/share/icinga-php/ipl/vendor/ipl/sql/src/Connection.php(401): PDO->prepare()
#1 /usr/share/icinga-php/ipl/vendor/ipl/sql/src/Connection.php(418): ipl\Sql\Connection->prepexec()
#2 /usr/share/icinga-php/ipl/vendor/ipl/orm/src/Query.php(699): ipl\Sql\Connection->select()
#3 /usr/share/icinga-php/ipl/vendor/ipl/orm/src/ResultSet.php(142): ipl\Orm\Query->yieldResults()
#4 [internal function]: ipl\Orm\ResultSet->yieldTraversable()
#5 /usr/share/icinga-php/ipl/vendor/ipl/orm/src/ResultSet.php(122): Generator->valid()
#6 /usr/share/icinga-php/ipl/vendor/ipl/orm/src/ResultSet.php(114): ipl\Orm\ResultSet->advance()
#7 /usr/share/icingaweb2/modules/kubernetes/library/Kubernetes/Common/BaseItemList.php(63): ipl\Orm\ResultSet->rewind()
#8 /usr/share/icinga-php/ipl/vendor/ipl/html/src/HtmlDocument.php(344): Icinga\Module\Kubernetes\Common\BaseItemList->assemble()
#9 /usr/share/icinga-php/ipl/vendor/ipl/html/src/HtmlDocument.php(566): ipl\Html\HtmlDocument->ensureAssembled()
#10 /usr/share/icinga-php/ipl/vendor/ipl/html/src/HtmlDocument.php(390): ipl\Html\HtmlDocument->render()
#11 /usr/share/icinga-php/ipl/vendor/ipl/html/src/BaseHtmlElement.php(297): ipl\Html\HtmlDocument->renderUnwrapped()
#12 /usr/share/icinga-php/ipl/vendor/ipl/html/src/BaseHtmlElement.php(365): ipl\Html\BaseHtmlElement->renderContent()
#13 /usr/share/icinga-php/ipl/vendor/ipl/html/src/HtmlDocument.php(568): ipl\Html\BaseHtmlElement->renderUnwrapped()
#14 /usr/share/icinga-php/ipl/vendor/ipl/html/src/HtmlDocument.php(390): ipl\Html\HtmlDocument->render()
#15 /usr/share/icinga-php/ipl/vendor/ipl/html/src/HtmlDocument.php(568): ipl\Html\HtmlDocument->renderUnwrapped()
#16 /usr/share/icinga-php/ipl/vendor/ipl/web/src/Compat/ViewRenderer.php(56): ipl\Html\HtmlDocument->render()
#17 /usr/share/icinga-php/vendor/vendor/shardj/zf1-future/library/Zend/Controller/Action/Helper/ViewRenderer.php(970): ipl\Web\Compat\ViewRenderer->render()
#18 /usr/share/icinga-php/vendor/vendor/shardj/zf1-future/library/Zend/Controller/Action/HelperBroker.php(277): Zend_Controller_Action_Helper_ViewRenderer->postDispatch()
#19 /usr/share/icinga-php/vendor/vendor/shardj/zf1-future/library/Zend/Controller/Action.php(527): Zend_Controller_Action_HelperBroker->notifyPostDispatch()
#20 /usr/share/icingaweb2/library/Icinga/Web/Controller/Dispatcher.php(76): Zend_Controller_Action->dispatch()
#21 /usr/share/icinga-php/vendor/vendor/shardj/zf1-future/library/Zend/Controller/Front.php(954): Icinga\Web\Controller\Dispatcher->dispatch()
#22 /usr/share/icingaweb2/library/Icinga/Application/Web.php(294): Zend_Controller_Front->dispatch()
#23 /usr/share/icingaweb2/library/Icinga/Application/webrouter.php(105): Icinga\Application\Web->dispatch()
#24 /usr/share/icingaweb2/public/index.php(4): require_once(String)
#25 {main}

I don't know if it is related to the Helm chart or if I did something wrong.