Open igorrenquin opened 6 months ago
The SRE team is investigating CDTN.
Falco and Crowdsec removed from DEV on 14 June.
Week of 24/06: Matéo started work on deploying CDTN to a dedicated nodepool.
Observation from @gary-van-woerkens: CDTN is isolated in dev and there are no more crashes. Could the crashes be caused by CDTN combined with something else?
To be tested in prod? GO for testing in prod.
A node reboot occurred this morning, even though CDTN had been moved to a dedicated node (except the Hasura part, which is still on the other nodes).
The isolation of CDTN was not complete. An update will be rolled out.
Another 3 weeks to validate.
All of CDTN has been moved to the dedicated nodepools.
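For reference, a minimal Python sketch of what "fully isolated" could mean at the manifest level, and of how a leftover workload such as the Hasura part could be spotted. The label/taint key `nodepool` and value `cdtn` are assumptions for illustration; the actual keys used on the cluster are not given in this thread. `nodeSelector` and `tolerations` are the standard Kubernetes pod spec fields.

```python
import copy

# Hypothetical label/taint key and value for the dedicated pool.
POOL_KEY = "nodepool"
POOL_VALUE = "cdtn"


def pin_to_nodepool(manifest, key=POOL_KEY, value=POOL_VALUE):
    """Return a copy of a Deployment-like manifest pinned to one nodepool."""
    pinned = copy.deepcopy(manifest)
    pod_spec = pinned["spec"]["template"]["spec"]
    # Schedule the pods only on nodes carrying the pool label.
    pod_spec.setdefault("nodeSelector", {})[key] = value
    # Tolerate the taint that keeps other workloads off the pool.
    toleration = {
        "key": key,
        "operator": "Equal",
        "value": value,
        "effect": "NoSchedule",
    }
    tolerations = pod_spec.setdefault("tolerations", [])
    if toleration not in tolerations:
        tolerations.append(toleration)
    return pinned


def not_isolated(manifests, key=POOL_KEY, value=POOL_VALUE):
    """Names of workloads whose pod template lacks the pool nodeSelector."""
    missing = []
    for manifest in manifests:
        pod_spec = manifest["spec"]["template"]["spec"]
        if pod_spec.get("nodeSelector", {}).get(key) != value:
            missing.append(manifest["metadata"]["name"])
    return missing
```

Running `not_isolated` over every workload of the namespace before declaring the isolation complete would have flagged a component left on the shared nodes.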
Watch for crashes on the cdtn nodes.
Check that the core nodes no longer crash.
Mattermost channel: https://mattermost.fabrique.social.gouv.fr/default/channels/alertsovh-prod
Rollback of CDTN, which is now back on the core + worker nodepools.
carte-jeune-engage 21m Warning PolicyViolation job/cron-notification-28803840 policy disallow-host-path/autogen-host-path fail: validation error: HostPath volumes are forbidden. The field spec.volumes[*].hostPath must be unset. rule autogen-host-path failed at path /spec/template/spec/volumes/0/hostPath/
carte-jeune-engage 32m Warning PolicyViolation cronjob/cron-notification policy disallow-host-path/autogen-cronjob-host-path fail: validation error: HostPath volumes are forbidden. The field spec.volumes[*].hostPath must be unset. rule autogen-cronjob-host-path failed at path /spec/jobTemplate/spec/template/spec/volumes/0/hostPath/
ci-data-ia 4m24s Warning SecretSyncError vaultstaticsecret/vault-kubeconfig Failed to update k8s secret: invalid owner label, key=app.kubernetes.io/name, present=false...
ci-data-ia 37m Warning SecretSyncError vaultstaticsecret/vault-kubeconfig Failed to update k8s secret: invalid owner label, key=app.kubernetes.io/component, present=false...
code-du-travail-numerique 2m35s Warning Unhealthy pod/app-78c5ddc857-48w7q Liveness probe failed: Get "http://10.2.199.68:3000/api/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
default 32m Warning PolicyViolation clusterpolicy/disallow-host-path CronJob carte-jeune-engage/cron-notification: [autogen-cronjob-host-path] fail; validation error: HostPath volumes are forbidden. The field spec.volumes[*].hostPath must be unset. rule autogen-cronjob-host-path failed at path /spec/jobTemplate/spec/template/spec/volumes/0/hostPath/
default 21m Warning PolicyViolation clusterpolicy/disallow-host-path Job carte-jeune-engage/cron-notification-28803840: [autogen-host-path] fail; validation error: HostPath volumes are forbidden. The field spec.volumes[*].hostPath must be unset. rule autogen-host-path failed at path /spec/template/spec/volumes/0/hostPath/
default 33m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 32m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 31m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 31m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 33m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 33m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 33m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 33m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 33m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 33m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 31m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 31m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 33m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 31m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 31m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 32m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 32m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 31m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 31m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 32m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 31m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 32m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 31m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 31m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 31m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 31m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 18m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 18m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 18m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 18m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 18m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 18m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 18m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 17m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 17m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 18m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 18m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 18m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 18m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 18m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 17m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 17m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 17m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 18m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 18m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 17m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 17m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 17m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 17m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 17m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 17m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 17m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>
default 21m Warning OOMKilling node/prod-core-nodepool-node-03d66d Memory cgroup out of memory: Killed process 2930908 (celery) total-vm:451576kB, anon-rss:340856kB, file-rss:7976kB, shmem-rss:0kB, UID:1000 pgtables:832kB oom_score_adj:983
default 17m Warning OOMKilling node/prod-core-nodepool-node-03d66d Memory cgroup out of memory: Killed process 1645873 (celery) total-vm:498568kB, anon-rss:383436kB, file-rss:8004kB, shmem-rss:0kB, UID:1000 pgtables:892kB oom_score_adj:983
default 20m Warning OOMKilling node/prod-core-nodepool-node-8a86e9 Memory cgroup out of memory: Killed process 2641421 (postgres) total-vm:431536kB, anon-rss:83004kB, file-rss:57440kB, shmem-rss:132840kB, UID:26 pgtables:764kB oom_score_adj:984
enfants-du-spectacle 5m37s Warning Unhealthy pod/app-ffd755f98-8vdkh Liveness probe failed: Get "http://10.2.198.105:3000/api/healthz": dial tcp 10.2.198.105:3000: connect: connection refused
enfants-du-spectacle 5m37s Warning Unhealthy pod/app-ffd755f98-8vdkh Readiness probe errored: rpc error: code = Unknown desc = failed to exec in container: container is in CONTAINER_EXITED state
enfants-du-spectacle 80s Warning FindingCluster backup/pg-test-scheduledbackup-20241005000000 Unknown cluster pg-test, will retry in 30 seconds
enfants-du-spectacle 80s Warning FindingCluster backup/pg-test-scheduledbackup-20241006000000 Unknown cluster pg-test, will retry in 30 seconds
enfants-du-spectacle 96s Warning FindingCluster backup/pg-test-scheduledbackup-20241007000000 Unknown cluster pg-test, will retry in 30 seconds
jardinmental 21m Warning Unhealthy pod/app-59ff494c55-g4p4m Liveness probe failed: Get "http://10.2.35.127:3000/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
jardinmental 21m Warning Unhealthy pod/app-59ff494c55-lx6vr Liveness probe failed: Get "http://10.2.23.206:3000/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
recosante 20m Warning Unhealthy pod/indice-6bc749d988-z627g Readiness probe failed: command "cat /var/run/readiness-check/readiness-file" timed out
startup-jardinmental--metabase-prod 20m Warning BackOff pod/refresh-views-28804800-n99jz Back-off restarting failed container cronjob in pod refresh-views-28804800-n99jz_startup-jardinmental--metabase-prod(3dfa0dca-db2e-4f97-bebc-b9d8303e804f)
startup-tumeplay--metabase-prod 21m Warning Unhealthy pod/metabase-matomo-sync-3 Liveness probe failed: Get "http://10.2.186.84:80
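The event dump above is dominated by repeated lines, so a grouped view may be easier to read. A minimal Python sketch, assuming each line follows the column order seen above (namespace, age, type, reason, object, message); the function names are illustrative and not part of any existing tooling.

```python
import re
from collections import Counter

# Matches the kernel OOM message format seen in the OOMKilling events.
OOM_RE = re.compile(r"Killed process (\d+) \(([^)]+)\).*?anon-rss:(\d+)kB")


def parse_event(line):
    """Split one event line into its six columns, or return None."""
    parts = line.split(None, 5)
    if len(parts) < 6:
        return None
    namespace, age, kind, reason, obj, message = parts
    return {
        "namespace": namespace,
        "age": age,
        "type": kind,
        "reason": reason,
        "object": obj,
        "message": message,
    }


def summarize(lines):
    """Count events per (namespace, reason, object), most frequent first."""
    counts = Counter()
    for line in lines:
        event = parse_event(line)
        if event:
            counts[(event["namespace"], event["reason"], event["object"])] += 1
    return counts.most_common()


def oom_victims(lines):
    """Yield (node, process name, anon-rss in kB) for each OOMKilling event."""
    for line in lines:
        event = parse_event(line)
        if event and event["reason"] == "OOMKilling":
            match = OOM_RE.search(event["message"])
            if match:
                yield event["object"], match.group(2), int(match.group(3))
```

Feeding it the lines above would collapse the `fabrique-vpa` PolicyError entries into a single row with a count, leaving the OOM kills and failing probes easier to spot.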
Fix in progress for the disallow-host-path violations.
VPA cleanup.
Wait 1 week; if there is no crash, re-enable Falco on 24 October.
@igorrenquin: check whether resource consumption increased during the week of 17 October. @igorrenquin: brainstorm on your own to come up with hypotheses. @FJEANNOT: look for leads together.
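One possible way to frame the resource check, as a rough Python sketch: compare the mean usage per namespace between a baseline week and the week under review, and flag the namespaces whose relative increase exceeds a threshold. The input shape, the 20% threshold and the function names are assumptions; the thread does not say which metrics source would provide the samples.

```python
from statistics import mean


def relative_increase(baseline, current):
    """Relative change of the mean between two series of samples."""
    base = mean(baseline)
    if base == 0:
        raise ValueError("baseline mean is zero, relative change is undefined")
    return (mean(current) - base) / base


def flag_increases(usage_by_namespace, threshold=0.2):
    """Return {namespace: relative increase} for increases above threshold.

    usage_by_namespace maps a namespace to a (baseline, current) pair of
    sample lists, e.g. memory in MiB (hypothetical input shape).
    """
    flagged = {}
    for namespace, (baseline, current) in usage_by_namespace.items():
        change = relative_increase(baseline, current)
        if change > threshold:
            flagged[namespace] = change
    return flagged
```

A mean-based comparison hides short spikes, so the same comparison on a high percentile or on the weekly maximum may be more relevant for node crashes.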
First action plan
Approach: find the cause of the instability before considering other hosting solutions.