SocialGouv / support

Support for the activity of the SocialGouv startups
http://socialgouv.github.io/support

OVH platform investigation #582

Open igorrenquin opened 6 months ago

igorrenquin commented 6 months ago

Initial action plan

Find the cause of the instability before considering other hosting solutions.

igorrenquin commented 6 months ago

The SRE team is investigating on CDTN.

igorrenquin commented 5 months ago

Falco and Crowdsec were removed from DEV on 14 June.

octomir commented 5 months ago

Week of 24/06: Matéo started work on deploying CDTN in a dedicated nodepool.

igorrenquin commented 4 months ago

Observation, @gary-van-woerkens: CDTN is isolated in dev and there are no more crashes. Could the crashes be linked to CDTN plus something else?

To be tested in prod? GO for testing in prod.

gary-van-woerkens commented 3 months ago

A node reboot occurred this morning, even though CDTN had been moved to a dedicated node (except the Hasura part, which is still on the other nodes).

igorrenquin commented 2 months ago

The isolation of CDTN was not complete. An update will be carried out.

Another 3 weeks to validate.
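
One way to confirm that the isolation is complete is to list the pods that still run outside the dedicated pool. A minimal sketch, assuming node names start with the pool name (as in `prod-core-nodepool-node-03d66d`, which appears in this thread's cluster events). The `prod-cdtn-nodepool-` prefix, the `hasura-0` pod name and the `aaaaaa` node suffix are hypothetical; the pairs would come from the NAME and NODE columns of `kubectl get pods -o wide`.

```python
from typing import Iterable

# Hypothetical prefix for nodes of the dedicated CDTN pool; the only pool
# name visible in this thread is "prod-core-nodepool".
DEDICATED_PREFIX = "prod-cdtn-nodepool-"


def pods_outside_pool(
    placements: Iterable[tuple[str, str]], prefix: str = DEDICATED_PREFIX
) -> list[str]:
    """Return the names of pods whose node name does not start with `prefix`."""
    return sorted(pod for pod, node in placements if not node.startswith(prefix))


# Made-up (pod, node) placements used only to illustrate the check.
sample = [
    ("app-78c5ddc857-48w7q", "prod-cdtn-nodepool-node-aaaaaa"),
    ("hasura-0", "prod-core-nodepool-node-03d66d"),
]

print(pods_outside_pool(sample))  # prints ['hasura-0']
```

An empty list would suggest that every listed pod runs on the dedicated pool, under the naming assumption above.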

igorrenquin commented 2 months ago

All of CDTN has been moved to the dedicated nodepools.

Monitor crashes on the cdtn nodes. Monitor that the core nodes do not crash.

Mattermost channel: https://mattermost.fabrique.social.gouv.fr/default/channels/alertsovh-prod

octomir commented 2 months ago

Rollback of CDTN, which is now back on the core + worker nodepools.

igorrenquin commented 1 month ago

carte-jeune-engage                          21m         Warning   PolicyViolation          job/cron-notification-28803840                                 policy disallow-host-path/autogen-host-path fail: validation error: HostPath volumes are forbidden. The field spec.volumes[*].hostPath must be unset. rule autogen-host-path failed at path /spec/template/spec/volumes/0/hostPath/
carte-jeune-engage                          32m         Warning   PolicyViolation          cronjob/cron-notification                                      policy disallow-host-path/autogen-cronjob-host-path fail: validation error: HostPath volumes are forbidden. The field spec.volumes[*].hostPath must be unset. rule autogen-cronjob-host-path failed at path /spec/jobTemplate/spec/template/spec/volumes/0/hostPath/
ci-data-ia                                  4m24s       Warning   SecretSyncError          vaultstaticsecret/vault-kubeconfig                             Failed to update k8s secret: invalid owner label, key=app.kubernetes.io/name, present=false...
ci-data-ia                                  37m         Warning   SecretSyncError          vaultstaticsecret/vault-kubeconfig                             Failed to update k8s secret: invalid owner label, key=app.kubernetes.io/component, present=false...
code-du-travail-numerique                   2m35s       Warning   Unhealthy                pod/app-78c5ddc857-48w7q                                       Liveness probe failed: Get "http://10.2.199.68:3000/api/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
default                                     32m         Warning   PolicyViolation          clusterpolicy/disallow-host-path                               CronJob carte-jeune-engage/cron-notification: [autogen-cronjob-host-path] fail; validation error: HostPath volumes are forbidden. The field spec.volumes[*].hostPath must be unset. rule autogen-cronjob-host-path failed at path /spec/jobTemplate/spec/template/spec/volumes/0/hostPath/
default                                     21m         Warning   PolicyViolation          clusterpolicy/disallow-host-path                               Job carte-jeune-engage/cron-notification-28803840: [autogen-host-path] fail; validation error: HostPath volumes are forbidden. The field spec.volumes[*].hostPath must be unset. rule autogen-host-path failed at path /spec/template/spec/volumes/0/hostPath/
default                                     33m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     32m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     31m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     31m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     33m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     33m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     33m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     33m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     33m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     33m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     31m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     31m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     33m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     31m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     31m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     32m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     32m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     31m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     31m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     32m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     31m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     32m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     31m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     31m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     31m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     31m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     18m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     18m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     18m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     18m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     18m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     18m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     18m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     17m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     17m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     18m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     18m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     18m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     18m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     18m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     17m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     17m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     17m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     18m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     18m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     17m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     17m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     17m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     17m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     17m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     17m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     17m         Warning   PolicyError              clusterpolicy/fabrique-vpa                                     policy fabrique-vpa/create-vpa error: <nil>
default                                     21m         Warning   OOMKilling               node/prod-core-nodepool-node-03d66d                            Memory cgroup out of memory: Killed process 2930908 (celery) total-vm:451576kB, anon-rss:340856kB, file-rss:7976kB, shmem-rss:0kB, UID:1000 pgtables:832kB oom_score_adj:983
default                                     17m         Warning   OOMKilling               node/prod-core-nodepool-node-03d66d                            Memory cgroup out of memory: Killed process 1645873 (celery) total-vm:498568kB, anon-rss:383436kB, file-rss:8004kB, shmem-rss:0kB, UID:1000 pgtables:892kB oom_score_adj:983
default                                     20m         Warning   OOMKilling               node/prod-core-nodepool-node-8a86e9                            Memory cgroup out of memory: Killed process 2641421 (postgres) total-vm:431536kB, anon-rss:83004kB, file-rss:57440kB, shmem-rss:132840kB, UID:26 pgtables:764kB oom_score_adj:984
enfants-du-spectacle                        5m37s       Warning   Unhealthy                pod/app-ffd755f98-8vdkh                                        Liveness probe failed: Get "http://10.2.198.105:3000/api/healthz": dial tcp 10.2.198.105:3000: connect: connection refused
enfants-du-spectacle                        5m37s       Warning   Unhealthy                pod/app-ffd755f98-8vdkh                                        Readiness probe errored: rpc error: code = Unknown desc = failed to exec in container: container is in CONTAINER_EXITED state
enfants-du-spectacle                        80s         Warning   FindingCluster           backup/pg-test-scheduledbackup-20241005000000                  Unknown cluster pg-test, will retry in 30 seconds
enfants-du-spectacle                        80s         Warning   FindingCluster           backup/pg-test-scheduledbackup-20241006000000                  Unknown cluster pg-test, will retry in 30 seconds
enfants-du-spectacle                        96s         Warning   FindingCluster           backup/pg-test-scheduledbackup-20241007000000                  Unknown cluster pg-test, will retry in 30 seconds
jardinmental                                21m         Warning   Unhealthy                pod/app-59ff494c55-g4p4m                                       Liveness probe failed: Get "http://10.2.35.127:3000/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
jardinmental                                21m         Warning   Unhealthy                pod/app-59ff494c55-lx6vr                                       Liveness probe failed: Get "http://10.2.23.206:3000/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
recosante                                   20m         Warning   Unhealthy                pod/indice-6bc749d988-z627g                                    Readiness probe failed: command "cat /var/run/readiness-check/readiness-file" timed out
startup-jardinmental--metabase-prod         20m         Warning   BackOff                  pod/refresh-views-28804800-n99jz                               Back-off restarting failed container cronjob in pod refresh-views-28804800-n99jz_startup-jardinmental--metabase-prod(3dfa0dca-db2e-4f97-bebc-b9d8303e804f)
startup-tumeplay--metabase-prod             21m         Warning   Unhealthy                pod/metabase-matomo-sync-3                                     Liveness probe failed: Get "http://10.2.186.84:80

igorrenquin commented 1 month ago

Fix in progress for the disallow-host-path violations. Cleanup of the VPA.
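
The event dump above is easier to read once identical lines are tallied. A minimal sketch, assuming the dump has the six whitespace-separated columns of `kubectl get events -A`-style output (namespace, last seen, type, reason, object, message). The sample lines are copied from the dump with the column padding collapsed; the function name is illustrative.

```python
from collections import Counter


def tally_events(lines):
    """Count events per (namespace, reason, object), ignoring age and message."""
    counts = Counter()
    for line in lines:
        # Split into at most 6 fields so the free-text message stays in one piece.
        parts = line.split(None, 5)
        if len(parts) < 5:
            continue  # skip blank or truncated lines
        namespace, _age, _type, reason, obj = parts[:5]
        counts[(namespace, reason, obj)] += 1
    return counts


sample = [
    "default 33m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>",
    "default 32m Warning PolicyError clusterpolicy/fabrique-vpa policy fabrique-vpa/create-vpa error: <nil>",
    "enfants-du-spectacle 80s Warning FindingCluster backup/pg-test-scheduledbackup-20241005000000 Unknown cluster pg-test, will retry in 30 seconds",
]

for key, count in tally_events(sample).most_common():
    print(count, *key)
```

Run against the full dump, this would collapse the repeated `fabrique-vpa` lines into a single count, which makes the remaining probe failures and OOM kills easier to spot.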

igorrenquin commented 1 month ago

Wait 1 week; if there is no crash, re-enable Falco on 24 October.

igorrenquin commented 1 month ago

@igorrenquin check whether resource consumption increased during the week of 17 October. @igorrenquin brainstorm on your own to try to come up with hypotheses. @FJEANNOT look for leads together.
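
One possible input for the resource-consumption check is the set of OOMKilling events already posted in this thread. A minimal sketch, assuming the kernel message format seen in those events; the three messages are copied from the dump, and the function name and regular expression are illustrative.

```python
import re
from collections import defaultdict

# Captures the process name and its anonymous RSS (in kB) from an OOM kill message.
OOM_RE = re.compile(r"Killed process \d+ \(([^)]+)\).*?anon-rss:(\d+)kB")


def oom_rss_by_process(messages):
    """Map each killed process name to the list of anon-rss values (kB) at kill time."""
    rss = defaultdict(list)
    for message in messages:
        match = OOM_RE.search(message)
        if match:
            rss[match.group(1)].append(int(match.group(2)))
    return dict(rss)


messages = [
    "Memory cgroup out of memory: Killed process 2930908 (celery) total-vm:451576kB, anon-rss:340856kB, file-rss:7976kB, shmem-rss:0kB, UID:1000 pgtables:832kB oom_score_adj:983",
    "Memory cgroup out of memory: Killed process 1645873 (celery) total-vm:498568kB, anon-rss:383436kB, file-rss:8004kB, shmem-rss:0kB, UID:1000 pgtables:892kB oom_score_adj:983",
    "Memory cgroup out of memory: Killed process 2641421 (postgres) total-vm:431536kB, anon-rss:83004kB, file-rss:57440kB, shmem-rss:132840kB, UID:26 pgtables:764kB oom_score_adj:984",
]

print(oom_rss_by_process(messages))
# prints {'celery': [340856, 383436], 'postgres': [83004]}
```

Comparing these values across several days of events could show whether a given workload is growing, although OOM kills only reflect containers that hit their memory limit, not overall node usage.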