
/health endpoint randomly failing on k8s #9480

Open eliottness opened 3 days ago

eliottness commented 3 days ago

Support guidelines

I've found a bug and checked that ...

Description

Around 5-10 minutes after a Firefly III container has started, the /health endpoint stops responding OK. This causes Kubernetes to kill and restart the pod, which gradually drives it into CrashLoopBackOff, making it unavailable.
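For anyone hitting the same thing, a quick way to confirm it really is the probe (and not the app crashing on its own) is to inspect the probe config the chart applied and watch the restart counter. The deployment name and namespace below are assumptions based on the chart defaults; adjust to your release:

```
# Show the liveness probe settings the chart applied:
kubectl get deployment firefly-iii -n firefly \
  -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}'

# Watch the restart counter climb while the probe keeps failing:
kubectl get pods -n firefly -w
```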

Debug information

Debug information generated at 2024-11-18 20:51:23 for Firefly III version v6.1.22.

System information

| Item | Value |
| --- | --- |
| Firefly III | 6.1.22 / v2.1.0 / 25 (exp. 25) |
| PHP version | 8.3.13 (64bits) / apache2handler / Linux x86_64 |
| BC scale | 12 |
| Error reporting | Display: Off, reporting: ALL errors |
| Max upload | 67108864 (64 MB) |
| Database drivers | mysql, *pgsql*, sqlite |
| Docker build | #1147, base #92 |

Firefly III information

| Item | Value |
| --- | --- |
| Timezone | Europe/Paris + Europe/Paris |
| App environment | production, debug: false |
| Layout | v1 |
| Logging | info, stack / (empty) |
| Cache driver | file |
| Default language and locale | fr_FR + equal |
| Trusted proxies | ** |
| Login provider & user guard | eloquent / remote_user_guard |
| Login headers | X-authentik-email + X-authentik-email |
| Stateful domains | |
| Last cron job | 2024-11-17 23:00:00 (20 hours ago) |
| Mailer | smtp |

User-specific information

| Item | Value |
| --- | --- |
| User | #1 of 3 |
| User flags | :ledger: :wrench: :clock130: :email: |
| Session start | 2024-10-01 00:00:00 |
| Session end | 2024-12-31 00:00:00 |
| View range | 3M |
| User language | en_GB |
| User locale | en_GB |
| Locale(s) supported | en_GB.utf8: :white_check_mark:, en_GB.UTF-8: :white_check_mark: |
| User agent | Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:132.0) Gecko/20100101 Firefox/132.0 |

Expected behaviour

I would expect the /health endpoint to keep returning OK.

Steps to reproduce

  1. Set up a Rancher (or minikube) cluster.
  2. Use the Helm chart setup described here: https://firefly-iii.github.io/kubernetes/
  3. Wait and look for Unhealthy k8s events (a way to watch for them is sketched below).
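
For step 3, one way to stream only the probe-failure events (the namespace is an assumption based on the chart defaults):

```
# Watch only the Unhealthy events in the namespace:
kubectl get events -n firefly --field-selector reason=Unhealthy --watch
```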

Additional info

Since there are no relevant logs each time a pod gets killed, it took me a while to unearth this.

Here is the Kubernetes event, even though it should be fairly useless:

```
Unhealthy: Liveness probe failed: Get "http://10.42.2.67:8080/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Killing: Container firefly-iii failed liveness probe, will be restarted
```
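
Since the event shows a timeout, timing the endpoint from inside the pod itself should tell whether /health is genuinely slow or whether something network-side is at fault. This is only a sketch: it assumes curl is present in the image, and the deployment name and namespace are placeholders (the 8080 port comes from the probe URL above):

```
# Time /health from inside the container itself:
kubectl exec -n firefly deploy/firefly-iii -- \
  curl -s -o /dev/null -w 'HTTP %{http_code} in %{time_total}s\n' \
  http://localhost:8080/health
```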

I can only make wild guesses, but maybe this is some kind of rate limiting...? The only workaround for now is to manually edit the deployment manifest and remove all health checks.
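
A softer variant of that workaround, if deleting the probes outright feels too blunt, is to relax them instead. This is a sketch only; the deployment name, namespace, and container index are assumptions about what the chart produces:

```
# Raise the probe timeout and failure threshold instead of deleting it:
kubectl patch deployment firefly-iii -n firefly --type=json -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 10},
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/failureThreshold", "value": 6}
]'
```

Note that the next helm upgrade will overwrite this patch, so it only buys time for debugging.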

JC5 commented 3 days ago

Hey, thanks for opening an issue.

I know nothing about Kubernetes, so I won't be able to do anything Kubernetes-related about this. But I'm surprised: the health endpoint times out, and the rest doesn't? They're all tied to the same code; there's no rate limiting or anything.

NerdyShawn commented 1 day ago

Is this the stack chart? If you check the app pod logs, do they give more detail? For example, I think mine was having issues reaching the Postgres database, which is why the app pod would never go healthy.

JC5 commented 6 hours ago

That could be the issue, sure. Could you share some more details?

NerdyShawn commented 5 hours ago

I've been sorting through a number of issues on this chart recently and hope to help with some of the chart efforts. I think this particular issue, where the app pod never goes healthy, comes up because, if you look at the app svc logs, the app is referencing a service firefly-db which doesn't exist:


```
k logs -n firefly svc/firefly-iii | grep firefly-db | tail -n 1
[previous exception] [object] (PDOException(code: 0): PDO::__construct(): php_network_getaddresses: getaddrinfo for firefly-db failed: Name or service not known at /var/www/html/vendor/laravel/framework/src/Illuminate/Database/Connectors/Connector.php:65)
```
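
A quick way to double-check that the name really doesn't resolve in-cluster (the image below is just a common debug image; anything with nslookup works, and the namespace is whatever your release uses):

```
# One-off pod to test in-cluster DNS resolution of the db service name:
kubectl run -it --rm dns-test -n firefly --image=busybox:1.36 \
  --restart=Never -- nslookup firefly-db
```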

The default service names from the chart don't align with what the app pod is looking for; in my case the db service is actually named firefly-iii-firefly-db.

```
k get svc -n firefly-iii
NAME                     TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
firefly-iii              ClusterIP   10.43.88.229   <none>        80/TCP     17m
firefly-iii-firefly-db   ClusterIP   10.43.221.9    <none>        5432/TCP   17m
firefly-iii-importer     ClusterIP   10.43.191.66   <none>        80/TCP     17m
```

The app expects whatever is in its environment variable, but I believe the stack chart sets a different db host. Essentially, there are some naming inconsistencies between different places.
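
To confirm the mismatch is the cause without touching the chart, one option is to point the app at the service that does exist. DB_HOST is the variable Firefly III reads for the database host; the deployment name and namespace below are from my setup and may differ in yours:

```
# Point the app at the db service name that actually exists. A future
# helm upgrade will revert this; the real fix belongs in the chart values:
kubectl set env deployment/firefly-iii -n firefly-iii \
  DB_HOST=firefly-iii-firefly-db
```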