
/health endpoint randomly failing on k8s #9480

Open eliottness opened 3 days ago

eliottness commented 3 days ago

Support guidelines

I've found a bug and checked that ...

Description

Around 5-10 minutes after a Firefly III container has started, the /health endpoint stops responding OK. This causes Kubernetes to kill and restart the pod, which gradually drives it into CrashLoopBackOff, making it unavailable.
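For anyone hitting the same thing, a quick way to confirm it really is the probe (and not the app crashing on its own) is to inspect the probe config the chart applied and watch the restart counter. The deployment name and namespace below are assumptions based on the chart defaults; adjust to your release:

```
# Show the liveness probe settings the chart applied:
kubectl get deployment firefly-iii -n firefly \
  -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}'

# Watch the restart counter climb while the probe keeps failing:
kubectl get pods -n firefly -w
```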

Debug information

Debug information generated at 2024-11-18 20:51:23 for Firefly III version v6.1.22.

System information

| Item | Value |
| --- | --- |
| Firefly III | 6.1.22 / v2.1.0 / 25 (exp. 25) |
| PHP version | 8.3.13 (64bits) / apache2handler / Linux x86_64 |
| BC scale | 12 |
| Error reporting | Display: Off, reporting: ALL errors |
| Max upload | 67108864 (64 MB) |
| Database drivers | mysql, *pgsql*, sqlite |
| Docker build | #1147, base #92 |

Firefly III information

| Item | Value |
| --- | --- |
| Timezone | Europe/Paris + Europe/Paris |
| App environment | production, debug: false |
| Layout | v1 |
| Logging | info, stack / (empty) |
| Cache driver | file |
| Default language and locale | fr_FR + equal |
| Trusted proxies | ** |
| Login provider & user guard | eloquent / remote_user_guard |
| Login headers | X-authentik-email + X-authentik-email |
| Stateful domains | |
| Last cron job | 2024-11-17 23:00:00 (20 hours ago) |
| Mailer | smtp |

User-specific information

| Item | Value |
| --- | --- |
| User | #1 of 3 |
| User flags | :ledger: :wrench: :clock130: :email: |
| Session start | 2024-10-01 00:00:00 |
| Session end | 2024-12-31 00:00:00 |
| View range | 3M |
| User language | en_GB |
| User locale | en_GB |
| Locale(s) supported | en_GB.utf8: :white_check_mark:, en_GB.UTF-8: :white_check_mark: |
| User agent | Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:132.0) Gecko/20100101 Firefox/132.0 |

Expected behaviour

I would expect the /health endpoint to keep returning OK.

Steps to reproduce

  1. Set up a Rancher (or minikube) cluster.
  2. Use the Helm chart setup described here: https://firefly-iii.github.io/kubernetes/
  3. Wait and look for Unhealthy k8s events (a way to watch for them is sketched below).
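
For step 3, one way to stream only the probe-failure events (the namespace is an assumption based on the chart defaults):

```
# Watch only the Unhealthy events in the namespace:
kubectl get events -n firefly --field-selector reason=Unhealthy --watch
```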

Additional info

Since there are no relevant logs each time a pod gets killed, it took me a while to unearth this.

Here is the Kubernetes event, even though it should be fairly useless:

```
Unhealthy: Liveness probe failed: Get "http://10.42.2.67:8080/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Killing: Container firefly-iii failed liveness probe, will be restarted
```
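
Since the event shows a timeout, timing the endpoint from inside the pod itself should tell whether /health is genuinely slow or whether something network-side is at fault. This is only a sketch: it assumes curl is present in the image, and the deployment name and namespace are placeholders (the 8080 port comes from the probe URL above):

```
# Time /health from inside the container itself:
kubectl exec -n firefly deploy/firefly-iii -- \
  curl -s -o /dev/null -w 'HTTP %{http_code} in %{time_total}s\n' \
  http://localhost:8080/health
```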

I can only make wild guesses, but maybe this is some kind of rate limiting...? The only workaround for now is to manually edit the deployment manifest and remove all health checks.
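
A softer variant of that workaround, if deleting the probes outright feels too blunt, is to relax them instead. This is a sketch only; the deployment name, namespace, and container index are assumptions about what the chart produces:

```
# Raise the probe timeout and failure threshold instead of deleting it:
kubectl patch deployment firefly-iii -n firefly --type=json -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 10},
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/failureThreshold", "value": 6}
]'
```

Note that the next helm upgrade will overwrite this patch, so it only buys time for debugging.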

JC5 commented 3 days ago

Hey, thanks for opening an issue.

I know nothing about Kubernetes, so I won't be able to do anything Kubernetes-related about this. But I'm surprised: the health endpoint times out, and the rest doesn't? They're all tied to the same code; there's no rate limiting or anything.

NerdyShawn commented 1 day ago

Is this the stack chart? If you check the app pod logs, do they give more detail? For example, I think mine was having issues reaching the Postgres database, which is why the app pod would never go healthy.

JC5 commented 6 hours ago

That could be the issue, sure. Could you share some more details?

NerdyShawn commented 5 hours ago

I've been sorting through a number of issues on this chart recently and hope to help with some of the chart efforts. I think this particular issue, where the app pod never goes healthy, comes up because, if you look at the app svc logs, the app is referencing a service firefly-db which doesn't exist:


```
k logs -n firefly svc/firefly-iii | grep firefly-db | tail -n 1
[previous exception] [object] (PDOException(code: 0): PDO::__construct(): php_network_getaddresses: getaddrinfo for firefly-db failed: Name or service not known at /var/www/html/vendor/laravel/framework/src/Illuminate/Database/Connectors/Connector.php:65)
```
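
A quick way to double-check that the name really doesn't resolve in-cluster (the image below is just a common debug image; anything with nslookup works, and the namespace is whatever your release uses):

```
# One-off pod to test in-cluster DNS resolution of the db service name:
kubectl run -it --rm dns-test -n firefly --image=busybox:1.36 \
  --restart=Never -- nslookup firefly-db
```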

The default service names from the chart don't align with what the app pod is looking for; in my case the db service is actually named firefly-iii-firefly-db.

```
k get svc -n firefly-iii
NAME                     TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
firefly-iii              ClusterIP   10.43.88.229   <none>        80/TCP     17m
firefly-iii-firefly-db   ClusterIP   10.43.221.9    <none>        5432/TCP   17m
firefly-iii-importer     ClusterIP   10.43.191.66   <none>        80/TCP     17m
```

The app expects whatever is in its environment variable, but I believe the stack chart sets a different db host. Essentially, there are some naming inconsistencies between different places.
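
To confirm the mismatch is the cause without touching the chart, one option is to point the app at the service that does exist. DB_HOST is the variable Firefly III reads for the database host; the deployment name and namespace below are from my setup and may differ in yours:

```
# Point the app at the db service name that actually exists. A future
# helm upgrade will revert this; the real fix belongs in the chart values:
kubectl set env deployment/firefly-iii -n firefly-iii \
  DB_HOST=firefly-iii-firefly-db
```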