Closed: russelg closed this issue 1 month ago
Looks like it's related to the task system. Are you running the provided postgres image from the compose file, or running it some other way? Any guess as to how often this happens? Any error logs from the postgres server?
Until I get around to finding out what the issue is, you could add a health check to the API container:

```yaml
healthcheck:
  test: curl --fail http://localhost:4000/health || exit 1
  interval: 60s
  retries: 5
  start_period: 60s
  timeout: 10s
```

With a restart policy of `always` or `unless-stopped`, this will restart the API container if it can't curl the health route.
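For reference, here's roughly how that slots into the compose file. The service name is a guess based on this thread and the rest of the service config is elided, so adjust it to whatever your compose file already has:

```yaml
services:
  ganymede-api:
    # ... keep your existing image, ports, volumes, environment, etc. ...
    restart: unless-stopped
    healthcheck:
      # mark the container unhealthy if the health route stops responding
      test: curl --fail http://localhost:4000/health || exit 1
      interval: 60s
      retries: 5
      start_period: 60s
      timeout: 10s
```

(If I remember right, a plain restart policy only kicks in when the container process actually exits, not just when the healthcheck marks it unhealthy, so you may still need something watching the health status for the restart to happen automatically.)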
It occurred after about 6 days of uptime on the API. I only noticed because I refreshed a ganymede tab I had open and it wouldn't load the data. I'm running `postgres:14` in my compose file. I neglected to check the postgres logs from when it happened, so if it happens again I'll check those.
This has happened previously, about the same uptime as well.
I'll add that health check, seems useful.
Following up here: I'm almost certain the API was killed because the server ran out of memory.
I've been keeping tabs on that, and the API has had 316h uptime since I applied the healthcheck (i.e. it hasn't restarted since the change).
I'm happy to close this since even if I do run out of memory again, the healthcheck should restart the container.
Really the only memory-intensive thing is watching a VOD with chat playback. I don't keep chat files in the database; for playback, the chat is read once and then stored in memory for X period of time. It's likely that this is what caused the OOM.
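If memory pressure does come back, one option worth considering alongside the healthcheck is putting a memory cap on the API container in the compose file, so the kernel's OOM killer targets that container rather than whatever else happens to be running on the host. A rough sketch; the service name and the 2g figure are just placeholders to tune for your server:

```yaml
services:
  ganymede-api:
    # ... existing config ...
    mem_limit: 2g   # placeholder cap; when exceeded, processes inside this
                    # container are OOM-killed instead of other host processes
```

Depending on the compose file version you're on, this key may instead need to live under `deploy.resources.limits.memory`.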
Not really sure what other info I can give about this; it just seems that after a while `ganymede-api` gets killed off. The container keeps running in this state, with the API non-functional. In any case, `entrypoint.sh` should probably try to revive the API if it exits.