Closed: russelg closed this issue 1 month ago
Looks like it's related to the task system. Are you running the provided postgres image from the compose file, or running it some other way? Any guess as to how often this happens? Any error logs from the postgres server?
Until I get around to finding out what the issue is, you could add a health check to the API container:

```yaml
healthcheck:
  test: curl --fail http://localhost:4000/health || exit 1
  interval: 60s
  retries: 5
  start_period: 60s
  timeout: 10s
```

With a restart policy of `always` or `unless-stopped`, this will restart the API container if it can't curl the health route.
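For reference, here's roughly how that slots into the compose file. The service name is a guess based on this thread and the rest of the service config is elided, so adjust it to whatever your compose file already has:

```yaml
services:
  ganymede-api:
    # ... keep your existing image, ports, volumes, environment, etc. ...
    restart: unless-stopped
    healthcheck:
      # mark the container unhealthy if the health route stops responding
      test: curl --fail http://localhost:4000/health || exit 1
      interval: 60s
      retries: 5
      start_period: 60s
      timeout: 10s
```

(If I remember right, a plain restart policy only kicks in when the container process actually exits, not just when the healthcheck marks it unhealthy, so you may still need something watching the health status for the restart to happen automatically.)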
It occurred after about 6 days of uptime on the API. I only noticed because I refreshed a ganymede tab I had open and it wouldn't load the data. I'm running `postgres:14` in my compose file. I neglected to check the postgres logs from when it happened, so if it happens again I'll check those.
This has happened previously, about the same uptime as well.
I'll add that health check, seems useful.
Following up here: I'm almost certain the API was killed because the server ran out of memory.
I've been keeping tabs on that, and the API has had 316h uptime since I applied the healthcheck (i.e. it hasn't restarted since the change).
I'm happy to close this since even if I do run out of memory again, the healthcheck should restart the container.
Really the only memory-intensive thing is watching a VOD with chat playback. I don't keep chat files in the database; for playback, the chat is read once and then stored in memory for X period of time. It's likely that this is what caused the OOM.
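If memory pressure does come back, one option worth considering alongside the healthcheck is putting a memory cap on the API container in the compose file, so the kernel's OOM killer targets that container rather than whatever else happens to be running on the host. A rough sketch; the service name and the 2g figure are just placeholders to tune for your server:

```yaml
services:
  ganymede-api:
    # ... existing config ...
    mem_limit: 2g   # placeholder cap; when exceeded, processes inside this
                    # container are OOM-killed instead of other host processes
```

Depending on the compose file version you're on, this key may instead need to live under `deploy.resources.limits.memory`.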
Not really sure what other info I can give about this; it just seems that after a while `ganymede-api` gets killed off. The container keeps running in this state, with the API non-functional. In any case, `entrypoint.sh` should probably try to revive the API if it exits.