Increase Tobira Worker Resilience When Opencast is Unreachable

geichelberger commented 5 months ago

If Opencast becomes unreachable, the Tobira worker crashes, and the systemd service ends up in a failed state because the retry attempts are unsuccessful. This can happen during network outages or Opencast updates.

The expected behavior is for the worker to handle the error and, importantly, keep running instead of exiting.
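
To make the expected behavior concrete, here is a minimal sketch of a retry loop with exponential backoff. It is written in Python purely for illustration and is not Tobira's actual implementation: `run_with_backoff`, `sync_once`, and all parameters are hypothetical names, and the delays are arbitrary assumptions.

```python
import time
from typing import Callable, Optional


def run_with_backoff(
    sync_once: Callable[[], None],
    *,
    initial_delay: float = 1.0,
    max_delay: float = 300.0,
    max_iterations: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Call `sync_once` in a loop; on failure, log and retry instead of exiting.

    `sync_once` is a hypothetical stand-in for one synchronization pass,
    including any version check against Opencast. `max_iterations` and
    `sleep` exist only so the loop can be bounded and tested.
    """
    delay = initial_delay
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        iterations += 1
        try:
            sync_once()
            # A successful pass resets the backoff.
            delay = initial_delay
        except Exception as err:
            # Log the error and wait, rather than letting the process die.
            print(f"error synchronizing with Opencast: {err}; retrying in {delay:.0f}s")
            sleep(delay)
            # Double the wait after each failure, capped at max_delay.
            delay = min(delay * 2, max_delay)
```

The point of the sketch is that the very first request sits inside the loop too, so an Opencast outage at startup is handled the same way as one that occurs later.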

LukasKalbertodt commented 5 months ago

Can you give more details? The worker should in fact not fail when Opencast is down. I regularly look at three long-running Tobira systems and there the worker never failed because of an unavailable Opencast. It just prints errors to the log but recovers automatically. So you have to give me more details to reproduce your error state. What Tobira version? What exactly are you doing?

geichelberger commented 5 months ago

Log:

Jun 04 06:42:47 oc-presentation-01.xyz systemd[1]: Started Tobira Worker.
Jun 04 06:42:47 oc-presentation-01.xyz tobira[157784]: 2024-06-04 06:42:47.155 INFO  tobira >  Starting Tobira ~~ cli_args=["/opt/tobira/tobira", "worker", "-c", "/etc/tobira/config.toml"]
Jun 04 06:42:47 oc-presentation-01.xyz tobira[157784]: 2024-06-04 06:42:47.155 INFO  tobira >  Loaded config ~~ source_file="/etc/tobira/config.toml"
Jun 04 06:42:47 oc-presentation-01.xyz tobira[157784]: 2024-06-04 06:42:47.155 INFO  tobira >  Starting Tobira worker ...
Jun 04 06:42:47 oc-presentation-01.xyz tobira[157784]: 2024-06-04 06:42:47.171 INFO  tobira::db >  Connected to DB! ~~ server_version="15.3" user="tobira" session_user="tobira" schema="tobira" database="tobira"
Jun 04 06:42:47 oc-presentation-01.xyz tobira[157784]: 2024-06-04 06:42:47.176 INFO  tobira::db::migrations >  All migrations are already applied: database schema is up to date.
Jun 04 06:42:47 oc-presentation-01.xyz tobira[157784]: 2024-06-04 06:42:47.246 INFO  tobira::search >  Connected to MeiliSearch at 'https://oc-index-02.xyz:7700'
Jun 04 06:42:47 oc-presentation-01.xyz tobira[157784]: 2024-06-04 06:42:47.287 ERROR tobira >  error synchronizing with Opencast
Jun 04 06:42:47 oc-presentation-01.xyz tobira[157784]:                                      >  
Jun 04 06:42:47 oc-presentation-01.xyz tobira[157784]:                                      >  Caused by:
Jun 04 06:42:47 oc-presentation-01.xyz tobira[157784]:                                      >      0: failed to fetch API version
Jun 04 06:42:47 oc-presentation-01.xyz tobira[157784]:                                      >      1: API returned unexpected HTTP code 503 Service Unavailable (for 'https://xyz/tobira/version', authenticating as 'admin')
Jun 04 06:42:47 oc-presentation-01.xyz tobira[157784]: ▶▶▶ Error: error synchronizing with Opencast
Jun 04 06:42:47 oc-presentation-01.xyz tobira[157784]: Caused by:
Jun 04 06:42:47 oc-presentation-01.xyz tobira[157784]:  ‣ failed to fetch API version
Jun 04 06:42:47 oc-presentation-01.xyz tobira[157784]:    ‣ API returned unexpected HTTP code 503 Service Unavailable (for 'https://xyz/tobira/version', authenticating as 'admin')
Jun 04 06:42:47 oc-presentation-01.xyz systemd[1]: tobira-worker.service: Main process exited, code=exited, status=1/FAILURE
Jun 04 06:42:47 oc-presentation-01.xyz systemd[1]: tobira-worker.service: Failed with result 'exit-code'.
Jun 04 06:42:47 oc-presentation-01.xyz systemd[1]: tobira-worker.service: Scheduled restart job, restart counter is at 5.
Jun 04 06:42:47 oc-presentation-01.xyz systemd[1]: Stopped Tobira Worker.

LukasKalbertodt commented 5 months ago

Oh, so you are saying the worker cannot be started while Opencast is down? But a running worker does not go down with Opencast. Yes?

geichelberger commented 4 months ago

Sorry, I should have been a little bit more precise.

elan-ev/tobira #1175