ChalmersGU-data-structure-courses / lab-system

Lab system and other scripts for grading and course administration (Canvas and GitLab integration). Currently used by the CSE department at Chalmers and GU for several courses, including data structures and algorithms.

Proper notifications for event loop failure #24

Open sattlerc opened 1 month ago

sattlerc commented 1 month ago

Currently, the event loop has a hacky flag -e that allows you to specify a Google spreadsheet that will be edited with an exception message when the event loop fails. By subscribing to notifications on that spreadsheet, we can get failure notifications.

It would be better to solve this in a more standard way that doesn't rely on external services.

Now that we are transitioning to running the event loop as a (user-level) systemd service, we can use systemd's OnFailure feature for this monitoring. This should activate a unit that collects the exception from the end of the log and sends it by email to the local user. (In turn, the local user can configure forwarding addresses in .forward.) A complication here: log messages may be multiline.
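
The log-collection step could be sketched as follows. This is only an illustration under assumptions: the event loop is assumed to log ordinary Python tracebacks, the handler is assumed to receive the tail of the log as a list of lines (however it is obtained), and a sendmail-compatible binary is assumed to be on the PATH. All function names are hypothetical and not part of the lab system. The multiline complication is handled by taking everything from the last traceback header to the end of the log.

```python
import getpass
import subprocess
from email.message import EmailMessage

TRACEBACK_HEADER = "Traceback (most recent call last):"


def extract_last_exception(log_lines):
    """Return the last traceback in the log tail as one multiline string.

    Falls back to the final non-empty line if no traceback header is found.
    """
    lines = [line.rstrip("\n") for line in log_lines]
    # Search backwards so that older, unrelated tracebacks are skipped.
    for index in range(len(lines) - 1, -1, -1):
        if TRACEBACK_HEADER in lines[index]:
            return "\n".join(lines[index:]).strip()
    non_empty = [line for line in lines if line.strip()]
    return non_empty[-1] if non_empty else ""


def build_failure_mail(unit, log_lines, recipient=None):
    """Build a mail to the local user describing the failure of `unit`."""
    message = EmailMessage()
    message["Subject"] = f"{unit} failed"
    message["To"] = recipient or getpass.getuser()
    message.set_content(extract_last_exception(log_lines) or "(no log output)")
    return message


def send_with_sendmail(message):
    """Hand the mail to the local MTA (assumes a sendmail-compatible binary)."""
    subprocess.run(["sendmail", "-t"], input=message.as_bytes(), check=True)
```

Note that for chained exceptions this keeps only the final traceback in the chain; if the full chain is wanted, the search would need to continue past the "During handling of the above exception" separator.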

sattlerc commented 1 month ago

@Niklas-Deworetzki Do you want to take a look at this? Peter can give you time.

Niklas-Deworetzki commented 1 month ago

Sure. The industry standard for monitoring would be to use Prometheus. We could start a Prometheus client together with the event loop that reports different metrics (checked repositories, completed/rejected outgoing connections, number of tags created by students). But that might be a bit of overkill in this case.

I think the error reporting is better done as part of the script itself. If we catch top-level error messages in the event loop, we can also try to alert an administrator for them. There is probably more we can do in Python than by using some black magic and systemd.
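
A top-level handler of this kind could look roughly like the sketch below. It rests on assumptions: `run_event_loop` and `notify` are hypothetical callables standing in for the real event loop entry point and for whatever alerting channel is chosen (mail, chat webhook); neither name comes from the lab system.

```python
import logging
import traceback

logger = logging.getLogger(__name__)


def run_with_alert(run_event_loop, notify):
    """Run the event loop; on an unhandled exception, alert and re-raise.

    Re-raising keeps the non-zero exit status, so a supervisor such as
    systemd still sees the failure.
    """
    try:
        run_event_loop()
    except Exception:
        # format_exc() returns the full multiline traceback as one string.
        report = traceback.format_exc()
        try:
            notify(report)
        except Exception:
            # A broken alert channel must not hide the original error.
            logger.exception("failed to send failure notification")
        raise
```

One limitation of the in-process approach: it cannot report anything if the process is killed from outside (for example by SIGKILL), since no Python code runs in that case.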

sattlerc commented 1 month ago

> I think the error reporting is better done as part of the script itself. If we catch top-level error messages in the event loop, we can also try to alert an administrator for them. There is probably more we can do in Python than by using some black magic and systemd.

How about this?

sattlerc commented 2 weeks ago

Who is working on this?