Need to handle KATPortalClient timeout errors better

david-macmahon commented 4 years ago

Occasionally, the KATPortalClient's connection to the KATPortal server times out. When this happens, manual intervention is required to get the system back into an operational state. The reason for these timeouts is not understood and may be outside our code base, but regardless of the underlying cause, KATPortalClient should handle this situation more gracefully so that the backend remains in an operational state (to whatever extent that's possible).

danielczech commented 4 years ago

This bug is not as easy to track down as the others, since the particular timeout error never occurs when testing on the CAM development system. As mentioned above, so far it has only occurred (intermittently) during live observations. In the earlier katportal_server version, when this error occurred, the current observation would be lost (but the katportal_server would restart, allowing subsequent observations to continue).

I have tracked the problem down to the schedule_blocks sensor, and have made two changes (see 2205214) to try to handle this particular timeout more gracefully.

Firstly, I have manually specified a timeout duration for run_sync which will hopefully be sufficient. This raises the question: should a timeout duration be specified for all run_sync calls? The error has not been observed for any of the other "once-off" sensors so far.

Secondly, I have wrapped the call in a try block, which should facilitate debugging during the next testing session and at least allow the current observation to continue without intervention (minus the schedule-block information).
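
For readers following along, the two changes described above can be sketched roughly as follows. This is not the code from the referenced commit: it uses the standard-library asyncio.wait_for as a stand-in for a run_sync call with an explicit timeout, and fetch_schedule_blocks, SCHEDULE_BLOCK_TIMEOUT and the returned placeholder value are all hypothetical names chosen for illustration.

```python
import asyncio

# Hypothetical timeout in seconds; kept very short so the demo runs quickly.
SCHEDULE_BLOCK_TIMEOUT = 0.05


async def fetch_schedule_blocks(delay):
    """Hypothetical stand-in for the schedule_blocks sensor request."""
    await asyncio.sleep(delay)  # simulates network latency
    return ["sb-placeholder"]


def get_schedule_blocks(delay):
    """Return schedule-block info, or None if the request times out."""
    try:
        # Change 1: bound the call with an explicit timeout.
        return asyncio.run(
            asyncio.wait_for(
                fetch_schedule_blocks(delay), timeout=SCHEDULE_BLOCK_TIMEOUT
            )
        )
    except asyncio.TimeoutError:
        # Change 2: report the failure and carry on, so the observation
        # continues without the schedule-block information.
        print("schedule_blocks request timed out; continuing without it")
        return None
```

Calling get_schedule_blocks(0) returns the placeholder list, while get_schedule_blocks(1) exceeds the timeout and returns None instead of raising.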

I hope to test these improvements during the next testing session (likely 2020-05-28) as I have been unable to replicate the error with the development system.

danielczech commented 4 years ago

Following the testing session, it appears that explicitly extending the timeout duration has prevented this error from occurring for the schedule_blocks sensor. However, we observed the same error for a different run_sync call, so it seems likely that these measures will be needed for every run_sync call.
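
If the same measures are applied to every call, one option is a small shared helper rather than repeating the try block at each call site. The sketch below is only an illustration under the same assumption as before (asyncio standing in for the actual run_sync calls); run_sync_safely, the logger name, and the two demo sensor coroutines are hypothetical.

```python
import asyncio
import logging

log = logging.getLogger("katportal_timeouts")  # hypothetical logger name


def run_sync_safely(name, coro_func, timeout, default=None):
    """Run coro_func() to completion; return default if it times out."""

    async def bounded():
        # Apply the same explicit timeout to whichever call is passed in.
        return await asyncio.wait_for(coro_func(), timeout=timeout)

    try:
        return asyncio.run(bounded())
    except asyncio.TimeoutError:
        # Record which call timed out so it can be investigated later.
        log.warning("%s timed out after %.2f s; using default", name, timeout)
        return default


async def quick_sensor():
    """Hypothetical sensor request that answers immediately."""
    await asyncio.sleep(0)
    return "ok"


async def slow_sensor():
    """Hypothetical sensor request that hangs longer than the timeout."""
    await asyncio.sleep(1)
    return "never returned"
```

With this helper, run_sync_safely("quick", quick_sensor, 0.05) returns the sensor value, and run_sync_safely("slow", slow_sensor, 0.05, default="unknown") logs a warning and returns the fallback. Passing the name explicitly keeps the log message useful when several different calls share the wrapper.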

UCBerkeleySETI / meerkat-backend-interface

Need to handle KATPortalClient timeout errors better #14