GNiklasch commented 1 year ago

Three times today the cloud-deployed web app ceased functioning:

- the browser got an "Oh no!" error page,
- the log indicated readiness probe failures, usually looking like

[12:49:52] ! Streamlit server consistently failed status checks
[12:49:52] ! Please fix the errors, push an update to the git repo, or reboot the app.

but on one occasion preceded by

[13:43:43] ! The service has encountered an error while checking the health of the Streamlit app: Get "http://localhost:8501/healthz": read tcp 10.12.171.54:46114->10.12.171.54:8501: read: connection reset by peer

- but no indication of having hit resource limits,
- and pushing an update went through successfully (as confirmed by a log entry) but did not revive the deployment. (Only a reboot did, which rebuilds the "VM" from the ground up, pipenv dependency installation and all.)

In other words, the web server component on port 8501 had repeatedly died in mid-operation, and we do not know what (if anything) had happened behind it in the streamlit Python process and in the application script.

More investigation needed.
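
As a possible starting point for that investigation, here is a minimal sketch of a probe against the same health endpoint that appears in the log above. The function name `probe_health` and the timeout value are assumptions made for illustration; this is not how the hosting platform implements its own status checks, and whether it can be run next to the deployed app is untested.

```python
import urllib.error
import urllib.request


def probe_health(url="http://localhost:8501/healthz", timeout=2.0):
    """Return (ok, detail) for a single GET against a health endpoint.

    The default URL is the one quoted in the log; everything else
    (function name, timeout) is a hypothetical choice for this sketch.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200, f"HTTP {response.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status.
        return False, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # Covers refused connections, connection resets, and timeouts.
        return False, f"{type(exc).__name__}: {exc}"


if __name__ == "__main__":
    ok, detail = probe_health()
    print("healthy" if ok else "unhealthy", "-", detail)
```

Calling this periodically and logging the result with a timestamp might help tell apart "the web server stopped listening" from "the web server is up but resets connections", which the platform log alone does not distinguish.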

GNiklasch commented 1 year ago

#13 may largely explain the observed crashes; additional robustness measures may still be in order.

GNiklasch commented 1 year ago

There have been no further incidents of this kind since the fix for #13 was rolled out.

GNiklasch / GWO-glitch-visualization

Stabilizing the cloud deployment #12
