Open aarongundel opened 1 year ago
Endpoints for Kubernetes liveness, readiness, and startup probes should be provided (/ready, /health).
It would also help to have something that indicates Celery's status (whether it is running), plus web-based access to the Celery log so you can see what it's doing.
@aj-he I'm looking at working on this soon. I like the two endpoint approach, especially since Arches depends on so many external services. The liveness endpoint will tell you if Arches is running and the ready endpoint will make sure Arches can connect to all of its dependent services.
Here's a list of the external services that I'll check as part of this:
- Postgres
- Elasticsearch
- File Storage (including external file storage)
- Cantaloupe (if configured)
- Redis/Memcached (if configured)
- External Authentication (if configured)
- Celery Broker (if configured)
- Celery Workers (if configured on local machine)
If you can think of more things that might be checked as part of this, let me know. Perhaps some of these should be optional (as in, Arches still shows as ready if an optional component doesn't work).
If you've got Arches deployed using Kubernetes right now, I'd love to get your thoughts on what might also help HE here - or at the very least to leave these endpoints open for extension.
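To make the two-endpoint idea concrete, here is a minimal, framework-free sketch of how liveness and readiness could differ, including optional components that do not block readiness. It is not the Arches implementation; `Check`, `liveness`, `readiness`, the status strings, and the response shape are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Check:
    """One dependency check (hypothetical structure, not an Arches API)."""
    name: str
    probe: Callable[[], None]  # should raise an exception on failure
    required: bool = True      # optional checks never block readiness


def liveness() -> Tuple[int, Dict]:
    # Liveness only says the process can answer; no dependencies are touched.
    return 200, {"status": "alive"}


def readiness(checks: List[Check]) -> Tuple[int, Dict]:
    # Readiness runs every check and fails only if a required one fails.
    results: Dict[str, str] = {}
    ready = True
    for check in checks:
        try:
            check.probe()
            results[check.name] = "ok"
        except Exception as exc:
            results[check.name] = f"failed: {exc}"
            if check.required:
                ready = False
    status_code = 200 if ready else 503
    body = {"status": "ready" if ready else "not ready", "checks": results}
    return status_code, body
```

With this shape, a failing optional service such as Cantaloupe would still be reported in the response body, while the HTTP status stays 200 so the instance keeps receiving traffic.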
@aarongundel It would be nice if this was designed in a very modular way allowing developers to write components for services they want to monitor that are not part of the core Arches stack. I think Cantaloupe is a good example because it's technically not a requirement even for AFS.
@aarongundel, I agree with @chiatt as we also use other services outside of core services. Don't forget RabbitMQ.
We don't have a pure Kubernetes deployment, but we do have a prototype Azure Container App service (Arches v6 only for now, though we are looking at v7), which is a managed Kubernetes service underneath, and I'm keen on getting some health checks to support that.
It would be nice to have a health check and status page for Arches. Ideally, a health check URL would report the health of the Arches application so that an automated process could check the URL and (if necessary) remove unhealthy instances from a pool of Arches instances. A status page would be a user-friendly page that shows the status of the system and its dependencies, for example the database, Elasticsearch, Cantaloupe, etc. This would help users with high-level diagnostics if there are problems with an Arches install.
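The two uses described above could be sketched roughly as follows: one function for an automated process that filters a pool by polling a health URL, and one that renders a plain-text status page. Everything here is an assumption for illustration: `filter_healthy`, `render_status_page`, and the injected `get_status` callable are hypothetical, and the `/health` path is taken from the endpoint names suggested earlier in this thread.

```python
from typing import Callable, Dict, Iterable, List


def filter_healthy(instances: Iterable[str],
                   get_status: Callable[[str], int]) -> List[str]:
    """Keep only instances whose health URL answers with HTTP 200.

    `get_status` is injected (hypothetical): it takes a URL and returns
    the HTTP status code, raising OSError if the host is unreachable.
    """
    healthy: List[str] = []
    for base_url in instances:
        try:
            if get_status(base_url + "/health") == 200:
                healthy.append(base_url)
        except OSError:
            # An unreachable instance is treated as unhealthy.
            pass
    return healthy


def render_status_page(results: Dict[str, bool]) -> str:
    """Render dependency results as a simple human-readable page."""
    lines = ["Arches system status"]
    for name, ok in sorted(results.items()):
        lines.append(f"{name}: {'OK' if ok else 'DOWN'}")
    return "\n".join(lines)
```

Keeping the machine-readable health URL separate from the human-readable status page means the automated check can stay cheap while the page carries the diagnostic detail.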