bcgov / traction

Traction is designed with an API-first architecture layered on top of Hyperledger Aries Cloud Agent Python (ACA-Py) and streamlines the process of sending and receiving digital credentials for governments and organizations.
https://digital.gov.bc.ca/digital-trust/tools/traction/
Apache License 2.0

Define uptime endpoints for Traction deployment #979

Closed loneil closed 8 months ago

loneil commented 8 months ago

As a Traction operator, I want to know which endpoints to query for things like uptime monitors, so that I can be alerted if a piece of the architecture is not reachable from outside traffic.

Three parts to monitor, IMO: ACA-Py, the Traction NGINX proxy, and the Tenant UI.

ACA-Py admin API

Live

Ready

Traction Proxy

The Traction proxy just forwards all requests to Traction, adding the hidden API key where appropriate (not relevant to the open live and ready endpoints). So I think we can just check the same ACA-Py endpoints at the proxy's URL to see whether the proxy is up.

Live

Ready

Tenant UI

The live and ready checks on OCP are just at / right now. We could add a separate /ready route or something similar in the Node app, but for now I would just use the below, as they are already live and don't need a change and promotion through the environments.
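The ACA-Py and proxy checks described above could be sketched roughly as follows. This is a minimal illustration, not the monitoring actually used: the paths `/status/live` and `/status/ready`, the JSON flags `alive` and `ready`, and the helper names `probe` and `http_fetch` are all assumptions, since the original links in this issue were not preserved. The same function is pointed at either the ACA-Py admin URL or the proxy URL.

```python
import json
from typing import Callable, Tuple
from urllib.request import urlopen

# Assumed paths; the endpoint links in the issue were not preserved.
LIVE_PATH = "/status/live"
READY_PATH = "/status/ready"

# A fetcher takes a URL and returns (HTTP status, raw body).
Fetch = Callable[[str], Tuple[int, bytes]]


def http_fetch(url: str, timeout: float = 5.0) -> Tuple[int, bytes]:
    """Fetch a URL with the standard library; raises OSError subclasses on failure."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read()


def probe(base_url: str, path: str, flag: str, fetch: Fetch = http_fetch) -> bool:
    """Return True if base_url + path answers 200 with a JSON object where `flag` is true.

    The expected response shape (e.g. {"ready": true}) is an assumption.
    """
    try:
        status, body = fetch(base_url.rstrip("/") + path)
        payload = json.loads(body)
    except (OSError, ValueError):
        # Unreachable host, HTTP error status, timeout, or a non-JSON body.
        return False
    return status == 200 and isinstance(payload, dict) and payload.get(flag) is True
```

For example, `probe("https://traction-proxy.example", READY_PATH, "ready")` (a hypothetical URL) would exercise both the proxy and the agent behind it in one request, which is why a separate agent check adds little when the proxy check passes.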

loneil commented 8 months ago

@WadeBarnes if the above is ok we can use these endpoints.

Not sure if there's additional uptime stuff through Crunchy for DB we would want to use or anything, or if that's covered in other monitoring. Maybe a question for @i5okie

WadeBarnes commented 8 months ago

I've added the following checks to our https://ditp.uptime.vonx.io/ and https://ditp.sla.vonx.io/ dashboards: [image]

The proxies and agents are covered by using the proxy endpoint to check the ready endpoint for the agent. We only had enough licenses to add the checks for the sandbox and prod UIs after that. I've requested licenses for more checks.

For now the UI checks just test whether the UI responds with a 200 status code, which will do. If we start running into specific types of UI availability issues, we can start getting fancier.
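A status-code-only UI check like the one described can be sketched as below. This is an illustration under assumptions, not the configuration of the uptime service in use: the function names `get_status` and `ui_checks` and the target names and URLs are hypothetical.

```python
from typing import Callable, Dict
from urllib.request import urlopen


def get_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for a GET; urlopen raises on 4xx/5xx and network errors."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.status


def ui_checks(
    targets: Dict[str, str],
    get: Callable[[str], int] = get_status,
) -> Dict[str, bool]:
    """Map each target name to True (up) if its URL answers with exactly 200."""
    results: Dict[str, bool] = {}
    for name, url in targets.items():
        try:
            results[name] = get(url) == 200
        except OSError:
            # Covers connection failures, timeouts, and HTTP error responses.
            results[name] = False
    return results
```

Calling `ui_checks({"sandbox-ui": "https://sandbox-ui.example/", "prod-ui": "https://prod-ui.example/"})` with real URLs substituted would return one up/down flag per UI. A check this shallow cannot tell a healthy UI from one that serves a 200 error page, which is the kind of issue that would justify a richer check later.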

Up/Down notifications go to me and the ditp-uptime and ditp-uptime-prod RC channels.

I'm calling this done for now.

loneil commented 8 months ago

Great, thanks @WadeBarnes !