bcgov / traction

Traction is designed with an API-first architecture layered on top of Hyperledger Aries Cloud Agent Python (ACA-Py) and streamlines the process of sending and receiving digital credentials for governments and organizations.
https://digital.gov.bc.ca/digital-trust/tools/traction/
Apache License 2.0

Investigation: Acapy status endpoints can hang up during some operations and cause probe failures #466

Closed loneil closed 1 year ago

loneil commented 1 year ago

I noticed that /status/live and /status/ready will hang and not return during certain longer-running operations, like creating a tenant or registering a DID, and even (to a lesser degree) while getting a token.

https://traction-acapy-admin-dev.apps.silver.devops.gov.bc.ca/api/doc#/server/get_status_ready
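For anyone wanting to reproduce this, here is a rough sketch of how the status calls could be timed while a long-running operation (e.g. tenant creation) is in flight. The helper name `time_status_call` and the 30-second default are just assumptions for illustration, not anything from Traction or ACA-Py; it only does a plain HTTP GET against the admin URL you give it.

```python
import time
import urllib.error
import urllib.request


def time_status_call(base_url: str, path: str, timeout_s: float = 30.0) -> tuple[bool, float]:
    """Issue one GET against a status endpoint and return (succeeded, elapsed_seconds).

    `base_url` is whatever admin URL is under test (an assumption of this sketch).
    A timeout, connection error, or HTTP error status counts as a failure.
    """
    url = base_url.rstrip("/") + path
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            ok = 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # HTTPError (4xx/5xx), refused connections, and socket timeouts all land here.
        ok = False
    return ok, time.monotonic() - start
```

Calling `time_status_call(admin_url, "/status/ready")` in a loop while kicking off a tenant creation in another session should show whether the elapsed time climbs past the probe timeout.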


Since those endpoints are what the k8s readiness/liveness probes call, the probes will fail and potentially kill the pod (and crash out ACA-Py) if the hangs occur often enough. This happens more readily in lower-resource environments like PRs, but will occur in others as well.


So it seems important to figure out if this is:

loneil commented 1 year ago

Discussed with @usingtechnology. At the least, we're going with a higher probe timeout like other ACA-Py projects use, since these calls do seem to return; they just hang for extra time while these operations are happening. We can see later whether an ACA-Py ticket is needed.

usingtechnology commented 1 year ago

Talked with @WadeBarnes and confirmed he has previously deployed using a liveness probe timeout of 10 seconds. We had 2 seconds. This timeout/pod restart happened with 3 seconds in the test namespace, so I did a PR that bumped it to 10 - seems to be ok.
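To illustrate why the longer timeout helps, here is a small sketch of the probe logic as I understand it: a probe counts as failed when the response takes longer than the timeout, and the pod is restarted after a run of consecutive failures. The function names and the sample latencies are made up for illustration, and the failure threshold of 3 is assumed from the Kubernetes default rather than taken from our charts.

```python
def longest_failure_run(latencies_s: list[float], timeout_s: float) -> int:
    """Return the longest run of consecutive probes slower than the timeout."""
    longest = run = 0
    for latency in latencies_s:
        if latency > timeout_s:
            run += 1
            longest = max(longest, run)
        else:
            run = 0  # a successful probe resets the consecutive-failure count
    return longest


def would_restart(latencies_s: list[float], timeout_s: float, failure_threshold: int = 3) -> bool:
    """True if a liveness probe with these settings would trigger a restart."""
    return longest_failure_run(latencies_s, timeout_s) >= failure_threshold


# Hypothetical response times (seconds) for /status/live during a long operation.
sample = [0.1, 3.5, 4.0, 3.2, 0.2]
print(would_restart(sample, timeout_s=2))   # True
print(would_restart(sample, timeout_s=3))   # True
print(would_restart(sample, timeout_s=10))  # False
```

With these invented numbers, both the 2 and 3 second timeouts see three failures in a row, while the 10 second timeout never fails.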

usingtechnology commented 1 year ago

see PR 473