getodk / central-backend

Node.js based backend for ODK Central
https://docs.getodk.org/central-intro/
Apache License 2.0
50 stars 73 forks source link

Retry worker in case of transient failure #692

Open sadiqkhoja opened 1 year ago

sadiqkhoja commented 1 year ago

Related to https://github.com/getodk/central-backend/pull/685

Entity worker is currently catching all the errors and logging them in audit table. I think in case of transient errors (common in cloud environments) we should try to reprocess the event.

I have written a test to emulate database failure, see https://github.com/sadiqkhoja/central-backend/commit/36cd848d507276eb8059de5c3e45e8d61ac24b48 where Entities.createNew throws Database is down error and we are marking original event as processed and logging error details in audit table.

matthew-white commented 1 year ago

I think @ktuite convinced me that the error would have to be very transient for this to be an issue. If the database is down, or the connection pool is full, or something similar, then it seems very unlikely that Entities.createNew() would fail, yet the Audits.log() logging entity.create.error would be successful. I think it will be much more common for either both to succeed or neither to succeed. (And if neither succeeds, then the worker will try again.) A very transient error is a possibility, but I'm not sure it's worth us circling back to the worker. What do you think? Do you think we should try to be resilient to that possibility? Is the possibility more likely than I'm imagining?

sadiqkhoja commented 1 year ago

I agree that it is highly likely that Entities.createNew() and Audit.log() will succeed or fail together. However, if both fail due to transient error then the event will be reprocessed after 2 hours, which might cause out of order event processing.

One safety net that we have against transient errors is the default retry mechanism of Slonik, if I understand correctly, it retries 5 times before throwing the error.

Having said that I don't think this issue should be high risk item, we can observe how things play out during QA testing. I just wanted to log this just in case we have to revisit it.