CDLUC3 / ezid


[TEST] Dual Write (DB and OpenSearch) Failure Alert, Transaction Rollback, and Re-queued Update #684

Closed: adambuttrick closed this issue 2 weeks ago

adambuttrick commented 2 months ago

See ticket https://github.com/CDLUC3/ezid/issues/696 for information about queued task error handling. There are about 5 other tasks that use the same pattern as the OpenSearch task: they log failures to the database but do not retry or notify.


Describe the functionality to be tested

We need to verify that the dual write process (DB and OpenSearch) fails gracefully and maintains data consistency when an error occurs during the update transaction, as described in #640.

Describe the test scenario

In the test environment for OpenSearch:

  1. Create a sample identifier with associated metadata.
  2. Trigger an update event for the identifier.
  3. Simulate various types of failures in updating the database and OpenSearch index (one way to simulate such a failure is sketched after this list).
  4. Verify alert is received.
  5. Verify daemon re-queues failed task.
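
One way to simulate the failure in step 3 is to patch the OpenSearch client so that its index call raises, then assert that the database write was rolled back. This is only a sketch: `index_identifier` stands in for whatever EZID code path performs the dual write (a pattern like the one sketched after the expected outcomes below), and the import paths are assumptions, not EZID's actual modules.

```python
# Hypothetical test sketch; index_identifier, ezid_indexing, and the ezidapp
# import path are assumptions, not EZID's actual code.
from unittest import mock

import pytest
from opensearchpy import OpenSearch
from opensearchpy.exceptions import AuthorizationException


@pytest.mark.django_db
def test_opensearch_failure_rolls_back_db_write():
    from ezidapp.models import SearchIdentifier  # assumed import path
    from ezid_indexing import index_identifier  # hypothetical dual-write helper

    # Make every OpenSearch index call fail, as a misconfigured or
    # unreachable cluster would.
    with mock.patch.object(
        OpenSearch, "index", side_effect=AuthorizationException(403, "forbidden")
    ):
        with pytest.raises(AuthorizationException):
            index_identifier("ark:/99999/fk4test", {"title": "sample"})

    # The database row should not exist: the failed OpenSearch write must
    # roll back the database write made in the same transaction.
    assert not SearchIdentifier.objects.filter(identifier="ark:/99999/fk4test").exists()
```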

Expected outcome

  1. The update process should detect the simulated failure.
  2. The transaction should be rolled back, reverting both the database and OpenSearch index to their previous states.
  3. An exception should be raised and logged.
  4. The failed update should be added back to the queue for retry.
  5. The database and OpenSearch index remain in sync and consistent with each other after the failed update is processed (the dual-write pattern this implies is sketched below).
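
For reference, a minimal sketch of the dual-write pattern these outcomes imply, using Django's transaction handling so that both writes succeed or neither does. The model, client configuration, and index name are assumptions for illustration, not EZID's actual implementation:

```python
# Illustrative sketch only; SearchIdentifier, the client configuration, and
# the index name are assumptions, not EZID's actual implementation.
from django.db import transaction
from opensearchpy import OpenSearch

from ezidapp.models import SearchIdentifier  # assumed import path

OS_CLIENT = OpenSearch(hosts=["https://localhost:9200"])  # assumed endpoint


def index_identifier(identifier: str, metadata: dict) -> None:
    """Write identifier metadata to the database and OpenSearch atomically.

    If the OpenSearch call raises (403, connection error, etc.), the exception
    aborts the surrounding transaction, the database row is never committed,
    and the two stores stay consistent.
    """
    with transaction.atomic():
        SearchIdentifier.objects.update_or_create(
            identifier=identifier, defaults={"metadata": metadata}
        )
        # Any failure here propagates and rolls back the database write above.
        OS_CLIENT.index(index="ezid-identifiers", id=identifier, body=metadata)
```

Doing the OpenSearch write inside the transaction means a failed index write also discards the database write; the failed item is then left on the queue for a later retry.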

Who would benefit from this test?

Devs responsible for maintaining the integrity of the search index.

adambuttrick commented 2 months ago

Part of https://github.com/CDLUC3/ezid/issues/653

sfisher commented 1 month ago

I have tested this out and added some specific functionality to do retries (which the other task queues aren't really doing). I think a more general solution for EZID's handling of these task queues is in order, so that we know when failures happen and have a more robust retry mechanism. The mechanism I added retries every 5 minutes for up to a day after a failure, so it should handle most reasonable outages, maintenance windows, or other interruptions that are not extended or catastrophic.
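
Roughly, the retry window works like this (a sketch matching the behavior described above, not the actual proc-search-indexer code; the field names and constants are assumptions):

```python
# Sketch of the 5-minute / 24-hour retry window described above; field names
# and constants are assumptions, not the actual daemon code.
import time
from typing import Optional

RETRY_INTERVAL = 5 * 60        # retry a failed queue item every 5 minutes
GIVE_UP_AFTER = 24 * 60 * 60   # stop retrying 24 hours after the original enqueueTime


def should_retry(enqueue_time: float, last_attempt: float, now: Optional[float] = None) -> bool:
    """Return True if a failed queue item is due for another attempt."""
    now = time.time() if now is None else now
    if now - enqueue_time > GIVE_UP_AFTER:
        return False  # treated as a permanent failure; stop retrying
    return now - last_attempt >= RETRY_INTERVAL
```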

How I tested the atomic functionality and created an error:

  1. I went into the file settings/settings.py and misconfigured the connection to OpenSearch by putting in incorrect authentication.
  2. Started the daemon that queues and updates the search information: python manage.py proc-search-indexer
  3. Added a new ARK with metadata.
  4. Examined the following database tables and the OpenSearch index for the dev environment (a Django-shell sketch of these checks appears after this list).
    • Check the table ezidapp_searchindexerqueue for the latest items in the queue; there will be an entry for the update with a 403 error recorded in its error message.
    • Check the table ezidapp_searchidentifier. The record does not exist in the database, so the write is handled atomically (both writes succeed or both fail) rather than creating an inconsistent state.
    • The record will not exist in OpenSearch either, because authentication failed.
  5. Stop the daemon (CTRL-C or kill it).
  6. Fix the authentication for OpenSearch to be correct in the file settings/settings.py.
  7. Restart the daemon (since it must be restarted to read the corrected credentials).
  8. Wait until 5 minutes have passed since the failure, but not more than a day. The daemon will retry every 5 minutes and then stop trying.
    • Retries are not attempted every second, since many anticipated error conditions (OpenSearch being down, network problems, a maintenance outage) may take some time to self-resolve (if they do), and we shouldn't waste resources retrying every second, imo. After 24 hours of retries since the initial enqueueTime, the item is assumed to be a permanent failure and retries stop.
  9. The item should be retried and succeed this time with OpenSearch configured correctly. The entry appears in both OpenSearch and the database.

While this specific failure (a misconfiguration) isn't anticipated to happen in practice, it imitates the kinds of errors we might get when services are down or having network issues (correcting the credentials is similar to the service or network coming back up).
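
For step 4, the database checks can be done from the Django shell (python manage.py shell). The model names below mirror the ezidapp_searchindexerqueue and ezidapp_searchidentifier tables, but the exact model and field names are assumptions, and the ARK is just a placeholder:

```python
# Django-shell sketch of the step 4 checks; model and field names (other than
# enqueueTime) are assumptions inferred from the table names above.
from ezidapp.models import SearchIndexerQueue, SearchIdentifier  # assumed imports

# Latest queue entries: after the failed update there should be an item whose
# error message records the 403 returned by OpenSearch.
for item in SearchIndexerQueue.objects.order_by("-enqueueTime")[:5]:
    print(item.enqueueTime, item.error)

# While the failure persists, the identifier should have no row here, because
# the database write is rolled back together with the failed OpenSearch write.
print(SearchIdentifier.objects.filter(identifier="ark:/99999/fk4example").exists())
```
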
adambuttrick commented 2 weeks ago

Released with https://github.com/CDLUC3/ezid/releases/tag/v3.2.19