CDLUC3 / ezid

CDLUC3 ezid
MIT License
11 stars 4 forks source link

[FEATURE] MySQL-OpenSearch Data Sync Monitoring #640

Open adambuttrick opened 1 month ago

adambuttrick commented 1 month ago

Problem Statement

Our plan for implementing OpenSearch is that both the MySQL database and OpenSearch index will be populated on record creation and update. With this, there is a risk of data inconsistency due to any number of factors, e.g. network issues, indexing failures, or data corruption. This could lead to data loss in the OpenSearch index and thus the inaccessibility of records in the UI or errors in reporting. A monitoring system is needed to detect and alert on any data synchronization issues in a timely manner.

Goals and Objectives

User Stories

  1. As an EZID admin or developer, I want to be notified when there is a data inconsistency between the MySQL database and the OpenSearch index, so that I can investigate and resolve the issue promptly.

  2. As an end-user, I want to return correct, up-to-date results and reports, so that I can find and modify the records I need without delay.

Technical Requirements

  1. Develop a monitoring script that periodically (e.g. daily) compares the records in the MySQL database with the corresponding documents in the OpenSearch index.

  2. Generate an alert (e.g., email, Slack message) when a data inconsistency is detected, including details such as the affected record(s), the nature of the discrepancy, and the timestamp of detection.

  3. Provide a manual process to reconcile data differences, such as triggering a re-indexing of the MySQL database to OpenSearch for the affected records.

  4. Store a log of the data inconsistencies detected for reporting and troubleshooting purposes.

  5. Ensure the monitoring and reindexing processes can occur without significant performance impact on the production environment.

Success Metrics

  1. All records are verifiably in sync between the MySQL database and the OpenSearch index.

  2. Detect and generate alerts for any data inconsistencies on the specified schedule.

  3. A manual process is available to trigger a re-indexing the MySQL database to OpenSearch for the affected records.

  4. The monitoring and reindexing processes do not significantly degrade the performance of the production environment.

sfisher commented 1 month ago

I believe this addresses most of these issues and I hope the sync situation is only a temporary state and we can remove most of the SearchIdentifier information not too long after we release application searching with OpenSearch.

The ezid-docs-internal details how some of this can work in the docs/ezid_search_opensearch.md file.

The basics of this are that currently the SearchIdentifier is only updated in one place by queued events and a daemon (this was already happening). I've modified this to be within a database transaction so if either update fails then it should revert both to the same previous state. I believe the queue retries items and a failure to update raises an exception which is logged and (which I believe we are be notified of).

I added additional options to update only recently-changed identifiers in the opensearch-update management script (after a certain date) if we need to intervene manuall. When it's not indexing 30 million records and only a few weeks or months then these updates happen quickly if we need to make manual interventions.