CDLUC3 / ezid

MIT License

[EPIC] Redesign Queue System for Improved Error Handling, Retry Logic, and Monitoring #707

Open adambuttrick opened 3 weeks ago

adambuttrick commented 3 weeks ago

Background

As described in https://github.com/CDLUC3/ezid/issues/696, the current EZID queue system relies on daemons and background service scripts to execute tasks asynchronously. While functional, the system lacks robust error handling and retry mechanisms. As a result, registrations can fail permanently without being logged and without any notification, and the failures are not surfaced to end users in the UI or in reports.

Objective

Redesign the queue system to add retry logic and error logging, with notifications driven by the logged errors. Update the UI and reports so that task failures are visible to end users.

Features

1. Retry Mechanism

2. Error Logging

3. Queue Health Monitoring

4. UI and Reporting Changes

Success Criteria

Dependencies

adambuttrick commented 2 weeks ago

Sketch of current task queue design:

flowchart TD
    Start([Start]) --> Queue[Task added to Queue]
    Queue --> AsyncProcess[Async Processing Daemon]
    AsyncProcess --> CheckStatus{Check Status}
    CheckStatus -->|UNSUBMITTED| Process[Process Task]
    CheckStatus -->|UNCHECKED| Verify[Verify Submission]
    Process --> Operation{Operation Type}
    Operation -->|Create| CreateOp[Create Operation]
    Operation -->|Update| UpdateOp[Update Operation]
    Operation -->|Delete| DeleteOp[Delete Operation]
    CreateOp --> AttemptSubmit[Attempt to Submit]
    UpdateOp --> AttemptSubmit
    DeleteOp --> AttemptSubmit
    AttemptSubmit --> SubmitResult{Submission Result}
    SubmitResult -->|Success| MarkSubmitted[Mark as SUBMITTED]
    SubmitResult -->|Ignored| MarkIgnored[Mark as IGNORED]
    SubmitResult -->|Failure| HandleFailure{Failure Type}
    HandleFailure -->|Temporary| MarkTransientFailure[Mark as TRANSIENT_FAILURE]
    HandleFailure -->|Permanent| MarkFailure[Mark as FAILURE]
    MarkSubmitted --> Verify
    Verify --> VerifyResult{Verify Result}
    VerifyResult -->|Success| MarkSuccess[Mark as SUCCESS]
    VerifyResult -->|Warning| MarkWarning[Mark as WARNING]
    VerifyResult -->|Failure| HandleFailure
    MarkSuccess --> UpdateStatus[Update Task Status]
    MarkWarning --> UpdateStatus
    MarkIgnored --> UpdateStatus
    MarkTransientFailure --> UpdateStatus
    MarkFailure --> UpdateStatus
    UpdateStatus --> NextTask[Move to Next Task]
    NextTask --> CheckStatus
    CheckStatus -->|All Processed| Sleep[Sleep]
    Sleep --> AsyncProcess

    subgraph QueueTypes[Queue Types]
        BinderQueue[Binder Queue]
        CrossrefQueue[Crossref Queue]
        DataciteQueue[Datacite Queue]
        SearchIndexerQueue[Search Indexer Queue]
    end
    Queue --> QueueTypes

    subgraph StatusTypes[Status Types]
        UNSUBMITTED[UNSUBMITTED]
        UNCHECKED[UNCHECKED]
        SUBMITTED[SUBMITTED]
        WARNING[WARNING]
        FAILURE[FAILURE]
        TRANSIENT_FAILURE[TRANSIENT_FAILURE]
        IGNORED[IGNORED]
        SUCCESS[SUCCESS]
    end
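
To make the transitions in the sketch above easier to check, here is a minimal Python rendering of them. It is only a reading of the flowchart, not EZID's actual code: the function names (`after_submit`, `after_verify`, `is_picked_up`) and the string result values are hypothetical, and only the status names come from the diagram.

```python
from enum import Enum


class Status(Enum):
    # Status names taken from the "Status Types" box in the flowchart.
    UNSUBMITTED = "UNSUBMITTED"
    UNCHECKED = "UNCHECKED"
    SUBMITTED = "SUBMITTED"
    WARNING = "WARNING"
    FAILURE = "FAILURE"
    TRANSIENT_FAILURE = "TRANSIENT_FAILURE"
    IGNORED = "IGNORED"
    SUCCESS = "SUCCESS"


# As drawn, "Check Status" only routes these two statuses to further work.
PICKED_UP = frozenset({Status.UNSUBMITTED, Status.UNCHECKED})


def classify_failure(temporary: bool) -> Status:
    """The 'Failure Type' decision: temporary vs. permanent."""
    return Status.TRANSIENT_FAILURE if temporary else Status.FAILURE


def after_submit(result: str, temporary: bool = False) -> Status:
    """The 'Submission Result' decision (result strings are hypothetical)."""
    if result == "success":
        return Status.SUBMITTED
    if result == "ignored":
        return Status.IGNORED
    if result == "failure":
        return classify_failure(temporary)
    raise ValueError(f"unknown submission result: {result!r}")


def after_verify(result: str, temporary: bool = False) -> Status:
    """The 'Verify Result' decision (result strings are hypothetical)."""
    if result == "success":
        return Status.SUCCESS
    if result == "warning":
        return Status.WARNING
    if result == "failure":
        return classify_failure(temporary)
    raise ValueError(f"unknown verify result: {result!r}")


def is_picked_up(status: Status) -> bool:
    """True if the daemon, as drawn, would act on a task in this status."""
    return status in PICKED_UP
```

Under this reading, `is_picked_up(Status.TRANSIENT_FAILURE)` is `False`: nothing in the flowchart routes a task marked TRANSIENT_FAILURE back to processing, which is the gap the redesign is meant to close.
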

adambuttrick commented 2 weeks ago

Possible redesign:

flowchart TD
    Start([Start]) --> Queue[Task added to Queue]
    Queue --> AsyncProcess[Async Processing Daemon]
    AsyncProcess --> CheckStatus{Check Status}
    CheckStatus -->|UNSUBMITTED or Retry| Process[Process Task]
    CheckStatus -->|UNCHECKED| Verify[Verify Submission]
    Process --> Operation{Operation Type}
    Operation -->|Create| CreateOp[Create Operation]
    Operation -->|Update| UpdateOp[Update Operation]
    Operation -->|Delete| DeleteOp[Delete Operation]
    CreateOp --> AttemptSubmit[Attempt to Submit]
    UpdateOp --> AttemptSubmit
    DeleteOp --> AttemptSubmit
    AttemptSubmit --> SubmitResult{Submission Result}
    SubmitResult -->|Success| MarkSubmitted[Mark as SUBMITTED]
    SubmitResult -->|Ignored| MarkIgnored[Mark as IGNORED]
    SubmitResult -->|Failure| HandleFailure{Failure Type}
    HandleFailure -->|Temporary| RetryMechanism[Retry Mechanism]
    HandleFailure -->|Permanent| MarkFailure[Mark as FAILURE]
    MarkSubmitted --> Verify
    Verify --> VerifyResult{Verify Result}
    VerifyResult -->|Success| MarkSuccess[Mark as SUCCESS]
    VerifyResult -->|Warning| MarkWarning[Mark as WARNING]
    VerifyResult -->|Failure| HandleFailure
    CheckStatus -->|All Processed| Sleep[Sleep]
    Sleep --> AsyncProcess

    subgraph RetryMechanism [Retry Mechanism]
        CheckRetryCount{Retry Count < Max}
        CheckRetryCount -->|Yes| ScheduleRetry[Schedule Retry]
        CheckRetryCount -->|No| MarkMaxRetriesReached[Mark Max Retries Reached]
        ScheduleRetry --> MarkTransientFailure[Mark as TRANSIENT_FAILURE]
    end

    subgraph Logging [Logging]
        LogSuccess[Log Success]
        LogWarning[Log Warning]
        LogIgnored[Log Ignored]
        LogRetryAttempt[Log Retry Attempt]
        LogFailure[Log Failure]
        LogMaxRetriesReached[Log Max Retries Reached]
    end

    subgraph StatusUpdate [Status Update]
        UpdateStatus[Update Task Status]
        NextTask[Move to Next Task]
    end

    MarkSuccess --> LogSuccess
    MarkWarning --> LogWarning
    MarkIgnored --> LogIgnored
    MarkTransientFailure --> LogRetryAttempt
    MarkFailure --> LogFailure
    MarkMaxRetriesReached --> LogMaxRetriesReached

    LogSuccess --> UpdateStatus
    LogWarning --> UpdateStatus
    LogIgnored --> UpdateStatus
    LogRetryAttempt --> UpdateStatus
    LogFailure --> UpdateStatus
    LogMaxRetriesReached --> UpdateStatus

    UpdateStatus --> NextTask
    NextTask --> CheckStatus

    subgraph QueueTypes [Queue Types]
        BinderQueue[Binder Queue]
        CrossrefQueue[Crossref Queue]
        DataciteQueue[Datacite Queue]
        SearchIndexerQueue[Search Indexer Queue]
    end
    Queue --> QueueTypes

    subgraph StatusTypes [Status Types]
        UNSUBMITTED[UNSUBMITTED]
        UNCHECKED[UNCHECKED]
        SUBMITTED[SUBMITTED]
        WARNING[WARNING]
        FAILURE[FAILURE]
        TRANSIENT_FAILURE[TRANSIENT_FAILURE]
        IGNORED[IGNORED]
        SUCCESS[SUCCESS]
    end
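
The "Retry Mechanism" and "Logging" boxes above could be sketched roughly as follows. This is an illustration under stated assumptions, not a proposed implementation: the `Task` fields, `MAX_RETRIES`, `BASE_DELAY_SECONDS`, the exponential backoff schedule, and the `MAX_RETRIES_REACHED` status name are all hypothetical. The flowchart only says "Schedule Retry" and "Mark Max Retries Reached" and does not define a delay policy or a new status.

```python
import logging
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

log = logging.getLogger("queue.retry")

MAX_RETRIES = 5           # assumption: not specified in the issue
BASE_DELAY_SECONDS = 60   # assumption: not specified in the issue


@dataclass
class Task:
    # Hypothetical task record; field names are illustrative only.
    identifier: str
    status: str = "UNSUBMITTED"
    retry_count: int = 0
    next_attempt_at: Optional[datetime] = None


def handle_failure(task: Task, temporary: bool,
                   now: Optional[datetime] = None) -> Task:
    """Apply the 'Failure Type' and 'Retry Count < Max' decisions to a task."""
    now = now or datetime.now(timezone.utc)
    if not temporary:
        task.status = "FAILURE"
        log.error("permanent failure for %s", task.identifier)
        return task
    if task.retry_count < MAX_RETRIES:
        task.retry_count += 1
        # Exponential backoff is an assumption: 60s, 120s, 240s, ...
        delay = BASE_DELAY_SECONDS * 2 ** (task.retry_count - 1)
        task.next_attempt_at = now + timedelta(seconds=delay)
        task.status = "TRANSIENT_FAILURE"
        log.warning("retry %d/%d for %s scheduled at %s", task.retry_count,
                    MAX_RETRIES, task.identifier, task.next_attempt_at)
    else:
        # Hypothetical status name; the diagram does not add it to Status Types.
        task.status = "MAX_RETRIES_REACHED"
        task.next_attempt_at = None
        log.error("max retries reached for %s", task.identifier)
    return task


def is_due(task: Task, now: datetime) -> bool:
    """The 'UNSUBMITTED or Retry' branch: should the daemon process this task?"""
    if task.status == "UNSUBMITTED":
        return True
    return (task.status == "TRANSIENT_FAILURE"
            and task.next_attempt_at is not None
            and task.next_attempt_at <= now)
```

Storing `retry_count` and `next_attempt_at` on the task itself would let the daemon's existing polling loop decide what is due, without a separate scheduler. Whether that fits the current queue tables is an open question.
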