Open adambuttrick opened 3 weeks ago
Sketch of current task queue design:
```mermaid
flowchart TD
    Start([Start]) --> Queue[Task added to Queue]
    Queue --> AsyncProcess[Async Processing Daemon]
    AsyncProcess --> CheckStatus{Check Status}
    CheckStatus -->|UNSUBMITTED| Process[Process Task]
    CheckStatus -->|UNCHECKED| Verify[Verify Submission]
    Process --> Operation{Operation Type}
    Operation -->|Create| CreateOp[Create Operation]
    Operation -->|Update| UpdateOp[Update Operation]
    Operation -->|Delete| DeleteOp[Delete Operation]
    CreateOp --> AttemptSubmit[Attempt to Submit]
    UpdateOp --> AttemptSubmit
    DeleteOp --> AttemptSubmit
    AttemptSubmit --> SubmitResult{Submission Result}
    SubmitResult -->|Success| MarkSubmitted[Mark as SUBMITTED]
    SubmitResult -->|Ignored| MarkIgnored[Mark as IGNORED]
    SubmitResult -->|Failure| HandleFailure{Failure Type}
    HandleFailure -->|Temporary| MarkTransientFailure[Mark as TRANSIENT_FAILURE]
    HandleFailure -->|Permanent| MarkFailure[Mark as FAILURE]
    MarkSubmitted --> Verify
    Verify --> VerifyResult{Verify Result}
    VerifyResult -->|Success| MarkSuccess[Mark as SUCCESS]
    VerifyResult -->|Warning| MarkWarning[Mark as WARNING]
    VerifyResult -->|Failure| HandleFailure
    MarkSuccess --> UpdateStatus[Update Task Status]
    MarkWarning --> UpdateStatus
    MarkIgnored --> UpdateStatus
    MarkTransientFailure --> UpdateStatus
    MarkFailure --> UpdateStatus
    UpdateStatus --> NextTask[Move to Next Task]
    NextTask --> CheckStatus
    CheckStatus -->|All Processed| Sleep[Sleep]
    Sleep --> AsyncProcess
    subgraph QueueTypes[Queue Types]
        BinderQueue[Binder Queue]
        CrossrefQueue[Crossref Queue]
        DataciteQueue[Datacite Queue]
        SearchIndexerQueue[Search Indexer Queue]
    end
    Queue --> QueueTypes
    subgraph StatusTypes[Status Types]
        UNSUBMITTED[UNSUBMITTED]
        UNCHECKED[UNCHECKED]
        SUBMITTED[SUBMITTED]
        WARNING[WARNING]
        FAILURE[FAILURE]
        TRANSIENT_FAILURE[TRANSIENT_FAILURE]
        IGNORED[IGNORED]
        SUCCESS[SUCCESS]
    end
```
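To make the transitions in the diagram above concrete, here is a minimal Python sketch of the status model. The status names come from the flowchart; the `Status` enum, the outcome strings, and the `next_status` and `is_picked_up` helpers are hypothetical names chosen for illustration and are not taken from the EZID codebase.

```python
from enum import Enum


class Status(Enum):
    # Status names as listed in the "Status Types" subgraph.
    UNSUBMITTED = "UNSUBMITTED"
    UNCHECKED = "UNCHECKED"
    SUBMITTED = "SUBMITTED"
    WARNING = "WARNING"
    FAILURE = "FAILURE"
    TRANSIENT_FAILURE = "TRANSIENT_FAILURE"
    IGNORED = "IGNORED"
    SUCCESS = "SUCCESS"


# Hypothetical outcome labels mirroring the edge labels in the flowchart.
SUBMIT_TRANSITIONS = {
    "success": Status.SUBMITTED,
    "ignored": Status.IGNORED,
    "temporary_failure": Status.TRANSIENT_FAILURE,
    "permanent_failure": Status.FAILURE,
}

VERIFY_TRANSITIONS = {
    "success": Status.SUCCESS,
    "warning": Status.WARNING,
    "temporary_failure": Status.TRANSIENT_FAILURE,
    "permanent_failure": Status.FAILURE,
}


def next_status(current: Status, outcome: str) -> Status:
    """Return the status a task moves to after one processing step."""
    if current is Status.UNSUBMITTED:
        return SUBMIT_TRANSITIONS[outcome]
    if current in (Status.UNCHECKED, Status.SUBMITTED):
        return VERIFY_TRANSITIONS[outcome]
    raise ValueError(f"no outgoing edge drawn for status {current.name}")


def is_picked_up(status: Status) -> bool:
    """True if the 'Check Status' node routes this status to any work."""
    return status in (Status.UNSUBMITTED, Status.UNCHECKED)
```

As drawn, `Check Status` routes only UNSUBMITTED and UNCHECKED tasks, so `is_picked_up(Status.TRANSIENT_FAILURE)` is false: a task marked TRANSIENT_FAILURE has no edge leading back to processing.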
Possible redesign:
```mermaid
flowchart TD
    Start([Start]) --> Queue[Task added to Queue]
    Queue --> AsyncProcess[Async Processing Daemon]
    AsyncProcess --> CheckStatus{Check Status}
    CheckStatus -->|UNSUBMITTED or Retry| Process[Process Task]
    CheckStatus -->|UNCHECKED| Verify[Verify Submission]
    Process --> Operation{Operation Type}
    Operation -->|Create| CreateOp[Create Operation]
    Operation -->|Update| UpdateOp[Update Operation]
    Operation -->|Delete| DeleteOp[Delete Operation]
    CreateOp --> AttemptSubmit[Attempt to Submit]
    UpdateOp --> AttemptSubmit
    DeleteOp --> AttemptSubmit
    AttemptSubmit --> SubmitResult{Submission Result}
    SubmitResult -->|Success| MarkSubmitted[Mark as SUBMITTED]
    SubmitResult -->|Ignored| MarkIgnored[Mark as IGNORED]
    SubmitResult -->|Failure| HandleFailure{Failure Type}
    HandleFailure -->|Temporary| RetryMechanism[Retry Mechanism]
    HandleFailure -->|Permanent| MarkFailure[Mark as FAILURE]
    MarkSubmitted --> Verify
    Verify --> VerifyResult{Verify Result}
    VerifyResult -->|Success| MarkSuccess[Mark as SUCCESS]
    VerifyResult -->|Warning| MarkWarning[Mark as WARNING]
    VerifyResult -->|Failure| HandleFailure
    CheckStatus -->|All Processed| Sleep[Sleep]
    Sleep --> AsyncProcess
    subgraph RetryMechanism [Retry Mechanism]
        CheckRetryCount{Retry Count < Max}
        CheckRetryCount -->|Yes| ScheduleRetry[Schedule Retry]
        CheckRetryCount -->|No| MarkMaxRetriesReached[Mark Max Retries Reached]
        ScheduleRetry --> MarkTransientFailure[Mark as TRANSIENT_FAILURE]
    end
    subgraph Logging [Logging]
        LogSuccess[Log Success]
        LogWarning[Log Warning]
        LogIgnored[Log Ignored]
        LogRetryAttempt[Log Retry Attempt]
        LogFailure[Log Failure]
        LogMaxRetriesReached[Log Max Retries Reached]
    end
    subgraph StatusUpdate [Status Update]
        UpdateStatus[Update Task Status]
        NextTask[Move to Next Task]
    end
    MarkSuccess --> LogSuccess
    MarkWarning --> LogWarning
    MarkIgnored --> LogIgnored
    MarkTransientFailure --> LogRetryAttempt
    MarkFailure --> LogFailure
    MarkMaxRetriesReached --> LogMaxRetriesReached
    LogSuccess --> UpdateStatus
    LogWarning --> UpdateStatus
    LogIgnored --> UpdateStatus
    LogRetryAttempt --> UpdateStatus
    LogFailure --> UpdateStatus
    LogMaxRetriesReached --> UpdateStatus
    UpdateStatus --> NextTask
    NextTask --> CheckStatus
    subgraph QueueTypes [Queue Types]
        BinderQueue[Binder Queue]
        CrossrefQueue[Crossref Queue]
        DataciteQueue[Datacite Queue]
        SearchIndexerQueue[Search Indexer Queue]
    end
    Queue --> QueueTypes
    subgraph StatusTypes [Status Types]
        UNSUBMITTED[UNSUBMITTED]
        UNCHECKED[UNCHECKED]
        SUBMITTED[SUBMITTED]
        WARNING[WARNING]
        FAILURE[FAILURE]
        TRANSIENT_FAILURE[TRANSIENT_FAILURE]
        IGNORED[IGNORED]
        SUCCESS[SUCCESS]
    end
```
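The Retry Mechanism branch of the redesign could look roughly like the Python sketch below. It is only an illustration under stated assumptions: the `Task` dataclass, `MAX_RETRIES`, `BASE_DELAY_SECONDS`, the exponential backoff schedule, and the choice to record "Max Retries Reached" as a FAILURE status plus a flag are all hypothetical. The diagram does not specify a delay policy or which status that node sets.

```python
import logging
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

logger = logging.getLogger("queue.retry")

MAX_RETRIES = 5          # hypothetical limit, not specified in the diagram
BASE_DELAY_SECONDS = 60  # hypothetical base delay for the backoff


@dataclass
class Task:
    task_id: int
    status: str = "UNSUBMITTED"
    retry_count: int = 0
    next_attempt_at: Optional[datetime] = None
    max_retries_reached: bool = False


def handle_temporary_failure(task: Task, now: datetime, error: str) -> Task:
    """Apply the 'Retry Count < Max' decision to a task that failed temporarily."""
    if task.retry_count < MAX_RETRIES:
        # Schedule Retry: the delay doubles with each attempt (assumed policy).
        delay = timedelta(seconds=BASE_DELAY_SECONDS * 2 ** task.retry_count)
        task.retry_count += 1
        task.next_attempt_at = now + delay
        task.status = "TRANSIENT_FAILURE"
        logger.warning("task %s: retry %s scheduled for %s (%s)",
                       task.task_id, task.retry_count, task.next_attempt_at, error)
    else:
        # Mark Max Retries Reached: assumed here to end as FAILURE with a flag.
        task.status = "FAILURE"
        task.max_retries_reached = True
        task.next_attempt_at = None
        logger.error("task %s: max retries reached (%s)", task.task_id, error)
    return task


def is_due(task: Task, now: datetime) -> bool:
    """Mirror the 'UNSUBMITTED or Retry' edge out of Check Status."""
    if task.status == "UNSUBMITTED":
        return True
    return (task.status == "TRANSIENT_FAILURE"
            and task.next_attempt_at is not None
            and task.next_attempt_at <= now)
```

Storing `next_attempt_at` on the task lets the daemon keep its existing poll-and-sleep loop: on each pass it processes only the tasks for which `is_due` is true, rather than needing a separate scheduler.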
Background
As described in https://github.com/CDLUC3/ezid/issues/696, the current EZID queue system relies on daemons and background service scripts to execute tasks asynchronously. While functional, the system lacks robust error handling and retry mechanisms. As a result, registrations can fail permanently without being logged and without any notification, so the failures never reach end users in the UI or in reports.
Objective
Redesign the queue system to implement improved retry logic and error logging, with notifications derived from the logged errors. Update the UI and reports to indicate task failures to end users.
Features
1. Retry Mechanism
2. Error Logging
3. Queue Health Monitoring
4. UI and Reporting Changes
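As one possible starting point for the Queue Health Monitoring feature, the Python sketch below counts tasks per queue and status, then lists the queues with failed tasks. Everything here is an assumption made for illustration: the `TaskRecord` shape, the lowercase queue names, the `ALERT_STATUSES` set, and the threshold are hypothetical and do not reflect how EZID stores its queues.

```python
from collections import Counter
from typing import Dict, Iterable, List, NamedTuple


class TaskRecord(NamedTuple):
    queue: str   # hypothetical queue label, e.g. "crossref" or "datacite"
    status: str  # one of the status names used by the queues


# Statuses assumed to indicate an unhealthy queue.
ALERT_STATUSES = ("FAILURE", "TRANSIENT_FAILURE")


def summarize(tasks: Iterable[TaskRecord]) -> Dict[str, Counter]:
    """Count tasks per queue and per status."""
    counts: Dict[str, Counter] = {}
    for task in tasks:
        counts.setdefault(task.queue, Counter())[task.status] += 1
    return counts


def queues_needing_attention(counts: Dict[str, Counter],
                             threshold: int = 1) -> List[str]:
    """Return the queues whose failed-task count meets the threshold, sorted by name."""
    return sorted(
        queue
        for queue, by_status in counts.items()
        if sum(by_status[s] for s in ALERT_STATUSES) >= threshold
    )
```

The same summary could feed both a notification and the UI or report changes, so that operators and end users see the same failure counts.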
Success Criteria
Dependencies