DataONEorg / d1_cn_index_processor

The CN index processor component

Indexing issues with failed jobs #37

Open doulikecookiedough opened 1 week ago

doulikecookiedough commented 1 week ago

There appears to be an indexing issue that is preventing the queue from moving along. Investigate and resolve.

Additional Context

doulikecookiedough commented 1 week ago

@taojing2002 The diagram for processIndexTaskQueue is split into two parts for easier viewing.

After reviewing the process, it seems that failed tasks could be held up indefinitely (and stuck in their current states, like what we see in Matt's examples) if the try-count for those tasks has already exceeded the max (which appears to be set to 8). There does not appear to be any code that handles further clean-up of tasks that have exceeded the max try-count.
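
To illustrate the gap, here is a minimal sketch of the kind of clean-up that seems to be missing. The class, finder method, and setters below are assumptions for illustration only, not the actual d1_cn_index_processor API:

```java
import java.util.List;

// Hypothetical sketch, not actual d1_cn_index_processor code: one way tasks whose
// try-count has reached maxTryCount (and so are never returned by the
// tryCount < maxTryCount queries) could be swept back into the retry path.
public class ExhaustedTaskSweeper {

    // Minimal stand-ins for the real IndexTask / repository types.
    interface IndexTask {
        String STATUS_NEW = "NEW";
        String STATUS_FAILED = "FAILED";
        void setTryCount(int count);
        void setStatus(String status);
    }

    interface IndexTaskRepository {
        // Assumed finder; the real repository exposes different query methods.
        List<IndexTask> findByStatusAndTryCountGreaterThanEqual(String status, int tryCount);
        IndexTask save(IndexTask task);
    }

    private final IndexTaskRepository repo;
    private final int maxTryCount;  // "dataone.indexing.processing.max.tryCount", default 8

    public ExhaustedTaskSweeper(IndexTaskRepository repo, int maxTryCount) {
        this.repo = repo;
        this.maxTryCount = maxTryCount;
    }

    // Reset exhausted FAILED tasks so they re-enter the normal retry queue
    // instead of sitting in FAILED forever.
    public void sweep() {
        List<IndexTask> exhausted =
                repo.findByStatusAndTryCountGreaterThanEqual(IndexTask.STATUS_FAILED, maxTryCount);
        for (IndexTask task : exhausted) {
            task.setTryCount(0);
            task.setStatus(IndexTask.STATUS_NEW);
            repo.save(task);
        }
    }
}
```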

[image: processIndexTaskQueue flow diagram]

processIndexTaskQueue Flow - Mermaid Syntax

```mermaid
zenuml
    title processIndexTaskQueue Flow
    IndexTaskProcessor.processIndexTaskQueue() {
        // - From settings "dataone.indexing.processing.max.tryCount"
        maxTryCount (default: 8)
        // - Use Spring framework to efficiently get the queue
        queue = IndexTaskRepository.getIndexTaskQueue() {
            // - findByStatusAndTryCountLessThanOrderByPriorityAscTaskModifiedDateAsc ...
            // - New index tasks with less than `maxTryCount` try-count
            //   in the index queue will be processed.
            // - Resource maps will sometimes be set to status NEW
            //   even though the indexing failed
            findAsc(IndexTask.STATUS_NEW, maxTryCount)
        }
        task = getNextIndexTask(queue)
        while (task != null) {
            processTaskOnThread(task)
            task = getNextIndexTask(queue)
        }
        processFailedIndexTaskQueue() {
        }
    }
```
[image: processFailedIndexTaskQueue flow diagram]

processFailedIndexTaskQueue Flow - Mermaid Syntax

```mermaid
zenuml
    title processFailedIndexTaskQueue Flow
    IndexTaskProcessor.processIndexTaskQueue() {
        // - Set up process
        // - Get queue and retrieve tasks
        // - processTaskOnThread
        processFailedIndexTaskQueue() {
            retryQueue = getIndexTaskRetryQueue() {
                // - findByStatusAndNextExecutionLessThanAndTryCountLessThan ...
                // - Failed index tasks with less than `maxTryCount` try-count
                //   in the index queue will be processed.
                findThan(IndexTask.STATUS_FAILED, System.currentTimeMillis(), maxTryCount)
            }
            task = getNextIndexTask(retryQueue)
            while (task != null) {
                processTaskOnThread(task)
                task = getNextIndexTask(queue)
            }
        }
        return
    }
```
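
Condensing the two diagrams above, the overall flow is roughly the following. This is a sketch only; the method names (findAsc, findThan, getNextIndexTask, processTaskOnThread) mirror the diagram labels and may not match the real signatures:

```java
import java.util.List;

// Sketch of the flow shown in the two diagrams above; method names mirror the
// diagram labels and are not the exact d1_cn_index_processor API.
abstract class IndexTaskQueueFlowSketch {

    static final int MAX_TRY_COUNT = 8;  // "dataone.indexing.processing.max.tryCount"

    void processIndexTaskQueue() {
        // Phase 1: NEW tasks with tryCount < maxTryCount,
        // ordered by priority and task-modified date.
        List<Object> queue = findAsc("NEW", MAX_TRY_COUNT);
        Object task = getNextIndexTask(queue);
        while (task != null) {
            processTaskOnThread(task);
            task = getNextIndexTask(queue);
        }
        processFailedIndexTaskQueue();
    }

    void processFailedIndexTaskQueue() {
        // Phase 2: FAILED tasks whose nextExecution time has passed
        // and whose tryCount is still below the maximum.
        List<Object> retryQueue = findThan("FAILED", System.currentTimeMillis(), MAX_TRY_COUNT);
        Object task = getNextIndexTask(retryQueue);
        while (task != null) {
            processTaskOnThread(task);
            task = getNextIndexTask(retryQueue);
        }
        // Tasks whose tryCount has already reached the maximum match neither query,
        // so they are never picked up again.
    }

    // Stand-ins for the repository/processor calls shown in the diagrams.
    abstract List<Object> findAsc(String status, int maxTryCount);
    abstract List<Object> findThan(String status, long nextExecutionBefore, int maxTryCount);
    abstract Object getNextIndexTask(List<Object> queue);
    abstract void processTaskOnThread(Object task);
}
```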
mbjones commented 1 week ago

Thanks for those diagrams, @doulikecookiedough. Could you create a state diagram that shows the possible state transitions among the NEW, IN_PROCESS, FAILED, and COMPLETE statuses? Here's a rough cut as a starting point for you to edit/comment:

```mermaid
stateDiagram-v2
    [*] --> NEW
    NEW --> IN_PROCESS
    IN_PROCESS --> [*]
    IN_PROCESS --> FAILED
    IN_PROCESS --> COMPLETE
    FAILED --> NEW
    COMPLETE --> [*]
    FAILED --> [*]
```
doulikecookiedough commented 1 week ago

@mbjones Please see below for the state diagram.

```mermaid
stateDiagram-v2
    [*] --> NEW: On creation of 'IndexTask' task object
    NEW --> IN_PROCESS: getNextIndexTask
    IN_PROCESS --> NEW: if object path is not ready
    IN_PROCESS --> FAILED: if != InterruptedException
    IN_PROCESS
    FAILED --> NEW: if = InterruptedException
    note right of IN_PROCESS: There seems to be no part where a status is marked as 'COMPLETE'. Once a task has been taken care of, it ends by being removed from the seriesIdSet (ex. seriesIdsSet.remove(id))
```

Notes:

Relevant Classes for Quick Reference:

mbjones commented 1 week ago

Thanks, Dou. That's a helpful discovery about COMPLETE. Here's another question: is there any way for a task to end up in FAILED and stay there forever? I think there is (some tasks FAIL and transition back to NEW; others FAIL and stay there forever). Can you confirm or refute that? If we add the end state of the tasks into the diagram, and add a "DELETED" state to represent tasks that complete successfully and get deleted from the system, then I think the graph would look like the one below. I put in two placeholders for conditions where tasks get permanently stuck in the FAILED or IN_PROCESS state; those can be removed if tasks never get stuck there.

```mermaid
stateDiagram-v2
    [*] --> NEW: On creation of 'IndexTask' task object
    NEW --> IN_PROCESS: getNextIndexTask
    IN_PROCESS --> NEW: if object path is not ready
    IN_PROCESS --> FAILED: if != InterruptedException
    IN_PROCESS --> DELETED: deleted upon successful index
    DELETED --> [*]
    IN_PROCESS --> [*]: when do in process tasks get permanently stuck?
    IN_PROCESS
    FAILED --> NEW: if = InterruptedException
    FAILED --> [*]: when do failed tasks get permanently stuck?
    note right of IN_PROCESS: There seems to be no part where a status is marked as 'COMPLETE'. Once a task has been taken care of, it ends by being removed from the seriesIdSet (ex. seriesIdsSet.remove(id))
```
doulikecookiedough commented 4 days ago

> Is there any way for a task to end up in FAILED and stay there forever?

Yes, a task can end up in a FAILED state forever. When a thread is interrupted (i.e., an InterruptedException is thrown), the task is set back to NEW. For all other exceptions, it is marked as FAILED, and there is no path by which it can transition back to NEW at that point.

A task can also potentially remain IN_PROCESS forever if the process is forcefully killed: the catch blocks won't be executed, and the finally block only manages/cleans up the resource map and series id.

```mermaid
stateDiagram-v2
    [*] --> NEW: On creation of 'IndexTask' task object
    NEW --> IN_PROCESS: getNextIndexTask
    IN_PROCESS --> NEW: if object path is not ready
    IN_PROCESS --> FAILED: if an exception occurs
    IN_PROCESS --> DELETED: deleted upon successful index
    DELETED --> [*]
    IN_PROCESS --> [*]: when the process is forcefully shut down
    IN_PROCESS
    FAILED --> NEW: if = InterruptedException
    FAILED --> [*]: if != InterruptedException
    note right of IN_PROCESS: There seems to be no part where a status is marked as 'COMPLETE'. Once a task has been taken care of, it ends by being removed from the seriesIdSet (ex. seriesIdsSet.remove(id))
```
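
To make the outcomes above concrete, here is a rough sketch of the control flow being described. This is not the actual processTaskOnThread body; repo, seriesIdsSet, and index() are stand-ins for illustration only:

```java
import java.util.Set;

// Rough sketch of the behavior described above, not the real processTaskOnThread code:
// success deletes the task (so there is never a COMPLETE status), InterruptedException
// resets it to NEW, any other exception marks it FAILED, and a hard kill skips the
// catch blocks entirely while the finally block only does series-id bookkeeping.
abstract class TaskOutcomeSketch {

    interface IndexTask {
        String STATUS_NEW = "NEW";
        String STATUS_FAILED = "FAILED";
        String getId();
        int getTryCount();
        void setTryCount(int count);
        void setStatus(String status);
    }

    interface IndexTaskRepository {
        void save(IndexTask task);
        void delete(IndexTask task);
    }

    IndexTaskRepository repo;
    Set<String> seriesIdsSet;

    void processTask(IndexTask task) {
        try {
            index(task);           // the actual indexing work
            repo.delete(task);     // success: the task row is deleted, never marked COMPLETE
        } catch (InterruptedException e) {
            task.setStatus(IndexTask.STATUS_NEW);      // interrupted thread: back to NEW
            repo.save(task);
        } catch (Exception e) {
            task.setStatus(IndexTask.STATUS_FAILED);   // any other exception: FAILED
            task.setTryCount(task.getTryCount() + 1);
            repo.save(task);
        } finally {
            // Only resource-map / series-id bookkeeping happens here, so if the JVM is
            // killed outright the task's status in the database stays IN_PROCESS.
            seriesIdsSet.remove(task.getId());
        }
    }

    abstract void index(IndexTask task) throws Exception;
}
```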
doulikecookiedough commented 4 days ago

I also synced up with Jing: the reason there is no COMPLETE status at the end is that tasks are removed upon completion. If a COMPLETE status were part of the process, the number of retained tasks would also grow extremely large (and not be particularly helpful).

The current index processor also relies on the old Hazelcast process rather than the new dataone-indexer, meaning it still reads tasks from a database, whereas the new process retrieves them from RabbitMQ. To use the new dataone-indexer, the Hazelcast-related code would need to be removed (and the existing code refactored accordingly).
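
For contrast, consumption in the dataone-indexer style looks roughly like this. The sketch uses the standard RabbitMQ Java client; the queue name, message format, and broker location are assumptions, not the actual dataone-indexer configuration:

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;

import java.nio.charset.StandardCharsets;

// Sketch only: push-based consumption of index tasks from RabbitMQ, as opposed to the
// current processor polling the index task table. Queue name and message body are
// hypothetical, not the actual dataone-indexer configuration.
public class IndexQueueConsumerSketch {

    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");                      // assumed broker location
        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();

        String queueName = "index.tasks";                  // hypothetical queue name
        channel.queueDeclare(queueName, true, false, false, null);

        DeliverCallback callback = (consumerTag, delivery) -> {
            String pid = new String(delivery.getBody(), StandardCharsets.UTF_8);
            // Index the object identified by the message, then acknowledge it;
            // unacknowledged messages are redelivered by the broker instead of
            // sitting in a database row with a try-count.
            System.out.println("Indexing " + pid);
            channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
        };

        // autoAck=false so failures lead to redelivery rather than silent loss.
        channel.basicConsume(queueName, false, callback, consumerTag -> { });
    }
}
```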

There are still some background processes that use Hazelcast (and would benefit from using the dataone-indexer), but due to resourcing issues not all of them have been addressed yet.