DataONEorg / d1_cn_index_processor

The CN index processor component

Indexing issues with failed jobs #37

Open doulikecookiedough opened 1 week ago

doulikecookiedough commented 1 week ago

There appears to be an indexing issue that is preventing the queue from moving along. Investigate and resolve.

Additional Context

doulikecookiedough commented 1 week ago

@taojing2002 The diagram for processIndexTaskQueue is split into two parts for easier viewing.

After reviewing the process, it seems that failed tasks could be held up indefinitely (and stuck in their current states, like what we see in Matt's examples) if the try-count for those tasks has already exceeded the max (which appears to be set to 8). There does not appear to be any code that handles further clean-up of tasks that have exceeded the max try-count.
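
To illustrate the gap, here is a minimal sketch of the kind of clean-up that seems to be missing. The class, finder method, and setters below are assumptions for illustration only, not the actual d1_cn_index_processor API:

```java
import java.util.List;

// Hypothetical sketch, not actual d1_cn_index_processor code: one way tasks whose
// try-count has reached maxTryCount (and so are never returned by the
// tryCount < maxTryCount queries) could be swept back into the retry path.
public class ExhaustedTaskSweeper {

    // Minimal stand-ins for the real IndexTask / repository types.
    interface IndexTask {
        String STATUS_NEW = "NEW";
        String STATUS_FAILED = "FAILED";
        void setTryCount(int count);
        void setStatus(String status);
    }

    interface IndexTaskRepository {
        // Assumed finder; the real repository exposes different query methods.
        List<IndexTask> findByStatusAndTryCountGreaterThanEqual(String status, int tryCount);
        IndexTask save(IndexTask task);
    }

    private final IndexTaskRepository repo;
    private final int maxTryCount;  // "dataone.indexing.processing.max.tryCount", default 8

    public ExhaustedTaskSweeper(IndexTaskRepository repo, int maxTryCount) {
        this.repo = repo;
        this.maxTryCount = maxTryCount;
    }

    // Reset exhausted FAILED tasks so they re-enter the normal retry queue
    // instead of sitting in FAILED forever.
    public void sweep() {
        List<IndexTask> exhausted =
                repo.findByStatusAndTryCountGreaterThanEqual(IndexTask.STATUS_FAILED, maxTryCount);
        for (IndexTask task : exhausted) {
            task.setTryCount(0);
            task.setStatus(IndexTask.STATUS_NEW);
            repo.save(task);
        }
    }
}
```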

[image: processIndexTaskQueue flow diagram]

processIndexTaskQueue Flow - Mermaid Syntax

```mermaid
zenuml
    title processIndexTaskQueue Flow
    IndexTaskProcessor.processIndexTaskQueue() {
        // - From settings "dataone.indexing.processing.max.tryCount"
        maxTryCount (default: 8)
        // - Use Spring framework to efficiently get the queue
        queue = IndexTaskRepository.getIndexTaskQueue() {
            // - findByStatusAndTryCountLessThanOrderByPriorityAscTaskModifiedDateAsc ...
            // - New index tasks with less than `maxTryCount` try-count
            //   in the index queue will be processed.
            // - Resource maps will sometimes be set to status NEW
            //   even though the indexing failed
            findAsc(IndexTask.STATUS_NEW, maxTryCount)
        }
        task = getNextIndexTask(queue)
        while (task != null) {
            processTaskOnThread(task)
            task = getNextIndexTask(queue)
        }
        processFailedIndexTaskQueue() {
        }
    }
```
[image: processFailedIndexTaskQueue flow diagram]

processFailedIndexTaskQueue Flow - Mermaid Syntax

```mermaid
zenuml
    title processFailedIndexTaskQueue Flow
    IndexTaskProcessor.processIndexTaskQueue() {
        // - Set up process
        // - Get queue and retrieve tasks
        // - processTaskOnThread
        processFailedIndexTaskQueue() {
            retryQueue = getIndexTaskRetryQueue() {
                // - findByStatusAndNextExecutionLessThanAndTryCountLessThan ...
                // - Failed index tasks with less than `maxTryCount` try-count
                //   in the index queue will be processed.
                findThan(IndexTask.STATUS_FAILED, System.currentTimeMillis(), maxTryCount)
            }
            task = getNextIndexTask(retryQueue)
            while (task != null) {
                processTaskOnThread(task)
                task = getNextIndexTask(queue)
            }
        }
        return
    }
```
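
Condensing the two diagrams above, the overall flow is roughly the following. This is a sketch only; the method names (findAsc, findThan, getNextIndexTask, processTaskOnThread) mirror the diagram labels and may not match the real signatures:

```java
import java.util.List;

// Sketch of the flow shown in the two diagrams above; method names mirror the
// diagram labels and are not the exact d1_cn_index_processor API.
abstract class IndexTaskQueueFlowSketch {

    static final int MAX_TRY_COUNT = 8;  // "dataone.indexing.processing.max.tryCount"

    void processIndexTaskQueue() {
        // Phase 1: NEW tasks with tryCount < maxTryCount,
        // ordered by priority and task-modified date.
        List<Object> queue = findAsc("NEW", MAX_TRY_COUNT);
        Object task = getNextIndexTask(queue);
        while (task != null) {
            processTaskOnThread(task);
            task = getNextIndexTask(queue);
        }
        processFailedIndexTaskQueue();
    }

    void processFailedIndexTaskQueue() {
        // Phase 2: FAILED tasks whose nextExecution time has passed
        // and whose tryCount is still below the maximum.
        List<Object> retryQueue = findThan("FAILED", System.currentTimeMillis(), MAX_TRY_COUNT);
        Object task = getNextIndexTask(retryQueue);
        while (task != null) {
            processTaskOnThread(task);
            task = getNextIndexTask(retryQueue);
        }
        // Tasks whose tryCount has already reached the maximum match neither query,
        // so they are never picked up again.
    }

    // Stand-ins for the repository/processor calls shown in the diagrams.
    abstract List<Object> findAsc(String status, int maxTryCount);
    abstract List<Object> findThan(String status, long nextExecutionBefore, int maxTryCount);
    abstract Object getNextIndexTask(List<Object> queue);
    abstract void processTaskOnThread(Object task);
}
```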
mbjones commented 1 week ago

Thanks for those diagrams, @doulikecookiedough. Could you create a state diagram that shows the possible state transitions among the NEW, IN_PROCESS, FAILED, and COMPLETE statuses? Here's a rough cut as a starting point for you to edit/comment:

```mermaid
stateDiagram-v2
    [*] --> NEW
    NEW --> IN_PROCESS
    IN_PROCESS --> [*]
    IN_PROCESS --> FAILED
    IN_PROCESS --> COMPLETE
    FAILED --> NEW
    COMPLETE --> [*]
    FAILED --> [*]
```
doulikecookiedough commented 1 week ago

@mbjones Please see below for the state diagram.

```mermaid
stateDiagram-v2
    [*] --> NEW: On creation of 'IndexTask' task object
    NEW --> IN_PROCESS: getNextIndexTask
    IN_PROCESS --> NEW: if object path is not ready
    IN_PROCESS --> FAILED: if != InterruptedException
    IN_PROCESS
    FAILED --> NEW: if = InterruptedException
    note right of IN_PROCESS: There seems to be no part where a status is marked as 'COMPLETE'. Once a task has been taken care of, it ends by being removed from the seriesIdSet (ex. seriesIdsSet.remove(id))
```

Notes:

Relevant Classes for Quick Reference:

mbjones commented 1 week ago

Thanks, Dou. That's a helpful discovery about COMPLETE. Here's another question: is there any way for a task to end up in FAILED and stay there forever? I think there is (some tasks FAIL and transition back to NEW; others FAIL and stay there forever). Can you confirm or refute that? If we add the end state of the tasks into the diagram, and add a "DELETED" state to represent tasks that complete successfully and get deleted from the system, then I think the graph would look like the one below. I put in two placeholders for conditions where tasks get permanently stuck in the FAILED or IN_PROCESS state; those can be removed if tasks never get stuck there.

```mermaid
stateDiagram-v2
    [*] --> NEW: On creation of 'IndexTask' task object
    NEW --> IN_PROCESS: getNextIndexTask
    IN_PROCESS --> NEW: if object path is not ready
    IN_PROCESS --> FAILED: if != InterruptedException
    IN_PROCESS --> DELETED: deleted upon successful index
    DELETED --> [*]
    IN_PROCESS --> [*]: when do in process tasks get permanently stuck?
    IN_PROCESS
    FAILED --> NEW: if = InterruptedException
    FAILED --> [*]: when do failed tasks get permanently stuck?
    note right of IN_PROCESS: There seems to be no part where a status is marked as 'COMPLETE'. Once a task has been taken care of, it ends by being removed from the seriesIdSet (ex. seriesIdsSet.remove(id))
```
doulikecookiedough commented 4 days ago

> Is there any way for a task to end up in FAILED and stay there forever?

Yes, a task can end up in a FAILED state forever. When a thread is interrupted (i.e., an InterruptedException is thrown), the task is set back to NEW. For all other exceptions, it is marked as FAILED, and there is no path by which it can transition back to NEW at that point.

A task can also potentially remain IN_PROCESS forever if the process is forcefully killed: the catch blocks won't be executed, and the finally block only manages/cleans up the resource map and series id.

```mermaid
stateDiagram-v2
    [*] --> NEW: On creation of 'IndexTask' task object
    NEW --> IN_PROCESS: getNextIndexTask
    IN_PROCESS --> NEW: if object path is not ready
    IN_PROCESS --> FAILED: if an exception occurs
    IN_PROCESS --> DELETED: deleted upon successful index
    DELETED --> [*]
    IN_PROCESS --> [*]: when the process is forcefully shut down
    IN_PROCESS
    FAILED --> NEW: if = InterruptedException
    FAILED --> [*]: if != InterruptedException
    note right of IN_PROCESS: There seems to be no part where a status is marked as 'COMPLETE'. Once a task has been taken care of, it ends by being removed from the seriesIdSet (ex. seriesIdsSet.remove(id))
```
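
To make the outcomes above concrete, here is a rough sketch of the control flow being described. This is not the actual processTaskOnThread body; repo, seriesIdsSet, and index() are stand-ins for illustration only:

```java
import java.util.Set;

// Rough sketch of the behavior described above, not the real processTaskOnThread code:
// success deletes the task (so there is never a COMPLETE status), InterruptedException
// resets it to NEW, any other exception marks it FAILED, and a hard kill skips the
// catch blocks entirely while the finally block only does series-id bookkeeping.
abstract class TaskOutcomeSketch {

    interface IndexTask {
        String STATUS_NEW = "NEW";
        String STATUS_FAILED = "FAILED";
        String getId();
        int getTryCount();
        void setTryCount(int count);
        void setStatus(String status);
    }

    interface IndexTaskRepository {
        void save(IndexTask task);
        void delete(IndexTask task);
    }

    IndexTaskRepository repo;
    Set<String> seriesIdsSet;

    void processTask(IndexTask task) {
        try {
            index(task);           // the actual indexing work
            repo.delete(task);     // success: the task row is deleted, never marked COMPLETE
        } catch (InterruptedException e) {
            task.setStatus(IndexTask.STATUS_NEW);      // interrupted thread: back to NEW
            repo.save(task);
        } catch (Exception e) {
            task.setStatus(IndexTask.STATUS_FAILED);   // any other exception: FAILED
            task.setTryCount(task.getTryCount() + 1);
            repo.save(task);
        } finally {
            // Only resource-map / series-id bookkeeping happens here, so if the JVM is
            // killed outright the task's status in the database stays IN_PROCESS.
            seriesIdsSet.remove(task.getId());
        }
    }

    abstract void index(IndexTask task) throws Exception;
}
```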
doulikecookiedough commented 4 days ago

I also synced up with Jing: the reason there is no COMPLETE status at the end is that tasks are removed upon completion. If a COMPLETE status were part of the process, the number of retained tasks would also grow extremely large (and not be particularly helpful).

The current index processor also relies on the old Hazelcast process rather than the new dataone-indexer, meaning it still reads tasks from a database, whereas the new process retrieves them from RabbitMQ. To use the new dataone-indexer, the Hazelcast-related code would need to be removed (and the existing code refactored accordingly).
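
For contrast, consumption in the dataone-indexer style looks roughly like this. The sketch uses the standard RabbitMQ Java client; the queue name, message format, and broker location are assumptions, not the actual dataone-indexer configuration:

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;

import java.nio.charset.StandardCharsets;

// Sketch only: push-based consumption of index tasks from RabbitMQ, as opposed to the
// current processor polling the index task table. Queue name and message body are
// hypothetical, not the actual dataone-indexer configuration.
public class IndexQueueConsumerSketch {

    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");                      // assumed broker location
        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();

        String queueName = "index.tasks";                  // hypothetical queue name
        channel.queueDeclare(queueName, true, false, false, null);

        DeliverCallback callback = (consumerTag, delivery) -> {
            String pid = new String(delivery.getBody(), StandardCharsets.UTF_8);
            // Index the object identified by the message, then acknowledge it;
            // unacknowledged messages are redelivered by the broker instead of
            // sitting in a database row with a try-count.
            System.out.println("Indexing " + pid);
            channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
        };

        // autoAck=false so failures lead to redelivery rather than silent loss.
        channel.basicConsume(queueName, false, callback, consumerTag -> { });
    }
}
```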

There are still some background processes that use Hazelcast (and would benefit from using the dataone-indexer), but due to resourcing issues not all of them have been addressed yet.