Open doulikecookiedough opened 1 week ago
@taojing2002 The diagram for `processIndexTaskQueue` is split into two parts for easier viewing.
After reviewing the process, it seems that failed tasks could be held up indefinitely (stuck in their current states, as in Matt's examples) if their try-count has already exceeded the max (which appears to be set to `8`). There does not appear to be any code that cleans up tasks that have exceeded the max count.
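To make the concern concrete, here is a minimal Python sketch (not the actual indexer code) of a retry pass that skips failed tasks once their try-count has reached a cap. The names `Task`, `MAX_TRIES`, `select_retryable`, and `find_stuck` are hypothetical; only the cap of 8 comes from the review above, and whether the real comparison is strict or inclusive is an assumption.

```python
from dataclasses import dataclass

MAX_TRIES = 8  # cap observed in the review; treated here as an exclusive upper bound (assumption)


@dataclass
class Task:
    pid: str
    status: str      # "NEW", "IN PROCESS", "FAILED", or "COMPLETE"
    try_count: int


def select_retryable(tasks, max_tries=MAX_TRIES):
    """Return FAILED tasks that are still below the retry cap."""
    return [t for t in tasks if t.status == "FAILED" and t.try_count < max_tries]


def find_stuck(tasks, max_tries=MAX_TRIES):
    """Return FAILED tasks that no retry pass will pick up again."""
    return [t for t in tasks if t.status == "FAILED" and t.try_count >= max_tries]


# Made-up tasks, purely for illustration.
tasks = [
    Task("pid-a", "FAILED", 3),
    Task("pid-b", "FAILED", 8),
    Task("pid-c", "NEW", 0),
]
retryable = [t.pid for t in select_retryable(tasks)]
stuck = [t.pid for t in find_stuck(tasks)]
print("retryable:", retryable)
print("stuck:", stuck)
```

Without a separate clean-up step, anything returned by `find_stuck` would simply stay in the table in its current state.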
Thanks for those diagrams, @doulikecookiedough. Could you create a state diagram that shows the possible state transitions among the NEW, IN PROCESS, FAILED, and COMPLETE statuses? Here's a rough cut as a starting point for you to edit/comment:
stateDiagram-v2
[*] --> NEW
NEW --> IN_PROCESS
IN_PROCESS --> [*]
IN_PROCESS --> FAILED
IN_PROCESS --> COMPLETE
FAILED --> NEW
COMPLETE --> [*]
FAILED --> [*]
@mbjones Please see below for the state diagram.
stateDiagram-v2
[*] --> NEW: On creation of 'IndexTask' task object
NEW --> IN_PROCESS: getNextIndexTask
IN_PROCESS --> NEW: if object path is not ready
IN_PROCESS --> FAILED: if != InterruptedException
IN_PROCESS
FAILED --> NEW: if = InterruptedException
note right of IN_PROCESS: There seems to be no part where a status is marked as 'COMPLETE'. Once a task has been taken care of, it ends by being removed from the seriesIdSet (ex. seriesIdsSet.remove(id))
Notes:
- `IN_PROCESS` is the last state that is marked for tasks on their way to completion. `removeIdsFromResourceMapReferencedSetAndSeriesIdsSet` is called.
  - I thought at first this method would eventually mark a task as `COMPLETE`, but at this time it only removes the `pid` from the `seriesIdsSet` (and the `id` from `referencedIdsMap`).
- The `IN_PROCESS` counts from your queries are likely `COMPLETE` but are misrepresented state-wise/textually.
  - I do see the `COMPLETE` status in the `IndexTask` class, but I can't seem to locate where in the flow a task could get marked as such. There is no associated method like `markComplete` that I'm aware of.
- I will sync up with @taojing2002 to discuss the state diagram above (and confirm that tasks are never marked with `COMPLETE`).

Relevant Classes for Quick Reference:
Thanks, Dou. That's a helpful discovery about COMPLETE. Here's another question: is there any way for a task to end up in FAILED and stay there forever? I think there is (some tasks FAIL and transition back to NEW, others FAIL and stay there forever). Can you confirm or negate that? If we add the end state of the tasks to the diagram, and add a "DELETED" state to represent tasks that complete successfully and get deleted from the system, then I think the graph would look like the one below. I put in two placeholders for conditions where tasks get permanently stuck in the FAILED or IN_PROCESS state; those can be removed if tasks never get stuck there.
stateDiagram-v2
[*] --> NEW: On creation of 'IndexTask' task object
NEW --> IN_PROCESS: getNextIndexTask
IN_PROCESS --> NEW: if object path is not ready
IN_PROCESS --> FAILED: if != InterruptedException
IN_PROCESS --> DELETED: deleted upon successful index
DELETED --> [*]
IN_PROCESS --> [*]: when do in process tasks get permanently stuck?
IN_PROCESS
FAILED --> NEW: if = InterruptedException
FAILED --> [*]: when do failed tasks get permanently stuck?
note right of IN_PROCESS: There seems to be no part where a status is marked as 'COMPLETE'. Once a task has been taken care of, it ends by being removed from the seriesIdSet (ex. seriesIdsSet.remove(id))
> Is there any way for a task to end up in FAILED and stay there forever?

Yes - a task can end up in a `FAILED` state forever. When a thread is interrupted (e.g. an `InterruptedException` is thrown), the task is marked as `NEW`. For all other exceptions, it is marked as `FAILED`, and there is no path by which it can transition back to `NEW` from that point.
A task can also potentially remain `IN PROCESS` forever if the process is forcefully killed: the `catch` blocks won't be executed, and the `finally` block only cleans up the resource map and series id.
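The control flow described above can be sketched roughly as follows. This is a Python analogue written for illustration, not the real Java code: `process_task`, the dict-based task, and the local `InterruptedException` class are all hypothetical stand-ins, and the real processor also maintains a resource map that is omitted here.

```python
class InterruptedException(Exception):
    """Hypothetical stand-in for Java's InterruptedException."""


def process_task(task, series_ids, index_fn):
    """Run index_fn on a task and update its status as described in the thread."""
    task["status"] = "IN PROCESS"
    series_ids.add(task["pid"])
    try:
        index_fn(task)
        task["deleted"] = True           # success: the task is removed, never marked COMPLETE
    except InterruptedException:
        task["status"] = "NEW"           # interrupted work goes back to NEW
    except Exception:
        task["status"] = "FAILED"        # nothing later moves the task out of FAILED
    finally:
        series_ids.discard(task["pid"])  # clean-up only; the status is not touched
    return task


def index_ok(task):
    pass


def index_interrupted(task):
    raise InterruptedException()


def index_broken(task):
    raise ValueError("indexing failed")


series_ids = set()
done = process_task({"pid": "a", "status": "NEW"}, series_ids, index_ok)
requeued = process_task({"pid": "b", "status": "NEW"}, series_ids, index_interrupted)
failed = process_task({"pid": "c", "status": "NEW"}, series_ids, index_broken)
print(done["status"], requeued["status"], failed["status"])
```

If the process is killed between setting `IN PROCESS` and reaching any handler, none of the status updates run, which is how a task could stay `IN PROCESS` indefinitely.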
stateDiagram-v2
[*] --> NEW: On creation of 'IndexTask' task object
NEW --> IN_PROCESS: getNextIndexTask
IN_PROCESS --> NEW: if object path is not ready
IN_PROCESS --> FAILED: If an exception occurs
IN_PROCESS --> DELETED: deleted upon successful index
DELETED --> [*]
IN_PROCESS --> [*]: When the process is forcefully shut down
IN_PROCESS
FAILED --> NEW: if = InterruptedException
FAILED --> [*]: if != InterruptedException
note right of IN_PROCESS: There seems to be no part where a status is marked as 'COMPLETE'. Once a task has been taken care of, it ends by being removed from the seriesIdSet (ex. seriesIdsSet.remove(id))
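As a cross-check, the diagram above can be written down as a small transition table. This is only a sketch: the edges are copied from the diagram, `"END"` is a made-up label for the `[*]` end marker, and the helper names are hypothetical.

```python
# Edges copied from the state diagram; "END" stands for the [*] end marker.
TRANSITIONS = {
    "NEW": {"IN_PROCESS"},
    "IN_PROCESS": {"NEW", "FAILED", "DELETED", "END"},
    "FAILED": {"NEW", "END"},
    "DELETED": {"END"},
}


def can_transition(src, dst):
    """Return True if the diagram has an edge from src to dst."""
    return dst in TRANSITIONS.get(src, set())


def states_that_can_end(transitions):
    """Return the states from which a task's lifecycle can stop."""
    return sorted(s for s, targets in transitions.items() if "END" in targets)


ending = states_that_can_end(TRANSITIONS)
print(ending)
```

Note that `COMPLETE` does not appear as a key or a target anywhere in the table, which matches the finding that no code path sets it.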
I also synced up with Jing, and the reason there is no `COMPLETE` status in the end is that the tasks get removed upon completion. If it were part of the process, the count would also become extremely large (and not very helpful).
The current index processor also relies on the old Hazelcast process - it does not use the new `dataone-indexer` - meaning it still reads the tasks from a DB, whereas the new process retrieves them from RabbitMQ. To use the new `dataone-indexer`, the Hazelcast-related code would need to be removed (and the existing code refactored accordingly).
There are still some background processes that use Hazelcast (and would benefit from using the `dataone-indexer`), but due to resourcing issues not all of them have been addressed.
There appears to be an indexing issue that is preventing the queue from moving along. Investigate and resolve.
Additional Context
Yesterday's status breakdown (2019/09/17)
Today's status breakdown (2019/09/18)
Breakdown by format
d1-index-queue=# select formatid, status, count(*) as cnt from index_task GROUP BY formatid, status ORDER BY status,formatid;
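For anyone who wants to try the shape of this query without access to the production database, here is a small sketch that runs the same `GROUP BY` against an in-memory SQLite table. The three-column schema and the rows are invented purely for illustration; the real `index_task` table has more columns and lives in PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, assumed schema; only the columns the query needs.
conn.execute("CREATE TABLE index_task (pid TEXT, formatid TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO index_task (pid, formatid, status) VALUES (?, ?, ?)",
    [
        # Made-up sample rows, not real queue contents.
        ("pid-1", "format-a", "NEW"),
        ("pid-2", "format-a", "NEW"),
        ("pid-3", "format-b", "FAILED"),
        ("pid-4", "format-a", "IN PROCESS"),
    ],
)
breakdown = conn.execute(
    "SELECT formatid, status, count(*) AS cnt FROM index_task "
    "GROUP BY formatid, status ORDER BY status, formatid"
).fetchall()
for formatid, status, cnt in breakdown:
    print(f"{formatid:10} {status:12} {cnt}")
conn.close()
```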