aneoconsulting / ArmoniK.Core

Task manager for ArmoniK (submission, scheduling, IO data, monitoring). Implements API located in https://github.com/aneoconsulting/ArmoniK.Api
https://aneoconsulting.github.io/ArmoniK.Core/
GNU Affero General Public License v3.0

fix: resubmit task when agent tries to acquire a task in Retried #762

Closed aneojgurhem closed 1 month ago

aneojgurhem commented 1 month ago

Motivation

Fix the issue in which retried tasks appear in Creating and are not processed further, preventing the application from running to completion.

To my understanding, during downscaling, preemption, or a crash (including when the worker crashes too), the agent can be stopped abruptly while it submits the retried task (task creation, task finalization, or insertion into the queue). We also observe that the agent tries to acquire the initial task, whose status is Retried, instead of the new copy of the task (the new attempt). The retried task stays in the Creating status, which shows that its finalization was not done properly. This means that the initial task's finalization has not completed properly.

Description

To mitigate this issue, we implement a recovery mechanism: when an agent tries to acquire the initial task, it properly finalizes the retried task before removing the initial task from the queue, instead of only removing the initial task from the queue.

When the agent acquires the initial task with status Retried, we check whether the retried task has been properly finalized. To do so, we fetch the metadata of the retried task. If we can retrieve it and the retried task is in Creating or Submitted, we perform task finalization and insertion into the queue again. If the task was already inserted into the queue, task deduplication should do its job and ignore the duplicate. If the retried task is in any other status, we only remove the message from the queue. If the retried task is not found, we submit it completely (creation, finalization, queueing). If the retried task was created by someone else between our read and our creation in the database, we check the status of the retried task and perform finalization if it is in Creating or Submitted.
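The decision logic described above can be sketched as follows. This is a minimal illustration in Python, not the actual ArmoniK.Core code: the names `TaskStatus`, `Action`, `RetryAlreadyExists`, `recovery_action` and `recover` are hypothetical, and the `PROCESSING` and `COMPLETED` members only stand in for "any other status".

```python
from enum import Enum
from typing import Callable, Optional


class TaskStatus(Enum):
    # Only Creating and Submitted come from the description above;
    # the other two are placeholders for "another status".
    CREATING = "Creating"
    SUBMITTED = "Submitted"
    PROCESSING = "Processing"
    COMPLETED = "Completed"


class Action(Enum):
    SUBMIT_FROM_SCRATCH = "create, finalize and enqueue the retried task"
    FINALIZE_AND_ENQUEUE = "finalize the retried task and enqueue it again"
    REMOVE_MESSAGE_ONLY = "only remove the initial task's message"


class RetryAlreadyExists(Exception):
    """Hypothetical error: the retried task was created concurrently."""


def recovery_action(retry_status: Optional[TaskStatus]) -> Action:
    """Map the retried task's status (None if not found) to an action."""
    if retry_status is None:
        return Action.SUBMIT_FROM_SCRATCH
    if retry_status in (TaskStatus.CREATING, TaskStatus.SUBMITTED):
        # Queue deduplication is expected to drop a duplicate insertion.
        return Action.FINALIZE_AND_ENQUEUE
    return Action.REMOVE_MESSAGE_ONLY


def recover(
    read_status: Callable[[], Optional[TaskStatus]],
    create_retry: Callable[[], None],
) -> Action:
    """Run the check, handling a creation race between read and write."""
    status = read_status()
    if status is None:
        try:
            create_retry()
            return Action.SUBMIT_FROM_SCRATCH
        except RetryAlreadyExists:
            # Someone created the retried task after our read: read again.
            status = read_status()
            if status is None:
                raise
    return recovery_action(status)
```

In every branch the message of the initial task is removed from the queue afterwards; the sketch only decides what has to happen to the retried task first.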

Testing

I was not able to fully reproduce this issue in the Core Docker deployment, even when stopping agents abruptly with the following script. I used a modified bench worker that only produced errors.

#!/bin/sh

# Repeatedly restart every polling agent container with SIGTERM and no grace period.
for i in $(seq 1 40); do
    for a in $(docker ps -q --filter name=armonik.compute.pollingagent); do
        docker restart -s sigterm -t 0 "$a"
        # sleep 1
    done
done

I also tried varying the delay and was still not able to reproduce the issue.

I added unit tests that put tasks in the same state that we observed. They also validate that we are able to recover from the invalid state and resubmit the retried task.

Impact

Checklist