aneoconsulting / ArmoniK.Core

Task manager for ArmoniK (submission, scheduling, IO data, monitoring). Implements API located in https://github.com/aneoconsulting/ArmoniK.Api
https://aneoconsulting.github.io/ArmoniK.Core/
GNU Affero General Public License v3.0

fix: Timeout handling after acquisition #774

Closed lemaitre-aneo closed 2 weeks ago

lemaitre-aneo commented 2 weeks ago

Motivation

Since the refactor to use the exception manager, tasks that were acquired but not processed, because the runningTaskProcessor did not finish executing the current task in the allotted time, were not released in the ArmoniK sense. The queue message was put back into the queue, but the task itself remained in the dispatched state (still acquired by the current agent).

This had two implications. First, such a task needed extra work to be re-acquired by another pod, through the message duplication algorithm. Second, the timeout was treated as an actual agent error, so the agent would become unhealthy after a few acquire timeouts.

Description

This PR adds a proper catch for the timeout and releases the task in the catch.
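
The pattern can be illustrated with a small, language-agnostic sketch. It is written in Python for brevity and is not the ArmoniK.Core implementation (which is C#). The `Agent` and `Task` classes, the `handle` method, the status strings and the `errors` counter are all hypothetical names introduced for illustration. In particular, it assumes that releasing a task means putting it back in a submitted state with no owner.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    """Hypothetical task record; the status strings are illustrative."""
    task_id: str
    status: str = "submitted"
    owner: Optional[str] = None


class Agent:
    """Hypothetical agent that acquires a task, then waits for its processor."""

    def __init__(self, name: str, acquire_timeout: float) -> None:
        self.name = name
        self.acquire_timeout = acquire_timeout
        # Set when the running task processor is ready for the next task.
        self.processor_free = asyncio.Event()
        # Failures that count against the health of the agent.
        self.errors = 0

    def acquire(self, task: Task) -> None:
        task.status = "dispatched"
        task.owner = self.name

    def release(self, task: Task) -> None:
        # Assumption: releasing returns the task to a state where any
        # other agent can acquire it directly.
        task.status = "submitted"
        task.owner = None

    async def handle(self, task: Task) -> bool:
        """Return True if the task was handed to the processor."""
        self.acquire(task)
        try:
            await asyncio.wait_for(self.processor_free.wait(),
                                   self.acquire_timeout)
        except asyncio.TimeoutError:
            # Expected situation: the processor is still busy.
            # Release the task and do not count this as an agent error.
            self.release(task)
            return False
        except Exception:
            # Anything else is a genuine failure of the agent.
            self.errors += 1
            raise
        self.processor_free.clear()
        task.status = "processing"
        return True


async def demo():
    # The processor never becomes free, so the acquire times out.
    agent = Agent("agent-0", acquire_timeout=0.05)
    task = Task("task-1")
    handed_over = await agent.handle(task)
    return handed_over, task, agent.errors
```

In this sketch the timeout has its own branch, separate from the generic error branch, so it neither leaves the task owned by the agent nor increments the error counter.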

Testing

A new test has been added to ensure that the pollster does not produce any error when the timeout occurs, and that the task is actually released properly.

Impact

This should help with long-running tasks and avoid agent restarts. It should also help improve orchestration performance for long-running tasks.

Additional Information

NA

Checklist