artefactual-sdps / enduro

A tool to support ingest and automation in digital preservation workflows
https://enduro.readthedocs.io/
Apache License 2.0

Problem: workflows can fail with activity heartbeat timeouts #960

Open jraddaoui opened 3 months ago

jraddaoui commented 3 months ago

Is your feature request related to a problem? Please describe.

In some high-load scenarios, and in environments with limited resources, we have seen workflows end unexpectedly with activity heartbeat timeouts:

2024-05-08T20:33:12.510Z    V(2)    preprocessing-worker.temporal   log/with_logger.go:84   error   {"Namespace": "default", "TaskQueue": "preprocessing", "WorkerID": "1@preprocessing-worker-0@", "WorkflowType": "preprocessing", "WorkflowID": "preprocessing-538f2b75-3175-4730-a5ee-fdfc6a8410d3", "RunID": "b5ee427d-0f6b-4f27-a0e4-f86870854fe8", "Attempt": 1, "err": "error downloading package: activity error (type: DownloadPackageActivity, scheduledEventID: 15, startedEventID: 16, identity: ): activity Heartbeat timeout (type: Heartbeat)", "error": "Workflow completed with errors!"}

Describe the solution you'd like

Thanks to the detective work done by @DanielCosme, we found that increasing the activities' HeartbeatTimeout and the worker's DefaultHeartbeatThrottleInterval and MaxHeartbeatThrottleInterval may reduce the likelihood of such timeouts. These values should be configurable so they can be set based on the expected system load and available resources.
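
As a rough illustration of what "configurable" could look like, here is a minimal sketch that parses the three values from configuration strings and sanity-checks them. The `HeartbeatConfig` type, its field names, and `ParseHeartbeatConfig` are hypothetical and are not Enduro's actual configuration schema. The comments note which Temporal Go SDK fields each value is meant for (`workflow.ActivityOptions.HeartbeatTimeout`, `worker.Options.DefaultHeartbeatThrottleInterval`, `worker.Options.MaxHeartbeatThrottleInterval`). The check that the throttle interval does not exceed its maximum is an assumption of this sketch, not something the SDK is claimed to enforce.

```go
package main

import (
	"fmt"
	"time"
)

// HeartbeatConfig is a hypothetical config section; the names are
// illustrative only and do not mirror Enduro's real settings.
type HeartbeatConfig struct {
	// Meant for workflow.ActivityOptions.HeartbeatTimeout.
	ActivityTimeout time.Duration
	// Meant for worker.Options.DefaultHeartbeatThrottleInterval.
	ThrottleInterval time.Duration
	// Meant for worker.Options.MaxHeartbeatThrottleInterval.
	MaxThrottleInterval time.Duration
}

// parseNonNegative parses a duration string such as "90s". An empty
// string yields zero, which callers can treat as "not set".
func parseNonNegative(name, s string) (time.Duration, error) {
	if s == "" {
		return 0, nil
	}
	d, err := time.ParseDuration(s)
	if err != nil {
		return 0, fmt.Errorf("%s: %w", name, err)
	}
	if d < 0 {
		return 0, fmt.Errorf("%s: must not be negative, got %s", name, d)
	}
	return d, nil
}

// ParseHeartbeatConfig builds a HeartbeatConfig from raw config strings.
func ParseHeartbeatConfig(timeout, throttle, maxThrottle string) (HeartbeatConfig, error) {
	var (
		c   HeartbeatConfig
		err error
	)
	if c.ActivityTimeout, err = parseNonNegative("activity timeout", timeout); err != nil {
		return HeartbeatConfig{}, err
	}
	if c.ThrottleInterval, err = parseNonNegative("throttle interval", throttle); err != nil {
		return HeartbeatConfig{}, err
	}
	if c.MaxThrottleInterval, err = parseNonNegative("max throttle interval", maxThrottle); err != nil {
		return HeartbeatConfig{}, err
	}
	// Sanity check chosen for this sketch: when both are set, the
	// throttle interval should not exceed its configured maximum.
	if c.MaxThrottleInterval > 0 && c.ThrottleInterval > c.MaxThrottleInterval {
		return HeartbeatConfig{}, fmt.Errorf("throttle interval %s exceeds max %s",
			c.ThrottleInterval, c.MaxThrottleInterval)
	}
	return c, nil
}

func main() {
	// Placeholder values, not recommendations.
	cfg, err := ParseHeartbeatConfig("90s", "10s", "30s")
	if err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

The parsed durations would then be copied into the worker and activity options when the worker starts, with operators picking values to match their load and resources.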

Describe alternatives you've considered

Over-provision everywhere!

Additional context

See @DanielCosme's PR implementing this solution in artefactual-labs/enduro: https://github.com/artefactual-labs/enduro/pull/612

DanielCosme commented 3 months ago

I went deep down the rabbit hole. Being able to configure the timeout values is definitely useful, and a must-have given the variety of environments this system can run in. However, the root cause of the timeout failures at a high SIP count in a queue was different: I was able to eliminate the timeouts for up to 30k queued SIPs (I did not test beyond that) by configuring the number of concurrent workflows the worker is willing to process at a time. See this PR: https://github.com/artefactual-labs/enduro/pull/616 @jraddaoui
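
To make the idea concrete, here is a minimal, stdlib-only sketch of bounding how many queued items are processed at once. It is not the code from the PR, and it does not use the Temporal SDK; `runBounded` is a hypothetical helper that only illustrates the general technique (a counting semaphore built from a buffered channel). In a Temporal worker the limit would instead be set through the worker's concurrency options; which option the PR uses is not shown here.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runBounded processes n queued items with at most limit of them running
// at the same time, and returns the peak concurrency it observed.
// A limit below 1 is treated as 1 so the queue can always make progress.
func runBounded(n, limit int, work func(i int)) int {
	if limit < 1 {
		limit = 1
	}
	sem := make(chan struct{}, limit) // one slot per in-flight item

	var (
		wg            sync.WaitGroup
		mu            sync.Mutex
		running, peak int
	)
	for i := 0; i < n; i++ {
		sem <- struct{}{} // blocks while limit items are already in flight
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot once this item is done

			mu.Lock()
			running++
			if running > peak {
				peak = running
			}
			mu.Unlock()

			work(i)

			mu.Lock()
			running--
			mu.Unlock()
		}(i)
	}
	wg.Wait()
	return peak
}

func main() {
	// Simulate a large queue of SIPs with a small concurrency limit.
	// The numbers are placeholders, not tuning advice.
	peak := runBounded(1000, 8, func(int) {
		time.Sleep(time.Millisecond) // stand-in for real work
	})
	fmt.Println("peak concurrent items:", peak)
}
```

However deep the queue gets, the number of items competing for CPU, memory, and I/O stays at or below the limit, which is the property that matters when resources are tight.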