There are several error categories that can be identified based on the log files.
Docker or Network problems
Reasons:
the Docker daemon is not running
the network is unavailable for pulling the Docker image, or the image does not exist
Markers:
"Docker is not available for this tool" in the log file
Actions:
contact the support team
Workflow was stopped
Reasons:
not enough resources, so Airflow killed the process
something external set the task status to Failed
Markers:
"ERROR - Received SIGTERM. Terminating subprocesses" in the log file
Actions:
restart with lower threads or memory parameters
Workflow step failed
Reasons:
anything that can cause cwltool to return exit status -1 from the tool
Markers:
"Failed to run workflow step" in the log file
Actions:
contact the support team
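The marker lookup for a single log can be sketched as follows. This is only an illustration, not the actual implementation: the names ERROR_MARKERS and classify_log are hypothetical, and the sketch assumes that a plain substring search is enough and that categories are checked in the order listed above.

```python
# Marker substrings copied from the categories above, in the order listed.
# ERROR_MARKERS and classify_log are hypothetical names used for illustration.
ERROR_MARKERS = [
    ("Docker or Network problems", "Docker is not available for this tool"),
    ("Workflow was stopped", "ERROR - Received SIGTERM. Terminating subprocesses"),
    ("Workflow step failed", "Failed to run workflow step"),
]


def classify_log(log_text):
    """Return the first category whose marker occurs in log_text, else None."""
    for category, marker in ERROR_MARKERS:
        if marker in log_text:
            return category
    return None
```

Because the loop returns on the first match, a log that contains several markers is assigned only the category that comes first in the list.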
Notes:
The error is reported only after the whole workflow has been marked as Failed. For task failures it is always "". This lets us send a single message with a meaningful error description at the end of the workflow execution.
We parse the log file only for the latest task retry, because a failed workflow means that none of the earlier retries succeeded, so the latest log either repeats the earlier ones or shows the most recent reason for the failure.
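Picking the log of the latest retry could look like the sketch below. It assumes that each task has its own log directory and that log files are named by try number (1.log, 2.log, and so on). That layout is an assumption, not something stated here, and latest_retry_log is a hypothetical helper.

```python
from pathlib import Path


def latest_retry_log(task_log_dir):
    """Return the log with the highest try number in task_log_dir, or None.

    Assumes files are named "<try_number>.log"; other files are ignored.
    """
    numbered = []
    for path in Path(task_log_dir).glob("*.log"):
        if path.stem.isdigit():
            numbered.append((int(path.stem), path))
    if not numbered:
        return None
    # Compare try numbers as integers so that 10.log sorts after 2.log.
    return max(numbered, key=lambda item: item[0])[1]
```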
All error categories are sorted by priority, from highest to lowest. We report only one error category per failed task: the highest-priority one, which is the first one found. Error categories from all failed tasks are then combined and deduplicated. The "Workflow step failed" category is additionally filled with the ids of the failed tasks.
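The combining step can be sketched as below. The shape of the result (a dictionary from category to a list of task ids) is an assumption made for illustration, and combine_categories is a hypothetical name. The input is expected to hold the single category already chosen for each failed task.

```python
STEP_FAILED = "Workflow step failed"


def combine_categories(task_categories):
    """Merge per-task categories into one deduplicated report.

    task_categories maps a failed task id to the single category found
    in that task's log, or None when no marker matched.
    """
    report = {}
    for task_id, category in task_categories.items():
        if category is None:
            continue
        # setdefault keeps one entry per category, which deduplicates them.
        failed_ids = report.setdefault(category, [])
        # Only the "Workflow step failed" category carries task ids.
        if category == STEP_FAILED:
            failed_ids.append(task_id)
    return report
```

With two tasks reporting "Workflow step failed" and two reporting "Workflow was stopped", the result has two entries, and only the first lists its task ids.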