19950813 closed this issue 3 years ago
Hello @19950813,
If I understood you correctly, you have commands that sometimes fail after 24 hours or so, but ecFlow does not see it as an error (the task seems to succeed). Or are you saying that the ecFlow task fails, but the job output reports no reason for it, so you do not know why the task failed? If the latter, then I suspect that you need to set the timeout. See https://confluence.ecmwf.int/display/ECFLOW/Glossary - there is an environment variable ECF_TIMEOUT, which defaults to 24 hours (the value is given in seconds). I think you should be able to set this at the level of the specific task, or anywhere above it.
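As a minimal sketch of one way to try this, assuming ecflow_client reads ECF_TIMEOUT from the job's environment as the Glossary entry describes, the variable could be exported in the job header next to the other ECF_ exports. The value 3600 is only an illustrative choice, not a recommendation.

```shell
# Sketch for the job header (e.g. head.h). Assumption: ecflow_client picks up
# ECF_TIMEOUT from the environment, as described in the Glossary.
# 3600 is an arbitrary illustrative value, in seconds.
export ECF_TIMEOUT=3600

# Echo the value so it shows up in the job output for later debugging.
echo "ECF_TIMEOUT=${ECF_TIMEOUT} seconds"
```

Placing the export before the first ecflow_client call means every child command in the job sees the same value.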
I hope this helps.
Best regards, Iain
Hi @iainrussell, it is the latter case. The process that performed the task became a zombie process. Thank you.
Thanks @19950813 - are you happy if I close this issue now?
Hi @iainrussell: Yes, of course, my problem has been solved. Thank you.
Hi @iainrussell: I have run into another problem. After a task fails abnormally, it is executed again. Can this behaviour be turned off?
Hi @19950813,
I think this is also answered on the same page - try setting ECF_TRIES=1 in your suite. Search for ECF_TRIES here: https://confluence.ecmwf.int/display/ECFLOW/Glossary
Cheers, Iain
Hi @iainrussell, thank you so much for your advice again!
Hi, I set it up as you suggested, but it still doesn't work: the task ran twice when execution failed. Below is my head.h:
#!%SHELL:/bin/ksh%
set -e # stop the shell on first error
set -u # fail when using an undefined variable
set -x # echo script lines as they are executed
set -o pipefail # fail if last(rightmost) command exits with a non-zero status
# Defines the variables that are needed for any communication with ECF
export ECF_PORT=%ECF_PORT% # The server port number
export ECF_HOST=%ECF_HOST% # The host name where the server is running
export ECF_NAME=%ECF_NAME% # The name of this current task
export ECF_PASS=%ECF_PASS% # A unique password
export ECF_TRYNO=%ECF_TRYNO% # Current try number of the task
export ECF_RID=$$ # record the process id. Also used for zombie detection
# Sets the script's rerun-on-failure policy; the default is 2
export ECF_TRIES=1
export error_msg='error'
# Define the path where to find ecflow_client
# make sure client and server use the *same* version.
# Important when there are multiple versions of ecFlow
export PATH=/usr/local/apps/ecflow/%ECF_VERSION%/bin:$PATH
# Tell ecFlow we have started
ecflow_client --init=$$
# Define an error handler
ERROR() {
    set +e # Clear -e flag, so we don't fail
    wait # wait for background process to stop
    echo "$error_msg"
    curl "%log_url%/%task_id%/3/$error_msg"
    curl "%log_url%/%task_parent_id%/3/$error_msg"
    ecflow_client --abort=trap # Notify ecFlow that something went wrong, using 'trap' as the reason
    trap 0 # Remove the trap
    exit 0 # End the script
}
# Trap any calls to exit and errors caught by the -e flag
trap ERROR 0
# Trap any signal that may cause the script to fail
trap '{ echo "Killed by a signal"; ERROR ; }' 1 2 3 4 5 6 7 8 10 12 13 15
Hi @19950813,
Actually ECF_TRIES is a suite variable, not an environment variable. In your suite definition you will need something like:
task test
  edit ECF_TRIES '1'
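The reason the export in head.h had no effect can be illustrated with a small shell sketch that does not use ecFlow at all. Here `server_tries` is a hypothetical stand-in for the value the server keeps in the suite definition, and the `sh -c` child process stands in for a job script. The point is only that a variable exported inside a job's process is never visible to the process that launched it.

```shell
# Start from a clean state so the demonstration is reproducible.
unset ECF_TRIES

# Hypothetical stand-in for the server-side setting (2 is the default
# mentioned in the thread). The real value lives in the suite definition.
server_tries=2

# Stand-in for a job script: a separate process that exports ECF_TRIES.
job_tries=$(sh -c 'export ECF_TRIES=1; echo "$ECF_TRIES"')

echo "inside the job:  ECF_TRIES=${job_tries}"
echo "in the launcher: ECF_TRIES=${ECF_TRIES:-unset}"
echo "launcher still retries up to ${server_tries} times"
```

The job sees its own exported value, but the launcher's setting is unchanged, which is why the variable has to be set with `edit` in the suite definition instead.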
Hi @iainrussell, thank you so much for your advice again!
You're very welcome!