ecmwf / ecflow

ECMWF's workflow manager
Apache License 2.0

About the longest running time of a task #10

Closed 19950813 closed 3 years ago

19950813 commented 3 years ago
iainrussell commented 3 years ago

Hello @19950813,

If I understood you correctly, you have commands that sometimes fail after 24 hours or so, but ecFlow does not see it as an error (the task seems to succeed). Or are you saying that the ecFlow task fails, but the job output reports no reason for it and therefore you do not know why the task failed? If the latter, then I suspect that you need to set the timeout. See https://confluence.ecmwf.int/display/ECFLOW/Glossary - there is an environment variable ECF_TIMEOUT, which defaults to 24 hours (the value is given in seconds). I think you should be able to set this at the level of the specific task, or anywhere above it.
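
For illustration only, a minimal sketch of overriding the timeout in head.h could look like the following. It assumes the variable is exported before the first ecflow_client call, and the 48-hour value is an arbitrary example, not a recommendation.

```shell
# Hypothetical addition to head.h, placed before the first ecflow_client call.
# ECF_TIMEOUT is given in seconds; 48 hours is an arbitrary example value.
export ECF_TIMEOUT=$((48 * 60 * 60))
echo "ECF_TIMEOUT=${ECF_TIMEOUT}"
```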

I hope this helps.

Best regards, Iain

19950813 commented 3 years ago

Hi @iainrussell, it is the latter case. The process that ran the task became a zombie process. Thank you.

iainrussell commented 3 years ago

Thanks @19950813 - are you happy if I close this issue now?

19950813 commented 3 years ago

Hi @iainrussell: Yes, of course, my problem has been solved. Thank you.

19950813 commented 3 years ago

Hi @iainrussell: I have suddenly run into another problem. After a task fails abnormally, it is executed again. Can this behaviour be turned off?

iainrussell commented 3 years ago

Hi @19950813,

I think this is also answered in the same page - try setting ECF_TRIES=1 in your suite. Search for ECF_TRIES here: https://confluence.ecmwf.int/display/ECFLOW/Glossary

Cheers, Iain

19950813 commented 3 years ago

Hi @iainrussell, thank you so much for your advice again!

19950813 commented 3 years ago

Hi, I set it up as you suggested, but it still doesn't work: the task ran twice when the execution failed. Below is my head.h:


#!%SHELL:/bin/ksh%
set -e          # stop the shell on first error
set -u          # fail when using an undefined variable
set -x          # echo script lines as they are executed
set -o pipefail # a pipeline fails if any command in it exits with a non-zero status

# Defines the variables that are needed for any communication with ECF
export ECF_PORT=%ECF_PORT%    # The server port number
export ECF_HOST=%ECF_HOST%    # The host name where the server is running
export ECF_NAME=%ECF_NAME%    # The name of this current task
export ECF_PASS=%ECF_PASS%    # A unique password
export ECF_TRYNO=%ECF_TRYNO%  # Current try number of the task
export ECF_RID=$$             # record the process id. Also used for zombie detection
# Sets the script's rerun-on-failure policy; the default is 2
export ECF_TRIES=1
export error_msg='error'

# Define the path where to find ecflow_client
# make sure client and server use the *same* version.
# Important when there are multiple versions of ecFlow
export PATH=/usr/local/apps/ecflow/%ECF_VERSION%/bin:$PATH

# Tell ecFlow we have started
ecflow_client --init=$$

# Define a error handler
ERROR() {
   set +e                      # Clear -e flag, so we don't fail
   wait                        # wait for background process to stop
   echo $error_msg
   curl %log_url%/%task_id%/3/$error_msg
   curl %log_url%/%task_parent_id%/3/$error_msg
   ecflow_client --abort=trap  # Notify ecFlow that something went wrong, using 'trap' as the reason
   trap 0                      # Remove the trap
   exit 0                      # End the script
}

# Trap any calls to exit and errors caught by the -e flag
trap ERROR 0

# Trap any signal that may cause the script to fail
trap '{ echo "Killed by a signal"; ERROR ; }' 1 2 3 4 5 6 7 8 10 12 13 15

iainrussell commented 3 years ago

Hi @19950813,

Actually ECF_TRIES is a suite variable, not an environment variable. In your suite definition you will need something like

task test
    edit ECF_TRIES '1'
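
A slightly fuller sketch, assuming hypothetical suite and family names (test_suite, f1) and a scratch file path, sets the variable once at suite level so that every task below inherits it. The shell snippet only writes the definition file; loading it into a server is left as a commented-out step.

```shell
# Write a minimal suite definition to a scratch file (path is an assumption).
def_file="${TMPDIR:-/tmp}/ecf_tries_demo.def"
cat > "$def_file" <<'EOF'
suite test_suite
    edit ECF_TRIES '1'
    family f1
        task test
    endfamily
endsuite
EOF

# Show where the variable is defined in the generated file.
grep -n "ECF_TRIES" "$def_file"

# To load it into a running server (not executed here):
# ecflow_client --load="$def_file"
```

With the variable defined in the suite like this, the export ECF_TRIES=1 line in head.h should not be needed, since it is read from the suite definition and not from the job's environment.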

19950813 commented 3 years ago

Hi @iainrussell, thank you so much for your advice again!

iainrussell commented 3 years ago

You're very welcome!