cgat-developers / cgat-flow

MIT License

Implement an option to restart jobs terminated by SIGUSR2 #39

Closed logust79 closed 5 years ago

logust79 commented 6 years ago

Occasionally some of my jobs are terminated by SIGUSR2, seemingly at random. Do you think it would be a good idea to introduce an option to restart a terminated job?

sebastian-luna-valero commented 6 years ago

Hi,

The fact that your jobs are getting a signal means that there is a different underlying issue and you might want to investigate that further.

Which batch scheduler are you using?

In SGE, there is the -notify option, which has been introduced in cgat-core. That was not the case in the CGATOxford repository.

Below is the explanation:

       -notify

              This flag, when set causes Grid Engine to send "warning" signals to a running job prior to sending the signals themselves. If a SIGSTOP is pending, the job will receive a SIGUSR1 several seconds
              before the SIGSTOP. If a SIGKILL is pending, the job will receive a SIGUSR2 several seconds before the SIGKILL. This option provides the running job, before receiving the SIGSTOP or SIGKILL, a
              configured time interval to do e.g. cleanup operations. The amount of time delay is controlled by the notify parameter in each queue configuration
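To make the warning signals concrete, here is a minimal Python sketch of a job that catches SIGUSR1 and SIGUSR2 so it can clean up before the SIGSTOP or SIGKILL arrives. The names `received` and `on_warning` are made up for illustration and are not part of cgat-core. The last line only simulates the scheduler by signalling the current process.

```python
import os
import signal

received = []  # signal numbers seen so far (hypothetical bookkeeping)


def on_warning(signum, frame):
    # A real job would flush output or remove temporary files here;
    # this sketch only records which warning signal arrived.
    received.append(signum)


# Install the same handler for both warning signals sent under -notify.
signal.signal(signal.SIGUSR1, on_warning)
signal.signal(signal.SIGUSR2, on_warning)

# Simulate the scheduler's warning by signalling our own process.
os.kill(os.getpid(), signal.SIGUSR2)
```

Without such a handler, the default action for SIGUSR2 is to terminate the process, which would match the symptom described above.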

We could try adding the -r option to SGE if that's useful:

       -r y[es]|n[o]

              Identifies the ability of a job to be rerun or not. If the value of -r is 'yes', the job will be rerun if the job was aborted without leaving a consistent exit state. (This is typically the case if
              the node on which the job is running crashes). If -r is 'no', the job will not be rerun under any circumstances.
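Whether -r covers a job killed after a SIGUSR2 warning likely depends on the cluster configuration. As an alternative, a restart on the client side could be sketched roughly as below. This assumes the job runs as a local child process on a POSIX system. `run_with_retry` and its parameters are hypothetical and do not exist in cgat-core or cgat-flow, and jobs submitted through a scheduler may report their exit status differently.

```python
import signal
import subprocess


def run_with_retry(cmd, max_attempts=3, retry_signal=signal.SIGUSR2):
    """Run cmd, rerunning it while it is killed by retry_signal.

    Returns a (returncode, attempts) tuple. Hypothetical helper, not a
    cgat-core function.
    """
    returncode = None
    for attempt in range(1, max_attempts + 1):
        returncode = subprocess.run(cmd).returncode
        # On POSIX, a return code of -N means the child died from signal N.
        if returncode != -retry_signal:
            return returncode, attempt
    # Every attempt was terminated by retry_signal; give up.
    return returncode, max_attempts
```

For example, `run_with_retry(["my_job.sh"])` (with a made-up script name) would rerun the script up to three times if it keeps being terminated by SIGUSR2, and stop at the first run that ends any other way. Capping the attempts matters, because a job that is killed for a systematic reason such as exceeding a resource limit would otherwise loop forever.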

I hope that helps.

Best regards, Sebastian

Acribbs commented 5 years ago

Closed, but please reopen if required.