TritonDataCenter / node-workflow

Task orchestration, creation and running using NodeJS
MIT License
456 stars 66 forks source link

Task runner does not appear to handle retry of tasks consistently on timeout #145

Closed ghost closed 8 years ago

ghost commented 9 years ago

Consider the following workflow:

{
    name: 'TestWorkflow',
    timeout: 300,
    chain: [
        {
            name: 'DoFirstTask',
            timeout: 5,
            retry: 10,
            body: function(job,cb){
              job.foo = (job.foo || 0) + 1;
              console.log(job.foo);
              if(job.foo > 5) {
                console.log('i should be done now');
                return cb(null,job.foo);
              }
            },
            fallback: function(err,job,cb){
              console.error(err);
              return cb(err);
           }
        },
        {
            name: 'DoSecondTask',
            body: function(job,cb){
              console.log('second task finished');
              return cb(null);
            }
        }
    ]
}

Running this job produces the following console.log output consistently: 1 2 3 4 5 6 i should be done now 7 i should be done now 8 i should be done now 9 i should be done now 10 i should be done now task timeout error {"name":"workflow-runner","runner_uuid":"7075b17b-c198-40b8-8daa-a4ad7741d258","hostname":"my-host","pid":9264,"level":40,"job":{"uuid":"9335b419-79d1-40e3-9a20-453664b45ade","name":"TestWorkflow","execution":"failed"},"runtime":51437,"msg":"job completed","time":"2015-09-15T18:25:51.199Z","v":0}

I would expect the first invocation of cb(null) following previous timeouts of the same task to have stopped retrying and continued down the chain. Any idea what I'm doing wrong, or is this a bug?

ghost commented 9 years ago

Looks like the bug is here:

    function retryTask(cb) {
        retries += 1;

        // Set the task timeout when given:
        if (timeout) {
            clearTaskTimeoutId(taskTimeoutId);
            // Task timeout must be in seconds:
            taskTimeoutId = setTimeout(function () {
                retryTimedOut = true;
                return onRetryError('task timeout error', cb);
            }, timeout);
        }

        return body(job, function (err, res) {
            if (retryTimedOut) {
                retryTimedOut = false;
                return null;
            }

If you ever hit the timeout and retryTimedOut ever gets set to true, there's nothing to initialize it back to false for every subsequent retry, thus every non-error invocation of the callback hits the return null in the first lines of body whether that particular retry has timed out or not.

I just want to confirm whether there's some reason for this to work this way before submitting a PR to fix.

kusor commented 9 years ago

Shouldn't your task body return cb(null, ...) also when job.foo <= 5?. Is intended to hit a timeout with such approach?

kusor commented 9 years ago

Also, if that's the case, I think it's perfectly fine for the job to keep retrying until it reaches the given number of retries, despite of the reason for the task failure, timeout included.

ghost commented 9 years ago

No, the example I'm giving is simulating a case where the body action doesn't complete within the specified task timeout. I'm not sure why I would want the job to keep retrying up to the max retries when any one of those retries succeeded and invoked the success callback. I want it to stop retrying and move along.

job.foo <= 5 is just a condition to trigger a timeout; when it goes over 5 that's when the function should invoke the callback. Just pretend that's some kind of long-running process that exceeds the timeout five times and the sixth attempt is finally successful.

By only resetting retryTimedOut post-invocation of the callback, assuming it happens within the timeout period (or at all), it never gets reset properly back to false at the beginning of each successive retry. Thus each retry task run begins its life with the previous invocation's timeout state, which I don't think is correct. In any case, retryTimedOut is an attribute of one specific invocation of retryTask, not an attribute of the overall task.

Again for clarity, this case only illustrates a task run timing out and never invoking the callback, then subsequently succeeding on retry.

kusor commented 8 years ago

Ok, sounds good now :-). Would you mind to modify the pull request for #146 to just include a single commit before we merge?. It's a little bit too much 4 commits for such scoped change. Thanks in advance!

ghost commented 8 years ago

I just created a new PR, as I couldn't figure out how to remove commits from the old one.