cargomedia / cm-janus

UNMAINTAINED. cm/janus bridge
MIT License

Introduce backoff for failing jobs #273

Closed tomaszdurka closed 8 years ago

njam commented 8 years ago

Could you do this soon? Maybe Aleksei can help?

We're reaching capacity at Loggly again.

[Screenshot: Loggly usage graph, 2016-07-13 17:31:38]
vogdb commented 8 years ago

I can always help!

tomaszdurka commented 8 years ago

As discussed this should add delay to failing jobs and double that delay every time job fails again (with maximum of 1 minute).
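The doubling behaviour described here could be sketched as follows (a minimal illustration with hypothetical names, not the actual cm-janus implementation):

```javascript
// Hypothetical sketch: retry delay that doubles on each consecutive
// failure, capped at 1 minute as proposed above.
var INITIAL_DELAY = 1000;  // 1 second before the first retry (assumed)
var MAX_DELAY = 60000;     // 1 minute maximum

function backoffDelay(failureCount) {
  // failureCount is 1 for the first failure, 2 for the second, etc.
  var delay = INITIAL_DELAY * Math.pow(2, failureCount - 1);
  return Math.min(delay, MAX_DELAY);
}
```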

vogdb commented 8 years ago

Well, there is already a 60-second delay:

```javascript
function JobProcessor(workingDirectory, concurrency, retryDelay) {
  //...
  this._retryDelay = retryDelay || 60000; // default: 60 seconds
}
this._processor = new JobProcessor(tempFilesDir);
```
tomaszdurka commented 8 years ago

Seems the delay doesn't work as expected. See multiple jobs with the same job id logged at the same time: [Removed image]

UPDATE: Actually this was a wrong assumption. There were just multiple jobs failing with the same jobData.id, which is stream.id (not the jobId; it seems we have a bug which wipes jobId from janus-context, https://github.com/cargomedia/cm-janus/issues/280).

tomaszdurka commented 8 years ago

retryDelay is already high (60 seconds by default), and introducing backoff would complicate the logic too much. Instead (as discussed with @vogdb) we decided to add a configuration option retryDelay for the job processor.

It might seem that starting with a delay of 5 minutes is a bit harsh, but we don't actually have cases where a job fails and then succeeds; currently there are only cases where it fails forever. Still, let's keep the job failing as a reminder of the problem.

Then we can increase it wherever the logging quota is a problem.
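The configurable retryDelay could look roughly like this (a hypothetical sketch mirroring the snippet quoted earlier; the config shape and paths are illustrative, not from cm-janus):

```javascript
// Hypothetical sketch: pass retryDelay (in milliseconds) through from
// configuration to the job processor instead of hard-coding it.
function JobProcessor(workingDirectory, concurrency, retryDelay) {
  this._workingDirectory = workingDirectory;
  this._concurrency = concurrency || 1;
  this._retryDelay = retryDelay || 60000; // default: 1 minute
}

// e.g. a config entry starting with the harsher 5-minute delay
var config = {jobProcessor: {retryDelay: 5 * 60 * 1000}};
var processor = new JobProcessor(
  '/tmp/jobs', 1, config.jobProcessor.retryDelay
);
```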

vogdb commented 8 years ago

@tomaszdurka please review

tomaszdurka commented 8 years ago

Looks good, but tests are failing.

vogdb commented 8 years ago

@tomaszdurka please look

tomaszdurka commented 8 years ago

lgtm