barclayadam opened this issue 5 years ago
Timeout is the most interesting failure, and it happens often in cloud environments. When it happens, it's quite hard to understand whether the background job was created or not – in the former case we must not create yet another background job.
I didn't implement this feature before as part of the IBackgroundJobFactory implementation directly, because I was thinking it would require an idempotence key to be added first to avoid creating the background job twice, since in case of a timeout we don't even have a background job identifier.
But since a background job is created in a two-step process, and no one is able to see the background job before the second step is successfully completed, we can implement this behavior transparently in the following way, depending on where the fault happened.
CreateExpiredJob – simply try again and call this method a second time. This is safe, because no one except our current thread will see the results of the first attempt, whether it succeeded or not. Even if the job was created during the first attempt, its state will not be initialized, and moreover it will be automatically removed in the future.

Transaction.Commit – on a retry attempt we should get the job details and check its state name. If it's null, then we should retry the whole state change attempt. If it's not null, then our previous attempt succeeded and we shouldn't do anything. We can even avoid taking a distributed lock here, because we'll only act when the current state is null.

However, it's possible that the Transaction.Commit call fails but some side effects were committed. And theoretically there may be problems with skipped job continuations.
Thanks for your response; it certainly seems like a simple retry mechanism would be a relatively straightforward way to introduce some resiliency. It could handle the occasional blip in connectivity, which is pretty common in a cloud environment.
If this gets implemented, should consideration be given to how something like store-and-forward could be integrated, whether it's built in or a 3rd-party extension?
Although an in-memory retry should alleviate many issues, what if Hangfire's dependencies are completely down for a period of time? For example, we have a dedicated database for our Hangfire instance. Our site could work completely without the background processing if we were able to temporarily store the tasks to be pushed to Hangfire and retried later.
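As a purely illustrative sketch of that store-and-forward idea, something like the following could sit in application code in front of IBackgroundJobClient. The StoreAndForwardJobClient type is hypothetical; a real implementation would persist pending jobs somewhere durable (a local database, file or queue) so they survive a restart, and the in-memory queue here only keeps the example short.

```csharp
// Hypothetical application-side wrapper; not part of Hangfire.
using System.Collections.Concurrent;
using Hangfire;
using Hangfire.Common;
using Hangfire.States;

public sealed class StoreAndForwardJobClient
{
    private readonly IBackgroundJobClient _inner;
    private readonly ConcurrentQueue<(Job Job, IState State)> _pending =
        new ConcurrentQueue<(Job Job, IState State)>();

    public StoreAndForwardJobClient(IBackgroundJobClient inner)
    {
        _inner = inner;
    }

    public void Enqueue(Job job)
    {
        var state = new EnqueuedState();
        try
        {
            _inner.Create(job, state);
        }
        catch (BackgroundJobClientException)
        {
            // Hangfire storage is unavailable – park the job locally and let
            // the forwarding loop push it once the storage is back.
            _pending.Enqueue((job, state));
        }
    }

    // Called periodically (e.g. from a timer or hosted service) to drain the
    // local store back into Hangfire.
    public void ForwardPending()
    {
        while (_pending.TryDequeue(out var item))
        {
            try
            {
                _inner.Create(item.Job, item.State);
            }
            catch (BackgroundJobClientException)
            {
                _pending.Enqueue(item); // still down – keep it for later
                break;
            }
        }
    }
}
```

Jobs would be handed to it as Job.FromExpression(() => ...) values. The same caveat about timeouts discussed above applies: a failed call doesn't prove the job wasn't actually created, so duplicates are still possible without an idempotence key.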
If the database server is under load, the connection fails, or the connection fails for any number of other reasons, the job will not be queued and an exception is thrown.

Would it be sensible for Hangfire itself to be more resilient to these failures? Either providing built-in retry support as a minimum (something configured once for a client, instead of application code having to provide the retry mechanism everywhere), or, perhaps more interestingly, a store-and-forward mechanism.
Issue https://github.com/HangfireIO/Hangfire/issues/820 mentions handling the BackgroundJobClientException manually, but this feels like a core concern given the distributed-by-nature functionality of Hangfire. Perhaps pluggable, perhaps dependent on individual transports (e.g. SQL vs Redis)?
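As a rough illustration of the "configured once for a client" option, a decorator over IBackgroundJobClient along these lines could be registered in the application's container in place of BackgroundJobClient, so the usual Enqueue/Schedule extension methods get retries without any per-call code. The RetryingBackgroundJobClient type, the attempt count and the back-off are all assumptions, not anything Hangfire provides.

```csharp
// Hypothetical decorator; not part of Hangfire.
using System;
using System.Threading;
using Hangfire;
using Hangfire.Common;
using Hangfire.States;

public sealed class RetryingBackgroundJobClient : IBackgroundJobClient
{
    private readonly IBackgroundJobClient _inner;
    private readonly int _maxAttempts;

    public RetryingBackgroundJobClient(IBackgroundJobClient inner, int maxAttempts = 3)
    {
        _inner = inner;
        _maxAttempts = maxAttempts;
    }

    public string Create(Job job, IState state) =>
        Retry(() => _inner.Create(job, state));

    public bool ChangeState(string jobId, IState state, string expectedState) =>
        Retry(() => _inner.ChangeState(jobId, state, expectedState));

    private T Retry<T>(Func<T> action)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return action();
            }
            catch (BackgroundJobClientException) when (attempt < _maxAttempts)
            {
                // NOTE: as discussed above, a timeout doesn't tell us whether
                // the job was actually created, so a blind retry can create a
                // duplicate; this sketch ignores that caveat for brevity.
                Thread.Sleep(TimeSpan.FromMilliseconds(200 * attempt));
            }
        }
    }
}
```

With that in place, new RetryingBackgroundJobClient(new BackgroundJobClient()).Enqueue(() => Console.WriteLine("ping")) behaves like a normal client but absorbs transient failures up to the configured number of attempts.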