HangfireIO / Hangfire

An easy way to perform background job processing in .NET and .NET Core applications. No Windows Service or separate process required
https://www.hangfire.io

Job creation resiliency #1434

Open barclayadam opened 5 years ago

barclayadam commented 5 years ago

If the database server is under load, the connection fails, or the connection fails for any number of other reasons, the job will not be queued and an exception is thrown.

Would it be sensible for Hangfire itself to be more resilient to these failures? Either by providing built-in retry support as a minimum (something configured once for a client, instead of application code having to provide the retry mechanism everywhere), or, perhaps more interestingly, a store-and-forward mechanism.

Issue https://github.com/HangfireIO/Hangfire/issues/820 mentions handling the BackgroundJobClientException manually, but this feels like a core concern given the distributed-by-nature functionality of Hangfire. Perhaps pluggable, perhaps dependent on the individual transport (e.g. SQL vs. Redis)?
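For illustration, the manual approach from #820 might look roughly like this (a hypothetical sketch using the Polly library; the exception filter, retry count, and delays are arbitrary choices on my part, not anything Hangfire prescribes):

```csharp
using System;
using Hangfire;
using Polly;

public static class ResilientEnqueue
{
    // Retry transient job-creation failures with exponential backoff.
    // These parameters are illustrative assumptions.
    private static readonly Policy RetryPolicy = Policy
        .Handle<BackgroundJobClientException>()
        .WaitAndRetry(
            retryCount: 3,
            sleepDurationProvider: attempt =>
                TimeSpan.FromSeconds(Math.Pow(2, attempt)));

    public static string Enqueue(
        System.Linq.Expressions.Expression<Action> methodCall)
    {
        // Note: a timed-out attempt may still have created the job
        // server-side, so this can enqueue duplicates.
        return RetryPolicy.Execute(() => BackgroundJob.Enqueue(methodCall));
    }
}
```

The duplicate-creation caveat in the comment is exactly why a built-in solution would be preferable to everyone hand-rolling this.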

odinserj commented 5 years ago

Timeout is the most interesting failure: it happens often in cloud environments, and when it happens it is quite hard to understand whether the background job was created or not – in the former case we must not create yet another background job.

I didn't implement this feature before, directly as part of the IBackgroundJobFactory implementation, because I was thinking it requires an idempotence key to be added first so as not to create the background job twice – in the case of a timeout we don't even have a background job identifier.

But since a background job is created in a two-step process, and no one can see the background job before the second step has completed successfully, we can implement this behavior transparently, depending on where the fault happened.

However, it's possible that the transaction.Commit call fails but some side effects were still committed. And theoretically there may be problems with skipped job continuations.
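To sketch what the transparent version of this could look like, here is a rough decorator over IBackgroundJobFactory (the interface surface is simplified, and this deliberately ignores the hard case above – a fault during the commit step, where retrying might duplicate the job or its side effects):

```csharp
using System;
using System.Threading;
using Hangfire;
using Hangfire.Client;

// Illustrative only: retries the creation step a few times before
// giving up. Safe retries are only those where no job identifier
// was assigned yet, i.e. before the second (commit) step.
public class RetryingJobFactory
{
    private readonly IBackgroundJobFactory _inner;
    private readonly int _maxAttempts;

    public RetryingJobFactory(IBackgroundJobFactory inner, int maxAttempts = 3)
    {
        _inner = inner;
        _maxAttempts = maxAttempts;
    }

    public BackgroundJob Create(CreateContext context)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return _inner.Create(context);
            }
            catch (Exception) when (attempt < _maxAttempts)
            {
                // Linear backoff between attempts; an assumption,
                // not a recommendation.
                Thread.Sleep(TimeSpan.FromSeconds(attempt));
            }
        }
    }
}
```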

barclayadam commented 5 years ago

Thanks for your response. It certainly seems like a simple retry mechanism would be a relatively straightforward way to introduce some amount of resiliency. It could handle the occasional blip in connectivity, which is pretty common in a cloud environment.

If this gets implemented, should it also be considered how something like store-and-forward could be integrated, whether built-in or as a 3rd-party extension?

Although an in-memory retry should alleviate many issues, what if Hangfire's dependencies are completely down for a period of time? For example, we have a dedicated database for our Hangfire instance. Our site could work completely without the background process if we were able to temporarily store the tasks locally, to be pushed to Hangfire and retried later.
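To make that idea concrete, a store-and-forward layer might look roughly like this (entirely hypothetical: `ILocalOutbox` and `IHandler<T>` are assumed application-side abstractions, not Hangfire APIs):

```csharp
using System;
using Hangfire;

// Hypothetical application-side durable store for job requests.
// A separate background loop would read saved entries and replay
// them via IBackgroundJobClient once Hangfire storage is reachable.
public interface ILocalOutbox
{
    void Save(string jobType, string payload);
}

public interface IHandler<T> { void Handle(T payload); }

public class StoreAndForwardClient
{
    private readonly IBackgroundJobClient _client;
    private readonly ILocalOutbox _outbox;

    public StoreAndForwardClient(IBackgroundJobClient client, ILocalOutbox outbox)
    {
        _client = client;
        _outbox = outbox;
    }

    public void Send<T>(T payload) where T : class
    {
        try
        {
            // Normal path: push straight to Hangfire storage.
            _client.Enqueue<IHandler<T>>(h => h.Handle(payload));
        }
        catch (BackgroundJobClientException)
        {
            // Hangfire storage is down: persist locally and move on,
            // so the site keeps working without the background process.
            _outbox.Save(
                typeof(T).FullName,
                System.Text.Json.JsonSerializer.Serialize(payload));
        }
    }
}
```

The trade-off is that replayed jobs lose their original ordering and may be delayed arbitrarily, which is why this probably belongs in an extension rather than in core.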