HangfireIO / Hangfire

An easy way to perform background job processing in .NET and .NET Core applications. No Windows Service or separate process required
https://www.hangfire.io

Allow access to whether a job is running, or the cancellation token has fired #2031

Open dazbradbury opened 2 years ago

dazbradbury commented 2 years ago

Following https://github.com/HangfireIO/Hangfire/issues/2026, it became clear that the following call:

JobStorage.Current.GetMonitoringApi().ProcessingJobs(...);

includes jobs where the cancellation token has fired and an OperationCanceledException has been thrown. In other words, it includes jobs that aren't actually running.

In order to be able to work out, and alert about, any jobs where the cancellation tokens didn't fire, it would be extremely useful if there was a way to obtain only actually running jobs, or filter out jobs that have thrown the OperationCanceledException from the ProcessingJobs list.

This would allow, for example, the ability to ignore certain jobs if they are stopped non-gracefully whilst alerting / flagging other jobs.

There is a current workaround: when a cancellation token fires, Hangfire will log:

Worker stop requested while processing background job 'XXX'. It will be re-queued.

However, this means comparing the running list to the log messages, so it doesn't allow for conditional alerting / logic depending on the true state of the jobs during shutdown/cancellation events. Without being able to see the true running list, the choice is to alert every time the Hangfire server doesn't shut down gracefully, or never alert about it. It's not possible to ignore ungraceful shutdowns on particular jobs, for example.

dazbradbury commented 2 years ago

@odinserj - How feasible do you think it would be to provide a way to obtain the running jobs list, whilst ignoring cancelled jobs?

odinserj commented 2 years ago

Database and application state are always potentially unsynchronised, because we are in an environment where any next line of code may not be executed due to an unexpected process shutdown. It's simply impossible to perfectly synchronise two different physical entities whose state can be queried independently at different points in time.

I see two options for maintaining a more or less synchronised list of processing jobs:

  1. Connect to the concrete application processes. For example, implement the application as an ASP.NET Core web application that runs Hangfire Server and can return the list of background jobs it is currently processing: jobs can report to some application state when a background method starts or stops executing.
  2. Add heartbeat functionality for background jobs, for example by using a server filter that creates an observer thread/task before the job starts and repeatedly reports heartbeats at some interval (e.g. every 30 seconds) while the job is still running. When there have been no heartbeats in the last X seconds, you can consider the method aborted.
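Option 1 could be sketched roughly as follows. This is a minimal illustration, not something Hangfire ships: `RunningJobRegistry` and `TrackRunningJobsAttribute` are hypothetical names, and the registry only reflects the process it lives in, so it disappears if that process dies.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using Hangfire.Common;
using Hangfire.Server;

// Hypothetical in-memory registry (not part of Hangfire).
public static class RunningJobRegistry
{
    private static readonly ConcurrentDictionary<string, DateTime> Running =
        new ConcurrentDictionary<string, DateTime>();

    public static void Started(string jobId) => Running[jobId] = DateTime.UtcNow;

    public static void Finished(string jobId) => Running.TryRemove(jobId, out _);

    // Snapshot of the job ids this process currently believes are executing.
    public static IReadOnlyCollection<string> Snapshot() => Running.Keys.ToArray();
}

// Server filter that reports job start/stop to the registry above.
public sealed class TrackRunningJobsAttribute : JobFilterAttribute, IServerFilter
{
    public void OnPerforming(PerformingContext filterContext) =>
        RunningJobRegistry.Started(filterContext.BackgroundJob.Id);

    // Invoked after the job method returns or throws.
    public void OnPerformed(PerformedContext filterContext) =>
        RunningJobRegistry.Finished(filterContext.BackgroundJob.Id);
}
```

The filter would be registered with something like `GlobalJobFilters.Filters.Add(new TrackRunningJobsAttribute());`, and an ASP.NET Core endpoint in the same process could return `RunningJobRegistry.Snapshot()`.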

Both can be implemented as an extension filter, like CaptureCultureAttribute. Since an ideal implementation with a perfectly synchronised list is not possible anyway, this shouldn't be a part of Hangfire.Core. The ProcessingJobs method currently works like (2) anyway, and aborted jobs will eventually be processed again.
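A heartbeat filter along the lines of option 2 might look like the sketch below. The parameter name `Heartbeat`, the `HeartbeatTimer` item key, and the 30-second interval are assumptions made for illustration, and it relies on `Items` being shared between the performing and performed contexts.

```csharp
using System;
using System.Globalization;
using System.Threading;
using Hangfire;
using Hangfire.Common;
using Hangfire.Server;

public sealed class JobHeartbeatAttribute : JobFilterAttribute, IServerFilter
{
    // Names chosen for this sketch only.
    public const string ParameterName = "Heartbeat";
    private const string TimerKey = "HeartbeatTimer";
    private static readonly TimeSpan Interval = TimeSpan.FromSeconds(30);

    public void OnPerforming(PerformingContext filterContext)
    {
        var jobId = filterContext.BackgroundJob.Id;
        // First beat fires immediately, then once per interval.
        filterContext.Items[TimerKey] =
            new Timer(_ => Beat(jobId), null, TimeSpan.Zero, Interval);
    }

    public void OnPerformed(PerformedContext filterContext)
    {
        // Stop beating once the job method has returned or thrown.
        if (filterContext.Items.TryGetValue(TimerKey, out var timer))
            ((IDisposable)timer).Dispose();
    }

    private static void Beat(string jobId)
    {
        try
        {
            // The timer runs on another thread, so each beat opens its own connection.
            using (var connection = JobStorage.Current.GetConnection())
            {
                connection.SetJobParameter(jobId, ParameterName,
                    DateTime.UtcNow.ToString("o", CultureInfo.InvariantCulture));
            }
        }
        catch (Exception)
        {
            // A missed beat is tolerated; readers treat a stale value as "aborted".
        }
    }

    // True when no heartbeat was recorded within the timeout.
    public static bool IsStale(string lastBeat, DateTime utcNow, TimeSpan timeout)
    {
        if (string.IsNullOrEmpty(lastBeat)) return true;
        var parsed = DateTime.Parse(lastBeat, CultureInfo.InvariantCulture,
            DateTimeStyles.RoundtripKind);
        return utcNow - parsed > timeout;
    }
}
```

A monitoring job could then read the parameter with `connection.GetJobParameter(jobId, "Heartbeat")` and pass it to `IsStale`. A timeout of two or three intervals leaves room for a delayed write.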

dazbradbury commented 2 years ago

Thanks for your detailed response - I was wondering if it's feasible for the following flow:

1) Process cancellation token is fired, and OperationCanceledException captured by Hangfire for the particular job
2) As currently, this is logged with:

Cancellation token fired and handled by job. Worker stop requested while processing background job 'XYZ'. It will be re-queued.

3) A flag is set in the database to state this job was cancelled
4) When the job is re-queued, this flag is reset

If (3) doesn't happen, then we're in the same state as now, so no harm done. If (4) fails, then much like a job failure it can be re-attempted without any side-effects / harm done.

Now when the call is made to:

JobStorage.Current.GetMonitoringApi().ProcessingJobs(...);

It can also return the state of this flag (as set in the DB), which would determine if a job has been cancelled successfully or not. Does this seem feasible? Whilst there is no guarantee of synchronisation here, it does provide the additional information where possible.
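From the caller's side, consuming such a flag could look roughly like this. The sketch assumes the flag is stored as a job parameter named `Canceled` holding the string `true`. That name and storage location are hypothetical, and nothing in Hangfire writes it by default.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Hangfire;

public static class ProcessingJobsView
{
    // Hypothetical job parameter name; something else must write it.
    public const string FlagName = "Canceled";

    // Keeps only the ids whose flag is not set to "true".
    public static List<string> NotFlagged(
        IEnumerable<string> jobIds, Func<string, string> readFlag) =>
        jobIds
            .Where(id => !string.Equals(readFlag(id), "true",
                StringComparison.OrdinalIgnoreCase))
            .ToList();

    // Processing jobs that have not been recorded as cancelled.
    public static List<string> ProbablyStillRunning(int from = 0, int count = 100)
    {
        var processing = JobStorage.Current.GetMonitoringApi()
            .ProcessingJobs(from, count);

        using (var connection = JobStorage.Current.GetConnection())
        {
            return NotFlagged(
                processing.Select(pair => pair.Key),
                id => connection.GetJobParameter(id, FlagName));
        }
    }
}
```

Because the flag and the processing list are read separately, the result is a best-effort view, not a guaranteed one.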

odinserj commented 2 years ago

Maybe, but it's totally unclear where to write this flag: the ProcessingJobs index doesn't have any place for data, and can't have one in the current implementation and APIs. With the JobParameters table it will be hard to understand which execution the flag relates to. So I don't see any natural solution for this that's general.

For a particular application and use case this can be done with a server filter that intercepts the OnPerformed phase, checks whether there's an OperationCanceledException and the context.Stopping token is activated, and records this flag somewhere, maybe even in the JobParameters table. But there might be cases where a job is reported as canceled but is actually running now.
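A server filter of that kind might be sketched as below. Several parts are assumptions: the `Canceled` parameter name is made up, the shutdown check uses `CancellationToken.ShutdownToken` on the perform context in place of the Stopping token mentioned above, and the exception may arrive either directly or wrapped, so both are checked. The write can itself fail during shutdown, in which case nothing is recorded.

```csharp
using System;
using Hangfire.Common;
using Hangfire.Server;

public sealed class RecordCancellationAttribute : JobFilterAttribute, IServerFilter
{
    // Hypothetical job parameter name used by this sketch.
    public const string FlagName = "Canceled";

    public void OnPerforming(PerformingContext filterContext)
    {
        // A new execution is starting, so clear any flag left by a previous one.
        filterContext.Connection.SetJobParameter(
            filterContext.BackgroundJob.Id, FlagName, "false");
    }

    public void OnPerformed(PerformedContext filterContext)
    {
        var shutdownRequested =
            filterContext.CancellationToken.ShutdownToken.IsCancellationRequested;

        if (IsShutdownCancellation(filterContext.Exception, shutdownRequested))
        {
            filterContext.Connection.SetJobParameter(
                filterContext.BackgroundJob.Id, FlagName, "true");
        }
    }

    // True when the job ended with a cancellation exception during shutdown.
    public static bool IsShutdownCancellation(Exception exception, bool shutdownRequested) =>
        shutdownRequested &&
        (exception is OperationCanceledException ||
         exception?.InnerException is OperationCanceledException);
}
```

As noted above, the flag describes the last execution the filter observed, so a job that has been re-queued and picked up by another worker can briefly show as cancelled while it is actually running.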