HangfireIO / Hangfire

An easy way to perform background job processing in .NET and .NET Core applications. No Windows Service or separate process required
https://www.hangfire.io

A way to pause/stop jobs from being processed #225

Open nat1craft opened 9 years ago

nat1craft commented 9 years ago

Routinely we get into the situation where we want a specific server to temporarily stop processing jobs. We don't want to stop jobs from queuing up or from being processed by other servers; we just want to take a specific server out of the equation.

Example: Sometimes I have a production server and a development server pointing to the same Hangfire database so we can troubleshoot issues. (I know, I know...) I would like to pause the production server (or maybe pause the dev server) from processing jobs and let the other server handle them. And I don't want to restart the servers in order to do that.

Another Example: Maybe we plan to take a server down for maintenance so we pause it from processing new jobs, wait for it to finish current jobs, and then take it down.

Is there a way to programmatically stop the site from processing jobs?

odinserj commented 9 years ago

You can run the static server instance and call its Stop/Start methods. But currently it is possible only with application logic (no built-in support).

guillaumeroy1 commented 9 years ago

Good to know, it would be nice to add the button in the admin panel.

BKlippel commented 8 years ago

Why are these methods deprecated? We need the exact same functionality. Can anyone confirm if they still work? I read in another post that the functionality was disabled.

odinserj commented 8 years ago

Yes, those methods were deprecated. Simply dispose and recreate an instance of the BackgroundJobServer class to use this functionality.
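
A minimal sketch of that pattern (the field and method names are mine; this assumes JobStorage.Current has already been configured at startup):

    private BackgroundJobServer _server = new BackgroundJobServer();

    public void PauseProcessing()
    {
        // Dispose stops fetching new jobs and waits up to ShutdownTimeout,
        // then aborts whatever is still running
        _server?.Dispose();
        _server = null;
    }

    public void ResumeProcessing()
    {
        // Recreating the server resumes fetching from the configured storage/queues
        _server = new BackgroundJobServer();
    }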

BKlippel commented 8 years ago

But disposing the BackgroundJobServer triggers cancellation and aborts all currently running jobs. We need new jobs to not start and all jobs already executing to complete. This is the only graceful way to prepare a system with long-running jobs for shutdown and code update. It's critical for production systems to have a methodology for draining queues.

guillaumeroy1 commented 8 years ago

I still think the on/off button in the admin panel is a must have. When we deploy a new version of the code, I would like that setting to be remembered, and I would like to restart the jobs when the code is ready to go live.

@odinserj do you think it could be implemented?

BKlippel commented 8 years ago

Anything? We would gladly contribute source modifications if they would be accepted, otherwise we're close to just creating our own fork.

Thanks, Brian

gandarez commented 8 years ago

I totally agree with @BKlippel

gandarez commented 8 years ago

What's the status of this issue? I had a problem today in production and needed to manually stop the server!!!

PJTewkesbury commented 8 years ago

I agree, A Pause/Resume button on the dashboard would help me out greatly. Can you please include this feature. Thanks.

JefRH1 commented 8 years ago

I would like to see this as well.

CREFaulk commented 8 years ago

I agree with the other posters. If I need to restart the server for some reason there's no graceful way to do it if jobs are in progress.

A way to pause the worker processes, or temporarily set the server to have a count of 0 worker processes would do the trick if it let the existing processing complete.

Has anyone implemented anything to accomplish this on their own?

alastairtree commented 8 years ago

+1

wobbince commented 8 years ago

+1

CREFaulk commented 8 years ago

Here's a crude workaround: add a wait queue as the first priority and fill it with jobs which do nothing but run until a condition is met. Periodically check for a stored hash value to change and keep running the filler jobs until it does, one filler job per worker. Add an API to pause and unpause processing. If the pause state is in the stored hash, then generate wait jobs when the service starts; but perhaps they'd still be requeued automatically if they were running at shutdown anyway.

I know, this really is crude. :-) I don't see the queues as more than an attribute or filter in any of the existing classes and the db schema only has queues defined as lists of jobs with nothing for the overall queue attributes.

Alternatively perhaps the worker class could be extended so that workers can be put into a paused state or reduced to 0? There are sleep methods under BackgroundProcessContext but no documentation on what they do.

Edit: The IMonitoringAPI can be used to retrieve server details but changing the WorkersCount does nothing. It was worth a shot. :-)

Final edit: Creating a sleeper class and filling the workers with that did what I needed. I store a pause status in a list and put a flag in cache. The sleepers use Thread.Sleep until the cache flag is gone and then quit. When I restart, I re-create the cache flag if the stored flag exists, and that does the job. At any rate, it prevents database access and active processing when I want to shut down.
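
Roughly, the sleeper looks like this (a sketch only; PauseFlag stands in for whatever cache/hash lookup holds the pause status):

    using System;
    using System.Threading;

    public static class SleeperJob
    {
        // One of these is enqueued per worker; each occupies a worker slot
        // until the pause flag disappears, then the job completes normally.
        public static void OccupyWorkerWhilePaused()
        {
            while (PauseFlag.Exists())   // hypothetical cache/hash lookup
            {
                Thread.Sleep(TimeSpan.FromSeconds(5));
            }
        }
    }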

markalanevans commented 7 years ago

+1 Without being able to let all your workers finish up and then stop taking new jobs, it's hard to do a graceful deployment.

gandarez commented 7 years ago

+1

tomasfalt commented 7 years ago

+1

dlazzarino commented 7 years ago

+1

danielcor commented 7 years ago

+1 - this would be very useful in our environment.

cottsak commented 7 years ago

@odinserj This thread seems to suggest there is no consideration in Hangfire for a deployment solution for jobs in progress. I reject this purely on the age and maturity of this platform. What is the official best practice for production deployments and ensuring data consistency?

cottsak commented 7 years ago

But disposing the BackgroundJobServer triggers cancellation and aborts all currently running jobs. We need new jobs to not start and all jobs already executing to complete. This is the only graceful way to prepare a system with long-running jobs for shutdown and code update. It's critical for production systems to have a methodology for draining queues.

@BKlippel I'm not sure I agree. If Hangfire guarantees to run a job "exception free", then any state that would cause a problem if it became inconsistent should be designed with ACID principles in mind, right?

So the way I see it is that if my jobs are designed with ACID/transaction protection in mind (for those that need it), then if their threads are killed mid-processing, Hangfire will re-queue and execute said job again after the deployment. In that case the failed/incomplete invocation won't leave an inconsistency because it's been designed not to. Is there still a problem here?

cottsak commented 7 years ago

As for a "Stop/Start" feature or workaround, one might look at the design choice and @odinserj's suggestion to dispose and re-create the BackgroundJobServer as "making this harder" but I don't think so. I think the design reveals that you simply don't need to.

Pausing one job or a specific subset is one thing: use filters or custom logic in your job entry point. But a top-level/BackgroundJobServer "Stop/Start" feature makes no sense if it's to solve the deployment concern. You simply don't need it. Every web node will shut down its BackgroundJobServer when recycled, and Hangfire guarantees the re-queuing and invocation of jobs that didn't complete or that failed. All sensitive state will be designed with ACID principles, so the work missed mid-deployment will be completed after the deployment.

Let's address the OP's original concrete scenarios:

Example: Sometimes I have a production server and a development server pointing to the same Hangfire database so we can troubleshoot issues. (I know, I know...) I would like to pause the production server (or maybe pause the dev server) from processing jobs and let the other server handle them. And i don't want to restart the servers in order to do that.

Here you're wanting to essentially "take one server out of the load", like you might with a load balancer. I believe filters can achieve this. It's not built in, but you can do it and filter a server out of consuming from a specified queue.
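
Besides filters, one coarse approximation (a sketch of mine, not a built-in pause; "parked" is just a queue name that nothing enqueues to) is to start the server you want out of the load against only that queue:

    // Requires: using Hangfire;
    var options = new BackgroundJobServerOptions
    {
        // This server only fetches from "parked", which no client enqueues to,
        // so it effectively sits idle while other servers keep processing "default".
        Queues = new[] { "parked" }
    };
    var idleServer = new BackgroundJobServer(options);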

Another Example: Maybe we plan to take a server down for maintenance so we pause it from processing new jobs, wait for it to finish current jobs, and then take it down.

What does "maintenance" mean or why does "maintenance" require the need to stop processing jobs? Because if it's a deployment scenario we've covered that above. Usually a "maintenance mode" flag is designed to allow the production environment to run but prevent state change/input from regular users. Maybe admins want to try something with no other changes happening at the same time. Well in these cases, I'd say largely that there would only be a subset of background job types that would be affected by this. Many background jobs won't mutate anything while there is no user initiated mutation. However some will. For that subset, use filters or other techniques suggested to "pause" just those. Many folks won't have this category of job at all so I can see why the application of pragmatism has prevented a root level "pause" feature of all jobs.

dmartensson commented 7 years ago

We also see the need for a "Stop accepting new jobs" option.

We run Hangfire as a standalone server in a cluster of currently 2 servers, not dependent on IIS or ASP.NET, so recycles are of no concern for us, and many processes we plan to move to Hangfire are today running as standalone services.

Most are not designed to handle thread-abort problems, and redesigning them is in some cases impractical due to the nature of the job; graceful shutdown requires the jobs to finish writing processed data (we are calling an external API and need to log the results reliably).

Hangfire kills the threads 12-18 seconds after calling Dispose, which can be too little time, as we have to wait for the external API return value and then commit that to the database.

Since our goal is to be able to shut down the servers on a rolling schedule and never disable the service completely, we cannot use any scheduling magic, since we cannot prevent the local server from picking up new jobs without preventing all other servers from picking them up.

We were thinking of using a hack based on disposing the server without closing the process, which allows running jobs to continue, but since this also re-queues the jobs directly, another server might pick up the job, causing unwanted concurrent execution.

Currently our only option seems to be to try to reduce the amount of state we keep to minimize data loss in case of shutdown, but we cannot prevent it reliably which is a shame.

s123klippel commented 7 years ago

OK, it's been a while, and I will happily tell everyone what we have done to mostly work around these issues.

We now deploy into 2 queue "channels", call them "A" and "B" if you like. This isn't perfect, as you still need to account for the spill-over, as it were, and not overwrite an active channel that has not finished draining. However, what it does allow is for a significant code change to divert new work to the alternate channel. Our channels are defined in our connection strings; we have a Hangfire DB per channel. When we deploy (we use Octopus), if we need to update code without interrupting jobs that are currently long-running, we deploy to the next (or simply alternate) channel. Prior jobs keep running, new jobs take advantage of the new code. Of course our deployment synchronizes the submission channel of the web servers with that of the background services. We configure our services so they are named by channel and coexist on the relevant app servers. The old channels can then drain, or be shut down.

Do with this as you may, but keep in mind there's not really a pure "exception free" queue. Queues need to be cancelable and drainable while also being durable across code updates. If we can't pause a queue without interrupting jobs in motion, then we need an alternative. Sergey has created something truly awesome here; you're just missing some of the fine print. You have built for "development", but this is ultimately a "devOps" tool. This is competing with the likes of RabbitMQ, and that's a big deal. The suggestion for pause is not that jobs would be interrupted, just that the queue would stop being queried and would drain. Matt, I appreciate what you are trying to explain, but you aren't considering the consequences of completely restarting jobs when the queue restarts. In very simple cases it's probably not a big deal. But there are many cases, introduced by using a product as flexible as this, where updated code will no longer be compatible with the data of a previously submitted job. In that case ACID fails anyway. You would want the ability to segregate that queue. That's a tall order, so what people have been asking for is more simply the option to pause and drain a queue, holding new submissions (maybe even versioning them) for a new queue processor to be introduced soon after. I don't know what to say if you don't see this outcome; it's fairly common. My rough workaround solves this (use as many channels as you need), but it would be a nice feature for the queue to support intrinsically.
I'll just conclude with: good job Sergey, we still love it.

blyry commented 7 years ago

@s123klippel @dmartensson that's pretty much what we did as well. We centralized a lot of console app scheduled task type batch jobs into hangfire, but many of them are long running (30-40 minutes) worth of data munging. We could rewrite them to support checkpoints, break them into smaller tasks...or do graceful restarts of hangfire. Graceful restarts seemed much easier ;)

On app startup after deployment we read the deployed 'environment', which can really be any string; we only deploy once or twice a day, so it just alternates between blue and green. So if the current deployment is blue, the application instance then listens for jobs on the blue queue.

The second step is to only enqueue jobs to the 'active' queue, which is stored in the database and in cache, and toggled after a deployment. Each deployment has 2 instances, for a total of potentially 4 nodes connected to the Hangfire database. With Hangfire, every node connected to the database will pick up and try to enqueue 'scheduled' jobs, so there has to be a central source of truth for what queue the scheduled jobs go to.

The environment switching, plus deploying to a new server, lets us gracefully drain the old deployment before shutting it down, and works really well. The only issue we run into is this guy -- https://discuss.hangfire.io/t/failed-can-not-change-the-state-of-a-job-to-enqueued-target-method-was-not-found/122 -- because of the mechanism for enqueueing scheduled jobs, an old node will try to schedule a brand new job that doesn't exist in the previous codebase, and it will fail once or twice before the new deployment picks it up for scheduling.

    // Requires: using System; using System.Linq; using Hangfire; using Hangfire.States;

    // Startup configuration (envSwitcher comes from the app's container)
    Hangfire.GlobalConfiguration.Configuration.UseFilter(new HfEnvServerFilter(envSwitcher));
    var bjso = new BackgroundJobServerOptions
    {
        Queues = new[] { Environment.MachineName.ToLower(), WebConfiguration.AppSettings["DeployedEnvironment"] }
    };
    app.UseHangfireServer(bjso);
    addAndUpdateScheduledJobs();

    public class HfEnvServerFilter : IElectStateFilter
    {
        // IEnvironmentSwitcher pulls the 'active' environment from cache and db; it is toggled after a new deploy
        private readonly IEnvironmentSwitcher _envSwitcher;

        public HfEnvServerFilter(IEnvironmentSwitcher envSwitcher)
        {
            _envSwitcher = envSwitcher;
        }

        public void OnStateElection(ElectStateContext context)
        {
            // Only interested in jobs that are about to be enqueued
            var enqueuedState = context.CandidateState as EnqueuedState;
            if (enqueuedState == null)
                return;

            // Flaky cache? Fall back to the currently deployed environment in those situations.
            var activeEnvironment = (_envSwitcher.TryGetActiveEnvironment() ?? _envSwitcher.TheCurrentEnvironment).ToString().ToLower();

            // Support our custom queue name attribute
            var queueNameAttributes = context.BackgroundJob.Job.Method.DeclaringType.CustomAttributes
                .Union(context.BackgroundJob.Job.Method.CustomAttributes)
                .Where(attr => attr.AttributeType == typeof(Utility.Hangfire.RunOnQueueAttribute))
                .SelectMany(attr => attr.NamedArguments)
                .Where(arg => arg.MemberName == "QueueName");

            if (queueNameAttributes.Any())
            {
                enqueuedState.Queue = queueNameAttributes.Last().TypedValue.Value.ToString();
            }
            else
            {
                enqueuedState.Queue = activeEnvironment;
            }
        }
    }
dmartensson commented 7 years ago

@blyry That does not solve how to trigger cancellation gracefully (without killing threads), and also, jobs that are requeued by Hangfire (per this bug: https://github.com/HangfireIO/Hangfire/pull/502) always get queued on the default queue, so wouldn't they not be picked up at all?

blyry commented 7 years ago

The filter I posted takes care of that and seems similar to the workaround proposed in #502. And you're right, we don't solve the graceful cancellation problem, but our dual deployment made it unnecessary to gracefully cancel anything. Old deployments run until they are empty and then they are killed.

Sometimes deployments still get recycled / iisreset; for sure this doesn't solve the problem or the need for a better graceful shutdown mechanism, but it's been an acceptable workaround for us. Basic graceful shutdown support would have 2 flavors, right? 1) shut down when finished processing, or 2) shut down as soon as possible. So we technically support 1, but not 2.

dmartensson commented 7 years ago

Think I found a solution to graceful shut-down.

There appears to exist a new replacement for Stop/Start: "SendStop" on the server instance object.

I have tested it and it sets the cancellation token but does not force running threads to abort.

It also makes the local server instance stop picking up new jobs.

So doing this and then waiting until the server instance has no running jobs should make a graceful shut-down possible.

Are my assumptions about SendStop correct, or am I missing something?

SendStop is not marked as deprecated.
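
A sketch of how that could be wired together (server is the BackgroundJobServer instance; note that IMonitoringApi.ProcessingCount() is storage-wide, so with multiple servers this waits for all of them to drain):

    // Requires: using System; using System.Threading; using Hangfire;
    var monitoring = JobStorage.Current.GetMonitoringApi();

    server.SendStop();   // sets the cancellation token; no new jobs are fetched
    while (monitoring.ProcessingCount() > 0)
    {
        Thread.Sleep(TimeSpan.FromSeconds(5));   // wait for in-flight jobs to drain
    }
    server.Dispose();    // nothing should be left running to abort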

blyry commented 7 years ago

Nice, good find! Added in 1.6.0, so available since Jul 15, 2016. Googling for that led me to this thread -- https://discuss.hangfire.io/t/ability-to-stop-all-the-server-instances-during-deployments/2285 -- where @odinserj explained it and sort of what we're trying to do here.

Is it just as easy to start the server back up? Do you have to call Dispose after SendStop, or can you call Start at a later date on the same server instance and everything works fine?

dmartensson commented 7 years ago

@blyry I do not think you can start it again; there is no equivalent SendStart, so I assume it's only for shutting down the server.

And since it is a disposable object you should dispose it, but ending the process will dispose it anyway.

I am building an implementation this week to try it at full scale.

dmartensson commented 7 years ago

Finding the jobs running on the local server proved to be difficult.

I found that BackgroundJobServer takes an option object where I can set ShutdownTimeout, but that still only sets a fixed time.
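
For reference, setting that option looks roughly like this (a sketch; it only widens the fixed grace period used by Dispose, it is not a true drain):

    var server = new BackgroundJobServer(new BackgroundJobServerOptions
    {
        // Running jobs get up to 5 minutes to finish on Dispose instead of the short default
        ShutdownTimeout = TimeSpan.FromMinutes(5)
    });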

Through reflection I managed to get the Task object that is actually running jobs, and thereby I can choose whether to force a close by disposing with the default ShutdownTimeout or to wait in my own code for jobs to finish.

This means that a controlled shutdown is possible for deployment (by triggering close through messaging) while a service close (for example if the machine is shutting down) will wait at most 15 seconds before aborting threads.

This will be sufficient for my purposes, even though it would have been nice to be able to see which jobs were still running locally, since then we would have been able to measure shutdown times for jobs and see which jobs are giving us problems.

grexican commented 6 years ago

+1 can't believe we don't have a solution for this and that the ticket is being ignored.

Some of my jobs take hours to run. I want those to finish, but not pick anything else new up, so I can then update the system. I set status points throughout the process, so when it resumes it doesn't restart the ENTIRE multi-hour process. But even the best breaks in the process still result in hours of redundant calculation if I stop in the middle.

sohaibjaved44 commented 6 years ago

Any update on Pause/Stop? Much-needed functionality.

blyry commented 6 years ago

Our workaround is to use filters and deploy to named environments. We use the git sha of the build, but an incrementing build id would work too. So our current build is r8931-d6cd1e2bd1, and inside our memcache that is marked as the 'active' queue. A filter enqueues new jobs on that 'active' queue. A thread runs alongside each server and checks whether that server is a) active or not, and b) if not 'active' or 'next', whether there are any jobs processing on the queue that matches that server's build/deployment id. So for our previous release, r8930-de1eaf515e, IIS will continue to run until the thread sees there are no in-progress jobs on that queue, then it shuts down the process.

Pause functionality for scheduled/cron jobs was built with two pieces: 1) an aspx page that is just a CRUD view of the 'keys' for the scheduled jobs and a 'pause' bit, stored in SQL Server, and 2) a custom filter that cancels the jobs when Hangfire schedules them for execution, if they are marked as paused in our custom table.

blyry commented 6 years ago

Our pause implementation uses IClientFilter and cancels them, but getting the job name was kind of gross -- if I did it again I would look at creating a custom PausedState that is final and using an IElectStateFilter.

http://api.hangfire.io/html/T_Hangfire_States_IState.htm
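
Something along these lines, as a rough sketch (the state name and serialized data are my choices; the members follow the IState interface on that page):

    using System;
    using System.Collections.Generic;
    using Hangfire.States;

    public class PausedState : IState
    {
        public string Name => "Paused";
        public string Reason { get; set; } = "Paused by an operator.";
        public bool IsFinal => true;                  // final state: nothing picks it up again
        public bool IgnoreJobLoadException => false;

        public Dictionary<string, string> SerializeData()
        {
            return new Dictionary<string, string>
            {
                { "PausedAt", DateTime.UtcNow.ToString("o") }
            };
        }
    }

An IElectStateFilter could then set context.CandidateState to this state for jobs that should be paused.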

IronSean commented 6 years ago

+1, this isn't a priority until jobs are suddenly cascading in failure and growing exponentially, and then it suddenly is.

Flydancer commented 5 years ago

+1, still no development on this? Start/Stop is a must have imho

plaisted commented 5 years ago

Here's a basic way to gracefully drain that I've been using. I added a simple IElectStateFilter that schedules the job with a delay instead of processing it.

    public class PausedWorkerFilter : IElectStateFilter
    {
        private IPauseSource _source;

        public PausedWorkerFilter(IPauseSource source)
        {
            _source = source;
        }
        public void OnStateElection(ElectStateContext context)
        {
            var processing = context.CandidateState as ProcessingState;
            if (processing == null || !_source.IsWorkerPaused())
            {
                return;
            }

            context.CandidateState = new ScheduledState(TimeSpan.FromMinutes(_source.PauseDelayMinutes)) { Reason = "Worker is paused." };
        }
    }

For now my CI/CD just drops a file to disk at the beginning of deployments to trigger the following IPauseSource:

    public class FilePauseSource : IPauseSource
    {
        private FilePauseOptions _options;

        public int PauseDelayMinutes => _options.PauseMinutes;
        public FilePauseSource(IOptions<FilePauseOptions> options)
        {
            _options = options.Value;
        }

        public bool IsWorkerPaused()
        {
            return File.Exists(_options.TriggerFilePath);
        }
    }

It's not the most elegant solution, and if other workers are available to process jobs it's not very efficient (jobs get scheduled with a delay instead of just being run on another worker), but it solves the problem of gracefully draining workers during a deployment and is much simpler than some of the other methods suggested. Be careful if you are using multiple queues, as scheduled jobs lose their queue if you aren't preserving it through an attribute or other method (I use a queue parameter to make the queue sticky).

DaRochaRomain commented 3 years ago

Any update on this? Or is this code still worth using to send a stop signal to all servers before doing a deployment?

zlangner commented 2 years ago

+1 For my purposes, just having a way to signal a specific worker to stop taking new work would be sufficient.

The way we do it is we create new EC2s in AWS with each deploy. So if we had a way to signal the old ones to stop taking new work and then start up the new ones, we could release with minimal downtime, because the existing workers would continue to process what they have and eventually just run out of work and go idle. The new ones would spin up and handle any new requests. Then after some acceptable timeout the release pipeline would take down the old servers.

In short, the mechanism should prevent the worker from taking any new job or processing any retries. It's really finish what you have and do nothing more.

mbalatha commented 1 year ago

+1 Drain mode would be a great feature: put HF server(s) into a state where they stop accepting new jobs while in-progress jobs run to completion. Support through the Dashboard or a CI/CD-friendly API would be a great addition.

hade94 commented 1 year ago

+1 Would love a built in feature like this.

stupied4ever commented 8 months ago

+1 Would love a built in feature like this.

douglasg14b commented 1 month ago

Damn, this feature request is about to enter middle school. Reading through the history, there is a strong need for a stable and consistent way to stop/pause Hangfire jobs during a deployment, one that doesn't just start breaking between versions due to a dependence on a hack or an esoteric API.

We need this during deployments when the application is in maintenance mode; not being able to do graceful deployments if you use Hangfire is really a detractor from using Hangfire.

The forum thread here, https://discuss.hangfire.io/t/ability-to-stop-all-the-server-instances-during-deployments/2285/4, appears to be unresolved, with the core problem remaining: even the kludges don't seem to work reliably.