HangfireIO / Hangfire

An easy way to perform background job processing in .NET and .NET Core applications. No Windows Service or separate process required
https://www.hangfire.io

Performance degradation when using a large number of throttled jobs #1921

Open mmurrell opened 2 years ago

mmurrell commented 2 years ago

I have found a challenging use-case when using Throttled jobs, and I fear it may be by design.

I have a queue, which currently runs 20 threads on 2 servers, for a max of 40 concurrent jobs. Generally, all the tasks in this queue run fairly fast so I throw everything in there. I use throttling to limit jobs by type, so one task does not preempt other waiting tasks. But now I have a new use-case where I'd like to schedule 100,000 jobs, and have them limited to 10 running concurrently, and just work them throughout the day.

What I've found, though, is that when jobs are limited by the semaphore, they get put back into the "Scheduled" bucket for a retry. Every polling interval, the jobs cycle from Scheduled -> Enqueued -> Limited by the Semaphore -> Scheduled, and the time it takes to make all these state changes causes the smaller, non-throttled jobs to get backed up. This is exceptionally slow with SQL Server storage, and still painful even with the speed Redis storage offers.

The most trivial option is to move this workload to its own queue, but I previously found it challenging to run so many different queues, with different worker counts, each requiring its own BackgroundJobServer and the connections that come with it. Because of that pain, we purchased the 'Pro' version, moved everything to a single wider queue, and use throttling to govern our jobs. I'd really like to see throttling perform well without carving this workload out into its own queue.

I was curious whether you have ever attempted to put throttled jobs into their own state, 'Blocked' or 'Throttled', where they wouldn't clog up the 'Scheduled' state. Then, whenever a job with a throttle completes, a filter could pull the next job from the 'Blocked' state and put it in 'Enqueued'. Changing this from a polling mechanism to an event-driven mechanism would probably increase throttled job throughput while still enforcing the limits. I don't know if this is even possible, but I'd be willing to attempt it if you might consider a pull request?
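
For what it's worth, here is a rough sketch of what such a 'Throttled' state could look like, built on Hangfire's public IState interface. The ThrottledState class and its SemaphoreId property are hypothetical, not an existing API, and the event-driven filter that would promote jobs out of this state when a semaphore slot frees up is not shown:

using System.Collections.Generic;
using Hangfire.States;

// Hypothetical state for jobs waiting on a semaphore, so they would not
// cycle through "Scheduled" on every polling interval.
public class ThrottledState : IState
{
    public static readonly string StateName = "Throttled";

    public ThrottledState(string semaphoreId)
    {
        SemaphoreId = semaphoreId;
    }

    // The throttler this job is waiting on.
    public string SemaphoreId { get; }

    public string Name => StateName;
    public string Reason => $"Waiting for semaphore '{SemaphoreId}'";
    public bool IsFinal => false;
    public bool IgnoreJobLoadException => false;

    public Dictionary<string, string> SerializeData()
    {
        return new Dictionary<string, string>
        {
            { "SemaphoreId", SemaphoreId }
        };
    }
}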

Here is a quick repro case, starting with the package versions that were used in the test below.

    <PackageReference Include="Hangfire.Core">
      <Version>1.7.19</Version>
    </PackageReference>
    <PackageReference Include="Hangfire.Pro.Redis.SEv2">
      <Version>2.8.10</Version>
    </PackageReference>
    <PackageReference Include="Hangfire.Throttling">
      <Version>1.3.0</Version>
    </PackageReference>

Here is some sample code:

_highAvailabilityServer = new BackgroundJobServer(new BackgroundJobServerOptions
{
    Queues = new[] { BackgroundQueues.HighAvailability },
    WorkerCount = 20,
});

// Creates some manually triggered jobs
public static void Setup() 
{
    var throttlingManager = new ThrottlingManager(JobStorage.Current);
    // Semaphores cap how many jobs of each type may run at once
    throttlingManager.AddOrUpdateSemaphore("SlowyMcSlowface", new SemaphoreOptions(10));
    throttlingManager.AddOrUpdateSemaphore("NormalBusiness", new SemaphoreOptions(4));
    RecurringJob.AddOrUpdate("QueueLongRunningTask", () => QueueLongRunningTask(), Cron.Never(), null, BackgroundQueues.HighAvailability);
    RecurringJob.AddOrUpdate("QueueFastTask", () => QueueFastTask(), Cron.Never(), null, BackgroundQueues.HighAvailability);
}

// Note: a shared Random isn't thread-safe, but it's fine for this rough repro
public static Random rng = new Random();

// Floods the shared queue with 2,000 long-running, semaphore-throttled jobs
public static void QueueLongRunningTask()
{
    var bjc = new BackgroundJobClient();
    foreach (var dummy in Enumerable.Range(0, 2000))
        bjc.Create(() => LongRunningTask(), new EnqueuedState(BackgroundQueues.HighAvailability));
}

// Queues a handful of fast jobs that should normally start almost immediately
public static void QueueFastTask()
{
    var bjc = new BackgroundJobClient();
    foreach (var dummy in Enumerable.Range(0, 10))
    {
        var x = rng.Next(0, 5);
        bjc.Create(() => FastTask(x), new EnqueuedState(BackgroundQueues.HighAvailability));
    }
}

[Semaphore("SlowyMcSlowface")]
public static void LongRunningTask() => Thread.Sleep(8000 + rng.Next(0, 10000));

[Semaphore("NormalBusiness")]
[Mutex("normal:{0}")]
public static void FastTask(int x) => Thread.Sleep(500 + rng.Next(0, 2000));

To reproduce, trigger the 'QueueLongRunningTask' recurring job and allow it to move the tasks to Scheduled. You may choose to do this more than once to make the queue bigger. Then manually trigger the 'QueueFastTask' recurring job once or twice. Find and view the job details for any FastTask job. You should see that they get "stuck" behind the LongRunningTask jobs, and even when there are worker threads available, they wait several minutes before being picked up.

If you need more details, I would be happy to share relevant logfiles. Thanks for your consideration.

odinserj commented 2 years ago

An event-driven architecture for semaphores and mutexes requires the lock feature to be implemented in transaction scope (currently locks are supported only at the connection level). Without this, it would be possible for a lock to be released before its background job had fully completed, and a retry would cause such a background job to be performed outside of its semaphore or mutex. So causality could be violated even in regular cases, making throttlers less usable.
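
To illustrate with the current connection-level API (a rough sketch of the problem, not the actual throttler implementation):

using System;
using Hangfire;
using Hangfire.Storage;

class LockScopeSketch
{
    static void RunThrottledJob()
    {
        using (var connection = JobStorage.Current.GetConnection())
        using (connection.AcquireDistributedLock("semaphore:my-resource", TimeSpan.FromSeconds(5)))
        {
            // ... perform the background job ...
        }
        // The lock is released here, when the connection-level scope ends.
        // The state transition that records the job as finished is committed
        // separately, so a retry issued in between can run the job again
        // outside of its semaphore or mutex, violating causality.
    }
}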

Support for transaction-scoped locks is already committed to the dev branch and will be released with Hangfire 1.8.0. Then, once this feature is implemented in official storages, it will be possible to implement event-driven architecture for throttlers. But please note Hangfire 1.8 will be released in autumn, and I'm afraid there's no quick alternative to this.

Also, 1.8.0 will make it possible to avoid one state transition in the case of re-scheduling, reducing overall delays – the Scheduled -> Enqueued -> Processing state transition can be replaced with a direct Scheduled -> Processing transition in this version when the queue is specified explicitly.
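
For example, something like the following should be possible, though the exact method shapes may change before the release:

using System;
using Hangfire;

// Possible 1.8-style usage: the queue is specified at creation time, so the
// delayed job scheduler can move the job straight from Scheduled to
// Processing without the intermediate Enqueued hop.
BackgroundJob.Schedule(
    "high_availability",                        // queue specified explicitly
    () => Console.WriteLine("Throttled work"),
    TimeSpan.FromMinutes(5));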

masbed commented 2 years ago

Any news on this? We are facing the same issue. We need to enqueue a job for each item in a batch of about 35k items, and after doing some testing I'm quite worried about how that would affect performance, even though we are using Redis. Right now I'm leaning towards scheduling the items instead of enqueueing them, spreading them out over time manually, and then potentially adding a semaphore as a failsafe in case a handful take much longer than average, while planning for the spread to be wide enough that the semaphore really shouldn't be needed. It's a pretty clunky solution, and because of the polling behavior, the expected processing time, and the desired load, I would need to decrease the SchedulePollingInterval to 5 seconds or even lower, and I'm still not sure how that would affect behavior.
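
Something like this sketch is what I have in mind; ProcessItem, the item count, and the 6-hour window are just placeholders:

using System;
using System.Linq;
using Hangfire;

public static class BatchSpreader
{
    // Placeholder for the real per-item job.
    public static void ProcessItem(int itemId) => Console.WriteLine(itemId);

    // Spreads ~35k items over a fixed window by scheduling each one at its
    // own offset, instead of enqueueing the whole batch at once.
    public static void SpreadBatch()
    {
        var client = new BackgroundJobClient();
        var items = Enumerable.Range(0, 35_000).ToList();
        var window = TimeSpan.FromHours(6);
        var spacing = TimeSpan.FromTicks(window.Ticks / items.Count);

        for (var i = 0; i < items.Count; i++)
        {
            var itemId = items[i];
            client.Schedule(() => ProcessItem(itemId), TimeSpan.FromTicks(spacing.Ticks * i));
        }
    }
}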