HangfireIO / Hangfire

An easy way to perform background job processing in .NET and .NET Core applications. No Windows Service or separate process required
https://www.hangfire.io

Job is executed twice despite DisableConcurrentExecution #1960

Open caveofjulian opened 2 years ago

caveofjulian commented 2 years ago

I am using Hangfire.AspNetCore 1.7.17 and Hangfire.MySqlStorage 2.0.3 for software that is currently in production.

Now and then, we get a report of jobs being executed twice, despite the usage of the [DisableConcurrentExecution] attribute with a timeout of 30 seconds.

It seems that as soon as those 30 seconds have passed, another worker picks up that same job again.

The code is fairly straightforward:

public async Task ProcessPicking(HttpRequest incomingRequest)
{
    var filePath = await StoreStreamAsync(incomingRequest, TriggerTypes.Picking);
    var picking = await XmlHelper.DeserializeFileAsync<Picking>(filePath);
    // delay by 20 minutes so outbound-out gets the chance to be sent first
    BackgroundJob.Schedule(() => StartPicking(picking), TimeSpan.FromMinutes(20));
}

[TriggerAlarming("[IMPORTANT] Failed to parse picking message to **** object.")]
[DisableConcurrentExecution(30)]
public void StartPicking(Picking picking)
{
    var orderlinePickModels = picking.ToSalesOrderlinePickQuantityRequests().ToList();
    var orderlineStatusModels = orderlinePickModels.ToSalesOrderlineStatusRequests().ToList();

    var isParsed = DateTime.TryParse(picking.Order.UnloadingDate, out var unloadingDate);

    for (var i = 0; i < orderlinePickModels.Count; i++)
    {
        // prevents bugs with usage of i in the background jobs
        var index = i;
        var id = BackgroundJob.Enqueue(() => SendSalesOrderlinePickQuantityRequest(orderlinePickModels[index], picking.EdiReference));
        BackgroundJob.ContinueJobWith(id, () => SendSalesOrderlineStatusRequest(
            orderlineStatusModels.First(x => x.SalesOrderlineId == orderlinePickModels[index].OrderlineId),
            picking.EdiReference, picking.Order.PrimaryReference, isParsed ? unloadingDate : DateTime.MinValue));
    }
}

[TriggerAlarming("[IMPORTANT] Failed to send order line pick quantity request to ****.")]
[AutomaticRetry(Attempts = 2)]
[DisableConcurrentExecution(30)]
public void SendSalesOrderlinePickQuantityRequest(SalesOrderlinePickQuantityRequest request, string ediReference)
{
    var audit = new AuditPostModel
    {
        Description = $"Finished job to send order line pick quantity request for item {request.Itemcode}, part of ediReference {ediReference}.",
        Object = request,
        Type = AuditTypes.SalesOrderlinePickQuantity
    };

    try
    {
        _logger.LogInformation($"Started job to send order line pick quantity request for item {request.Itemcode}.");

        var response = _service.SendSalesOrderLinePickQuantity(request).GetAwaiter().GetResult();
        audit.StatusCode = (int)response.StatusCode;
        if (!response.IsSuccessStatusCode) throw new TriggerRequestFailedException();

        audit.IsSuccessful = true;
        _logger.LogInformation("Successfully posted sales order line pick quantity request to ***** endpoint.");
    }
    finally
    {
        Audit(audit);
    }
}

It schedules the main task (StartPicking) that creates the objects required for the two subtasks:

- Send picking details to the customer
- Send a status update to the customer

The first job is duplicated. Perhaps the second one is as well, but that matters less, since it only concerns a status update. The first job, however, causes the customer to think that more items have been picked than in reality.

I would assume that Hangfire updates the state of a job (to e.g. "in progress") and checks this state before starting a job. Is my timeout on DisableConcurrentExecution too low? Is it possible in this scenario that the database call to update the state takes about 30 seconds (to be fair, it is running on a slow server with ~8 GB RAM and 6 vCores), so that the second worker is already picking the job up again?

Or is this a Hangfire specific issue that must be tackled?

AminSojoudi commented 2 years ago

+1 here. We are using Hangfire 1.7.13, Hangfire.AspNetCore 1.7.13 and Hangfire.Mongo 0.7.12. Some jobs run twice across two server instances, where the expected behavior is to run only once. We do not use the DisableConcurrentExecution attribute; is it needed to prevent this kind of behavior? Is this related to Hangfire.Mongo or to Hangfire itself?

caveofjulian commented 2 years ago

> +1 here. We are using Hangfire 1.7.13, Hangfire.AspNetCore 1.7.13 and Hangfire.Mongo 0.7.12. Some jobs run twice across two server instances, where the expected behavior is to run only once. We do not use the DisableConcurrentExecution attribute; is it needed to prevent this kind of behavior? Is this related to Hangfire.Mongo or to Hangfire itself?

To be honest, I have found quite a few bugs in Hangfire by now. The fact that people tell us to use DisableConcurrentExecution should have been an indication not to go with Hangfire; unfortunately, it is now too late to change that choice. I implemented in-memory checks before job execution to avoid the duplicates. But I do realize I am in a privileged situation due to a low amount of traffic (maybe 10K jobs a day), and I do tend to restart the server once a week so it doesn't use too much RAM.

Frazi1 commented 2 years ago

Hangfire is designed to run with at-least-once execution semantics. If it considers your job to be abandoned (for instance, because the processing server was shut down and stopped sending heartbeats), then it will allow the same job to be picked up by another worker.

This behavior highly depends on the storage you're using. With the latest version of SqlServer storage, everything should work like a charm out of the box.

Other storages may have issues. For example, the Postgres storage has an InvisibilityTimeout setting. If your job takes longer than that timeout, it will be repeatedly duplicated by another worker. The solution, therefore, is to increase the timeout.

I don't know much about the MySql storage, but to me it seems really unstable. I recommend searching its documentation for InvisibilityTimeout and playing around with that setting. If that does not help, it's better to switch to another storage.
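For illustration, a minimal sketch of raising that timeout at configuration time. It assumes the MySql storage exposes an InvisibilityTimeout option on MySqlStorageOptions the way the Postgres storage does; verify against the options class of the version you run.

using System;
using Hangfire;
using Hangfire.MySql;

GlobalConfiguration.Configuration.UseStorage(
    new MySqlStorage(
        connectionString, // your MySQL connection string
        new MySqlStorageOptions
        {
            // Assumed option; must comfortably exceed the longest job you run,
            // otherwise the job is considered abandoned and re-fetched.
            InvisibilityTimeout = TimeSpan.FromMinutes(30)
        }));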

flmbray commented 1 year ago

> This behavior highly depends on the storage you're using. With the latest version of SqlServer storage, everything should work like a charm out of the box.

It seems to me that behavior that differs depending on how Hangfire stores the jobs is not good design. It is certainly giving me serious thoughts about dropping Hangfire completely.

It's actually quite trivial to demonstrate the multiple-execution behavior: just set up a recurring job (even with Memory Storage) that executes every minute, and then put in a Thread.Sleep(70 * 1000) so that it takes 70 seconds. A bit of extra logging easily shows that two instances of that job run at the same time, clearly not what "DisableConcurrentExecution" seems to imply.
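For example, a minimal repro sketch (class and job names are illustrative):

using System;
using System.Threading;
using Hangfire;

public class SlowRecurringJob
{
    [DisableConcurrentExecution(30)] // 30 = lock timeout in seconds
    public void Run()
    {
        Console.WriteLine($"Started {DateTime.UtcNow:O}");
        Thread.Sleep(70 * 1000); // deliberately longer than the 1-minute schedule
        Console.WriteLine($"Finished {DateTime.UtcNow:O}");
    }
}

// Registration: fires every minute, so the next instance is due
// before the previous one finishes its 70-second sleep.
RecurringJob.AddOrUpdate<SlowRecurringJob>("slow-job", j => j.Run(), Cron.Minutely());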

resulkilic commented 1 year ago

This was a problem I was having too. I solved this in my own project, I hope it works for you.

See: https://stackoverflow.com/a/74334104/20430576

caveofjulian commented 1 year ago

@resulkilic your code looks promising. Fortunately, I have handed the project over to another programmer. He told me he forked this MySql library and fixed it in the source code itself. I'm not sure if he opened a pull request; probably not. He did the same for a bug that froze all MySQL tables when a lot of jobs failed: they became completely unresponsive.

I will never use this library again; it's super unstable for bigger projects.

DaGrisa commented 1 year ago

Same problem here. Will this be fixed in the future, or is it "as designed"? Especially with recurring jobs, this implementation (blocking a worker) makes no sense to me.

resulkilic commented 1 year ago

The purpose of the filter is to prevent a job from being run again while an instance of it is still running.

DaGrisa commented 1 year ago

Makes sense, but in our case the "waiting" job uses a worker, which leads to filling up all workers with "waiting" jobs and freezing the system.

resulkilic commented 1 year ago

No, that's exactly the point of this filter: it prevents the queue from growing unnecessarily. I advise you to try it: trigger a task again while it is running and you will see that it is not added.

DaGrisa commented 1 year ago

> No, that's exactly the point of this filter: it prevents the queue from growing unnecessarily. I advise you to try it: trigger a task again while it is running and you will see that it is not added.

Okay, I was talking about the problem in general. The filter does not work with our configuration.

peheje commented 1 year ago

@flmbray:

As I understand it, when the scheduled job triggers every minute, it triggers a new job with a different job id, so it's not the same job. DisableConcurrentExecution won't help with this: reading the source, the DB lock is taken on the resourceName, which includes the job id.

I believe this is what Mutexes are for in their paid version https://docs.hangfire.io/en/latest/background-processing/throttling.html?highlight=disableconcurrentexecution#mutexes

But you could just as well use something like https://github.com/madelson/DistributedLock to make sure you are not running the same job type more than once.
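A sketch of that approach with the library's SQL Server provider (the lock name and the _connectionString field are illustrative; other providers follow the same pattern):

using Medallion.Threading.SqlServer;

public class PickingJobs
{
    private readonly string _connectionString; // illustrative; inject via constructor

    public PickingJobs(string connectionString) => _connectionString = connectionString;

    public void StartPicking(Picking picking)
    {
        // One lock name shared by every run of this job type.
        var jobLock = new SqlDistributedLock("StartPicking", _connectionString);

        // TryAcquire returns null if another holder owns the lock;
        // skip this run instead of blocking a worker behind it.
        using var handle = jobLock.TryAcquire();
        if (handle == null) return;

        // ... original job body ...
    }
}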

@caveofjulian:

I'm unsure exactly how this double processing happened. Could it be that ProcessPicking was called twice? In that case, DisableConcurrentExecution wouldn't help you; see above.

It could also happen if the Enqueue loop fails halfway through: Hangfire would retry StartPicking from the beginning, enqueueing some of the per-orderline jobs twice. I believe there's only a small chance of this, but if there are many orders to process and the DB is unstable, it could be the cause. Their paid Batches feature could be used here, since it writes many background jobs atomically. Otherwise, I think disabling retries on that job and handling errors manually is the safest option. Or you could get creative.
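The no-retry option is just Hangfire's built-in AutomaticRetry filter with zero attempts, e.g.:

// Prevent Hangfire from re-running StartPicking from the top after a
// mid-loop failure; a failed run stays failed and is handled manually.
[AutomaticRetry(Attempts = 0, OnAttemptsExceeded = AttemptsExceededAction.Fail)]
public void StartPicking(Picking picking)
{
    // ... enqueue loop as before ...
}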

If the file is big, you'll have a huge Picking object, so when you enqueue StartPicking, Hangfire will JSON-serialize it and put it into the "Arguments" column for this job. I suspect this could cause issues if a single row is multiple megabytes in size, but I'm unsure.
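If that does turn out to matter, the usual workaround is to pass a small identifier and load the data inside the job. A sketch reusing the helper names from the original code:

public async Task ProcessPicking(HttpRequest incomingRequest)
{
    var filePath = await StoreStreamAsync(incomingRequest, TriggerTypes.Picking);
    // Pass only the path: the serialized job arguments stay a few bytes
    // instead of a multi-megabyte Picking object.
    BackgroundJob.Schedule(() => StartPicking(filePath), TimeSpan.FromMinutes(20));
}

public void StartPicking(string filePath)
{
    var picking = XmlHelper.DeserializeFileAsync<Picking>(filePath)
        .GetAwaiter().GetResult();
    // ... rest of the job unchanged ...
}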

Gerappa92 commented 1 year ago

> This behavior highly depends on the storage you're using. With the latest version of SqlServer storage, everything should work like a charm out of the box.

> It seems to me that behavior that differs depending on how Hangfire stores the jobs is not good design. It is certainly giving me serious thoughts about dropping Hangfire completely.

> It's actually quite trivial to demonstrate the multiple-execution behavior: just set up a recurring job (even with Memory Storage) that executes every minute, and then put in a Thread.Sleep(70 * 1000) so that it takes 70 seconds. A bit of extra logging easily shows that two instances of that job run at the same time, clearly not what "DisableConcurrentExecution" seems to imply.

I have run into this scenario, but in my case the cause was placing the attribute in the wrong place. I was creating the recurring job in the following way:

RecurringJob.AddOrUpdate<IWebApiJobExecutor>(job.Id.ToString(), svc => svc.Call(webApiCallback), job.Cron);

And the attribute works for me only when I place it on the interface:

[DisableConcurrentExecution(60)]
internal interface IWebApiJobExecutor

Adding this attribute to the implementation or method didn't work.

Link to Stack Overflow

billpeace commented 1 year ago

Please try using the SkipWhenPreviousJobIsRunningAttribute job filter from the referenced gist: https://gist.github.com/odinserj/a6ad7ba6686076c9b9b2e03fcf6bf74e
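As I read the gist, the filter keeps a "Running" flag on the recurring job and deletes newly fired instances while a previous one is still executing. A usage sketch, once the attribute class from the gist is added to the project (job and id names are illustrative):

using Hangfire;

public class PickingJobs
{
    [SkipWhenPreviousJobIsRunning] // attribute class taken from the gist
    public void StartPicking(string filePath)
    {
        // ... job body ...
    }
}

// The gist's filter is built around recurring jobs.
RecurringJob.AddOrUpdate<PickingJobs>(
    "start-picking", j => j.StartPicking("<file path>"), Cron.Minutely());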