caveofjulian opened this issue 2 years ago
+1 here, we are using Hangfire 1.7.13, Hangfire.AspNetCore 1.7.13, and Hangfire.Mongo 0.7.12. Some jobs run twice with two server instances where the expected behavior is to run only once. We do not use the DisableConcurrentExecution attribute; is it needed to prevent this kind of behavior? Is this something related to Hangfire.Mongo or to Hangfire itself?
To be honest, I have found quite a few bugs in Hangfire by now. I think the fact that people tell us to use DisableConcurrentExecution should have been an indication not to go with Hangfire. Unfortunately it is now too late to change that choice. I implemented in-memory checks before the execution of jobs to avoid it. But I do realize that I am in a privileged situation due to a low amount of traffic (maybe 10K jobs a day). And I do tend to restart the server once a week, so it doesn't use up too much RAM.
Hangfire is designed to run with at-least-once execution semantics. If it considers your job to be abandoned (for instance, because the processing server was shut down and stopped sending heartbeats), it will allow the same job to be picked up by another worker.
This behavior depends heavily on the storage you're using. With the latest version of the SqlServer storage, everything should work like a charm out of the box.
Other storages may have issues. For example, the Postgres storage has an InvisibilityTimeout setting: if your job takes longer than the timeout, it will be repeatedly picked up by other workers. The solution in that case is to increase the timeout.
I don't know much about the MySql storage, but to me it seems really unstable. I recommend searching its documentation for InvisibilityTimeout and playing around with that setting. If that does not help, it's better to switch to another storage.
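The suggestion above can be sketched as follows. This is not verified against any particular storage package version; the option name InvisibilityTimeout exists in several community storages, and the value below is purely illustrative — the point is that it must exceed your longest-running job:

```csharp
// Sketch, assuming Hangfire.MySqlStorage; connectionString comes from
// your own configuration.
GlobalConfiguration.Configuration.UseStorage(
    new MySqlStorage(connectionString, new MySqlStorageOptions
    {
        // If a job runs longer than this, the storage considers it
        // abandoned and hands it to another worker, which is what
        // produces the duplicate executions described above.
        InvisibilityTimeout = TimeSpan.FromMinutes(30)
    }));
```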
This behavior highly depends on the storage you're using. With the latest version of SqlServer storage, everything should work like a charm out of the box.
It seems to me that different behavior based on the way Hangfire is storing the jobs is not a good design. It certainly is giving me serious thoughts about dropping Hangfire completely.
It's actually quite trivial to demonstrate the multiple-execution behavior - just set up a recurring job (even with Memory Storage) that executes every 1 minute, and then put in a Thread.Sleep(70 * 1000) so that it takes 70 seconds. A bit of extra logging easily shows that two of those jobs will run at the same time, clearly not what "DisableConcurrentExecution" seems to imply.
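The repro described above can be sketched like this (assuming the community Hangfire.MemoryStorage package for the in-memory storage; the job name and timings are illustrative):

```csharp
using System;
using System.Threading;
using Hangfire;

public static class SlowJob
{
    // Lock wait timeout of 10 seconds, while the job itself runs for
    // 70 seconds - longer than the one-minute schedule below.
    [DisableConcurrentExecution(10)]
    public static void Run()
    {
        Console.WriteLine($"started  {DateTime.UtcNow:O}");
        Thread.Sleep(70 * 1000);
        Console.WriteLine($"finished {DateTime.UtcNow:O}");
    }
}

// Elsewhere in startup, with storage and a server already configured:
// GlobalConfiguration.Configuration.UseMemoryStorage();
// RecurringJob.AddOrUpdate("slow-job", () => SlowJob.Run(), Cron.Minutely);
```

With the extra logging, the "started" lines of two executions appear before either "finished" line once the second trigger fires.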
This was a problem I was having too. I solved this in my own project, I hope it works for you.
@resulkilic your code looks promising. Fortunately, I handed the project over to another programmer. He told me he forked this MySql library and fixed it in the source code itself. I'm not sure if he submitted a pull request; probably not. He also did this for a bug that froze all MySql tables when a lot of jobs failed - they became completely unresponsive.
I will never use this library again; it's super unstable for bigger projects.
Same problem here, will this be fixed in the future or is it "as designed"? Especially with recurring jobs, this implementation (to block a worker) makes no sense to me.
The purpose of this is to prevent the job from running again while an instance of it is already running.
Makes sense, but in our case the "waiting" job uses a worker, which leads to filling up all workers with "waiting" jobs and freezing the system.
No, that's exactly the point of this filter: it prevents the queue from growing unnecessarily. I advise you to try it - trigger a task again while it is running and you will see that it is not added.
Okay, I was talking about the problem in general. The filter does not work with our configuration.
flmbray:
As I understand it, when the scheduled job triggers every minute, it triggers a new job with a different job id, so it's not the same job. DisableConcurrentExecution won't help for this; reading the source, the DB lock is taken on a resourceName that includes the job id.
I believe this is what Mutexes are for in their paid version https://docs.hangfire.io/en/latest/background-processing/throttling.html?highlight=disableconcurrentexecution#mutexes
But you could just as well use something like https://github.com/madelson/DistributedLock to make sure you are not running the same job type more than once.
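A sketch of that approach with the DistributedLock library linked above (class and lock names here are illustrative; the key point is locking on the job *type* rather than the Hangfire job id):

```csharp
using Medallion.Threading.SqlServer;

public class PickingJobs
{
    private readonly string _connectionString;

    public PickingJobs(string connectionString) => _connectionString = connectionString;

    public void StartPicking(int orderId)
    {
        // The lock name is shared by every StartPicking invocation, so at
        // most one can run at a time across all servers, regardless of
        // which Hangfire job id triggered it.
        var distributedLock = new SqlDistributedLock("start-picking", _connectionString);
        using var handle = distributedLock.TryAcquire();
        if (handle == null)
            return; // another instance of this job type is already running

        // ... the actual picking work ...
    }
}
```

Returning early (rather than waiting on the lock) avoids tying up a Hangfire worker with a blocked job, which was the freezing problem mentioned earlier in this thread.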
caveofjulian:
I'm unsure exactly how this double processing happened. Could it be that ProcessPicking was called twice? In that case DisableConcurrentExecution wouldn't help you, see above.
It could also happen if the Enqueue loop fails half-way through: Hangfire would retry StartPicking from the beginning, sending some e-mails twice. I believe there's only a small chance of this, but if there are many orders to process and the DB is unstable, it could be the cause. I think their paid Batches feature can be used here, which will create many background jobs atomically. Otherwise I think disabling retries on that job is the safest option, and handling errors manually. Or you could get creative...
If the file size is big, you'll have a huge Picking class, so when you enqueue StartPicking, Hangfire will JSON-serialize it and put it into the "Arguments" column for this job. I suspect this could cause issues if one row is multiple megabytes in size, but I'm unsure.
This behavior highly depends on the storage you're using. With the latest version of SqlServer storage, everything should work like a charm out of the box.

It seems to me that different behavior based on the way Hangfire is storing the jobs is not a good design. It certainly is giving me serious thoughts about dropping Hangfire completely.
It's actually quite trivial to demonstrate the multiple-execution behavior - just set up a recurring job (even with Memory Storage) that executes every 1 minute, and then put a Thread.Sleep(70 * 1000) so that it takes 70 seconds. A bit of extra logging easily shows that two of those threads will run at the same time, clearly not what "DisableConcurrentExecution" seems to imply.
I have run into this scenario. In my case, though, it was caused by placing the attribute in the wrong place. I was creating the recurring job in the following way:
RecurringJob.AddOrUpdate<IWebApiJobExecutor>(job.Id.ToString(), svc => svc.Call(webApiCallback), job.Cron);
And the attribute works for me only when I place it above my interface:
[DisableConcurrentExecution(60)]
internal interface IWebApiJobExecutor
Adding this attribute to the implementation or method didn't work.
Please try using the SkipWhenPreviousJobIsRunningAttribute job filter from the referenced gist: https://gist.github.com/odinserj/a6ad7ba6686076c9b9b2e03fcf6bf74e
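The attribute from that gist is applied directly to the job method. A sketch, assuming the SkipWhenPreviousJobIsRunningAttribute class from the gist has been copied into the project (the class and job names here are illustrative):

```csharp
public class ReportJobs
{
    // Skips this execution entirely if the previous recurring run is
    // still marked as running, instead of blocking a worker on a lock.
    [SkipWhenPreviousJobIsRunning]
    public void GenerateNightlyReport()
    {
        // long-running work
    }
}

// Registered as a recurring job in the usual way:
// RecurringJob.AddOrUpdate<ReportJobs>("nightly-report",
//     j => j.GenerateNightlyReport(), Cron.Daily);
```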
I am using Hangfire.AspNetCore 1.7.17 and Hangfire.MySqlStorage 2.0.3 for software that is currently in production.
Now and then, we get a report of jobs being executed twice, despite the usage of the [DisableConcurrentExecution] attribute with a timeout of 30 seconds.
It seems that as soon as those 30 seconds have passed, another worker picks up that same job again.
The code is fairly straightforward:
It schedules the main task (StartPicking) that creates the objects required for the two subtasks:
Send picking details to the customer
Send a status update to the customer
The first job is duplicated. Perhaps the second one as well, but that is not important enough to care about, as it only concerns a status update. However, the first job causes the customer to think that more items have been picked than is actually the case.
I would assume that Hangfire updates the state of a job to e.g. "in progress", and checks this state before starting a job. Is my timeout on the disabled concurrent execution too low? Is it possible in this scenario that the database connection to update the state takes about 30 seconds (to be fair, it is running on a slow server with ~8 GB RAM, 6 vCores), due to which the second worker picks the job up again?
Or is this a Hangfire specific issue that must be tackled?
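For reference, the attribute in question is typically applied like this (the timeout value is illustrative; note that this timeout is how long a competing worker waits for the distributed lock, not a limit on the job itself, so it needs to be well above the longest expected run time):

```csharp
public class PickingService
{
    // 15 minutes instead of 30 seconds; if another worker holds the lock
    // longer than this, the waiting execution fails and may be retried.
    [DisableConcurrentExecution(15 * 60)]
    public void StartPicking(int orderId)
    {
        // ... create the picking-details and status-update sub-jobs ...
    }
}
```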