elsa-workflows / elsa-core

A .NET workflows library
https://v3.elsaworkflows.io/
MIT License
6.24k stars · 1.14k forks

Ensure Hangfire Scheduling works properly in a distributed environment #4819

Open sfmskywalker opened 7 months ago

sfmskywalker commented 7 months ago

A user reported that the Hangfire provider was causing the Timer activity to fire on all application instances in a cluster. They mentioned the use of the Redis provider for Hangfire.

Let's try this scenario and make sure that it works as expected.

dwevedivaibhav commented 7 months ago

Hi @sfmskywalker

I have an additional question related to processing background jobs in different workers within the Hangfire server.

To illustrate, let's consider scheduling 20 background jobs across 3 containers, each with 2GB CPU and 4GB Memory. During distribution, everything works seamlessly. However, when the server crashes or scales down and a new server is brought up, Hangfire resumes distributing jobs. I've noticed a glitch: if one container (Hangfire server worker) initially receives 10 jobs, subsequent jobs are not allocated to other servers until the earlier allocated 10 jobs are processed. Meanwhile, other servers with fewer or no jobs continue to remain idle.

Thanks in Advance!!!

sfmskywalker commented 6 months ago

@dwevedivaibhav Maybe I misunderstand, but it seems to me that this is the way Hangfire works.

dwevedivaibhav commented 6 months ago

Hi @sfmskywalker,

I hope you're doing well.

We've encountered a critical issue regarding job duplication when the server crashes and a new server comes up. Despite implementing distributed locks, we're experiencing instances where jobs running on the original server continue to run on the new server after it crashes and restarts. This is causing duplicates and impacting our system performance.

Your urgent attention to this matter is greatly appreciated. We need to resolve this issue as soon as possible to prevent any further disruptions.

Thank you for your assistance.

sfmskywalker commented 6 months ago

Hi @dwevedivaibhav , from what you described, it seems to me that this is how Hangfire works, since it is up to Hangfire to determine what to do when jobs are interrupted midway. I wouldn’t be surprised if it allows you to configure the desired behavior in such events, but I wouldn’t know offhand. Perhaps their documentation contains some clues.
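For reference, Hangfire's behavior for jobs interrupted by a server crash can be tuned per job with its built-in filter attributes. A minimal sketch, assuming a hypothetical job class and body (only the attributes are real Hangfire API):

```csharp
using Hangfire;

public class SendReminderJob
{
    // If a server dies mid-execution, Hangfire re-enqueues the job once the
    // server's heartbeat expires. AutomaticRetry bounds how many times the job
    // is re-executed; DisableConcurrentExecution takes a distributed lock so
    // two servers cannot run this job method concurrently.
    [AutomaticRetry(Attempts = 3)]
    [DisableConcurrentExecution(timeoutInSeconds: 60)]
    public void Execute(string reminderId)
    {
        // Hypothetical job body; it must be safe to run more than once,
        // since Hangfire guarantees at-least-once (not exactly-once) execution.
    }
}
```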

sfmskywalker commented 6 months ago

Out of curiosity, I asked ChatGPT. It essentially confirmed my suspicion and offered some additional insights, such as the importance of implementing jobs to be idempotent (able to be executed multiple times without causing issues).

Here's our conversation if you’re curious: https://chat.openai.com/share/c3002890-b16a-4c54-9a27-24419e37fcde
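To illustrate the idempotency point: a job can record a completion marker and check it before doing any work, so that a duplicate execution after a crash becomes a no-op. A minimal sketch, where the `IProcessedJobStore` abstraction and all names are hypothetical:

```csharp
public interface IProcessedJobStore
{
    // Hypothetical persistence abstraction; a real implementation would
    // back this with Mongo, Redis, or SQL and make MarkProcessed atomic.
    bool IsProcessed(string jobKey);
    void MarkProcessed(string jobKey);
}

public class IdempotentJob
{
    private readonly IProcessedJobStore _store;

    public IdempotentJob(IProcessedJobStore store) => _store = store;

    public void Execute(string jobKey)
    {
        // If a previous execution already completed, skip the work entirely.
        if (_store.IsProcessed(jobKey))
            return;

        DoWork(jobKey);

        // Record completion so a re-delivered job becomes a no-op.
        _store.MarkProcessed(jobKey);
    }

    private void DoWork(string jobKey)
    {
        // Actual side effects go here.
    }
}
```

With this pattern, it no longer matters whether Hangfire runs the job once or several times across server restarts; the observable effect is the same.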

dwevedivaibhav commented 6 months ago

Hi @sfmskywalker,

Thank you for your response.

I appreciate that you've already reviewed the Hangfire documentation; I have reviewed it as well. However, despite implementing distributed locks, we're still encountering issues where the lock doesn't seem to function correctly when the server crashes and a new server comes up.

I'm reaching out to see if there might be any specific configurations or considerations that we may have overlooked. Your insights or suggestions would be greatly appreciated as we work to resolve this issue.

Looking forward to your guidance. Please find below the configuration I have added in Startup to support distributed workflow execution:

```csharp
services.Configure<TaskSettings>(Configuration.GetSection("TaskSettings"));
services.Configure<ConnectionsMongoDatabaseOptions>(Configuration.GetSection("ConnectionsMongoDatabaseOptions"));

// Configure Redis.
services.AddRedis($"{distribiutedCacheRedis?.ConnectionString}");

var migrationOptions = new MongoMigrationOptions
{
    MigrationStrategy = new MigrateMongoMigrationStrategy(),
    BackupStrategy = new CollectionMongoBackupStrategy()
};
var storageOptions = new MongoStorageOptions
{
    MigrationOptions = migrationOptions,
    CheckConnection = false
};

services.AddHangfire(configuration => configuration
    .SetDataCompatibilityLevel(CompatibilityLevel.Version_170)
    .UseSimpleAssemblyNameTypeSerializer()
    .UseRecommendedSerializerSettings(settings => settings.ConfigureForNodaTime(DateTimeZoneProviders.Tzdb))
    .UseMongoStorage(mongoDatabaseOptions?.ConnectionString + "/" + mongoDatabaseOptions?.DatabaseName, storageOptions));

services.AddHangfireServer((sp, options) =>
{
    options.HeartbeatInterval = TimeSpan.FromSeconds(2);
    options.ConfigureForElsaDispatchers(sp);
});

services.ConfigureCustomLogger();

services
    .AddElsa(elsa =>
    {
        elsa.UseMongoDbPersistence(ef => ef.ConnectionString = mongoDatabaseOptions?.ConnectionString + "/" + mongoDatabaseOptions?.DatabaseName);
        elsa.ConfigureDistributedLockProvider(options => options.UseProviderFactory(sp => name =>
        {
            var connection = sp.GetRequiredService<IConnectionMultiplexer>();
            return new RedisDistributedLock(name, connection.GetDatabase());
        }));
        elsa.UseRedisCacheSignal();
        elsa.AddQuartzTemporalActivities();
        elsa.UseHangfireDispatchers();
    });

services.AddElsaApiEndpoints();
```
sfmskywalker commented 6 months ago

> However, despite implementing distributed locks, we're still encountering issues where the lock doesn't seem to function correctly when the server crashes and a new server comes up.

What issues specifically are you encountering with distributed locking in combination with server crashes and new servers coming up?

dwevedivaibhav commented 6 months ago

Hi @sfmskywalker I'd like to address an issue we've encountered with our workflow execution across multiple servers. Allow me to illustrate with a simple example:

We have two servers, Server A and Server B. On Server A, we have Workflow A running, and on Server B, Workflow B is running. In the event of Server B crashing, a new server, Server C, takes over and resumes Workflow B from where it left off. However, we've observed an unexpected behavior where Workflow A from Server A also starts running on Server C, causing duplicate calls and inconsistencies in our system.

Despite implementing distributed locks, we're puzzled as to why Workflow A from Server A is being executed on Server C. This issue does not occur under normal circumstances when servers are not crashing.

We would greatly appreciate any insights or suggestions you may have on resolving this issue and ensuring the proper execution of workflows across servers.

Thank you for your attention to this matter.

sfmskywalker commented 6 months ago

I see. It definitely should not execute twice. Perhaps the workflow's status is Running in the DB, which could explain why it is picked up by Server C when it boots: the engine will attempt to resume workflows it thinks were interrupted (by checking whether their status is Running).

I didn't realize it before, but I'm afraid we have hijacked this GitHub issue with an off-topic discussion. Your issue is with Elsa 2, while this GitHub issue is for Elsa 3. If you don't mind, please open a separate issue about this problem; I will then tag it with Elsa 2 and add my comments there as well.

Thank you.

dwevedivaibhav commented 6 months ago

Hi @sfmskywalker I have already raised this issue. Please find the issue link below:

https://github.com/elsa-workflows/elsa-core/issues/5045

sfmskywalker commented 6 months ago

Perfect, thanks 👍🏻

NightWuYo commented 4 months ago

A kind ping: confirmed that this issue still exists in the current latest version, 3.1.2.