elsa-workflows / elsa-core

A .NET workflows library
https://v3.elsaworkflows.io/
MIT License

Elsa 3 rc1 beats the record and starts in "just" 6 and 1/2 minutes #4285

Open o2alexanderfedin opened 1 year ago

o2alexanderfedin commented 1 year ago

There is nothing unusual about the configuration. I am using PostgreSQL in Docker as the DB, which is shared between Elsa and a client app on my dev machine (MacBook Pro M1). The startup time is completely unpredictable: just a couple of minutes ago it took about 5 seconds to start. Similar slowness happened before, but it never took THAT LONG, just 10-15 seconds max. I have worked with RC1 for at least a couple of months without any updates. I cannot tell what is going on, as it does not log anything. No DB queries in the log during that time, nothing.

Any idea why?

sfmskywalker commented 1 year ago

I have never seen this or heard about it. For me it always starts fast. I never measured it, but I think it’s sub-second on my M2.

What I would do is start a new project and build it up piece by piece until you reproduce it. Or the other way around: remove parts one by one until the problem goes away. And/or try swapping out components, e.g. SQLite instead of Postgres, also one by one.

o2alexanderfedin commented 1 year ago

> I have never seen this or heard about it. For me it always starts fast. I never measured it, but I think it’s sub-second on my M2.

I have coded activities and flows.

> What I would do is start a new project, build it up piece by piece, until you reproduce it. Or the other way around: remove parts one by one until the problem goes away. And/or try swapping out components, e.g. SQLite instead of Postgres, also one by one.

It does not go to the DB; it is just doing some initialization. So Postgres vs. SQLite is not relevant here.

Is there a way to turn on some very verbose logging to see what exactly is going on in there?

sfmskywalker commented 1 year ago

Yes, from appsettings.json, set the default log level to Debug and make sure there are no namespace overrides with a less verbose setting.
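
A minimal sketch of what that could look like, assuming the host uses the standard ASP.NET Core `Logging` configuration section and has no other logging setup in code. Any additional keys under `LogLevel` (for example a hypothetical `"Elsa": "Warning"` entry) would override the default for that namespace and should be removed or lowered to `Debug` as well:

```json
{
  "Logging": {
    "LogLevel": {
      "Default": "Debug"
    }
  }
}
```

If the output is too noisy, it may be more practical to keep `Default` at `Information` and add a `Debug` entry only for the namespaces under investigation.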

o2alexanderfedin commented 1 year ago

It looks like if you have a very sophisticated workflow (i.e., a deep tree of activities), then at some point it becomes an issue.

I'll try to use the Elsa source code instead of the NuGet packages later this week to debug it.

sfmskywalker commented 1 year ago

Interesting. If you want, feel free to send me a sample of a workflow JSON that I can import, provided that it only uses “built-in” activities.

johnwc commented 1 year ago

I can vouch for this issue as well. All of our workflow nodes take around 6 minutes or more to start. We are on Elsa v2, and we do have pretty large, sophisticated workflows.

sfmskywalker commented 1 year ago

For Elsa 2, the issue is explainable. There, it happens when there are many workflow instances and the system is recreating triggers. For Elsa 3, I’m very curious to see this reproduced.

johnwc commented 1 year ago

How can Elsa 2 be improved? This is hurting us when running in K8s, with such long startup times for each pod.

sfmskywalker commented 1 year ago

If the trigger recreation on startup is indeed the issue, then I think we should make that startup job optional so you can disable it, while enabling Quartz.NET’s persistence provider if you haven’t done so already. This will improve startup time significantly when you have a large number of unfinished workflow instances. If that’s not the case, then we would need to first understand what other reasons there are for slow startup that I haven’t seen yet.

johnwc commented 1 year ago

Couldn't that process be moved to a background job during startup? Should it be holding up the entire start of the application?

sfmskywalker commented 1 year ago

Definitely another good option.

johnwc commented 1 year ago

It seems, though, that the indexing of triggers is the main culprit for us. What are the options here to improve this?

I also have the full debug log dump to share with you, if you have a way for me to email it to you. I only enabled the Debug logging level for Elsa and Hangfire. Maybe you can see if anything else in the logs jumps out as a startup issue.

info: Elsa.Services.Triggers.TriggerIndexer[0] Indexed triggers in 00:07:19.3945549

sfmskywalker commented 1 year ago

If you possess a substantial number of workflow definitions in the database (for instance, several hundreds), it might be beneficial to disable trigger indexing for those provided by the DatabaseWorkflowProvider. The reasoning behind this is that the triggers for these workflows are already established upon publishing the workflow. Therefore, re-indexing them during startup becomes redundant.

Should you have a large volume of workflow instances stored in the database, the likely source of the issue could be the StartJobs hosted service, which resides in the Elsa.Activities.Temporal.Common.HostedServices namespace within the Elsa.Activities.Temporal.Common project.

It's essential to note that this job becomes necessary only if:

  1. You're utilizing Quartz.NET in its in-memory mode as opposed to its SQL persistence provider.
  2. You're not employing Hangfire, or you're using Hangfire with in-memory storage rather than a persistent storage provider.

In summary, it's crucial first to ascertain the number of workflow definitions and workflow instances in your database. This will give a clearer picture of the root cause of the startup delays.

Regarding the StartJobs hosted service, it's registered with Dependency Injection (DI) in the AddCommonTemporalActivities extension method. If you wish to prevent this service from being registered, you have a couple of options:

  1. Manual Registration: Avoid using the AddCommonTemporalActivities extension method altogether. Instead, manually copy only the components you require into your custom extension method.
  2. TimersOptions Flag: Through a PR, introduce a flag within the TimersOptions object. This flag would allow you to control whether or not the StartJobs service should be registered with DI.

johnwc commented 1 year ago

We have 60 workflows, all database workflows. We were using Quartz with DB persistence, but when we added Hangfire we removed Quartz and replaced it with the Hangfire temporal services. What do you recommend for temporal services? Is Quartz better, or should we continue to use Hangfire temporal? We have multiple nodes, so we always use DB persistence for temporal services.

With this knowledge, what do you recommend?

sfmskywalker commented 1 year ago

I think Quartz.NET is more reliable as a high-resolution timer, but Hangfire should be good too. Both work great in a multi-node environment, and they both persist triggers. In that case, I recommend trying to disable the StartJobs hosted service and enjoying speedy startup times ;)