Budibase / budibase

Low code platform for building business apps and workflows in minutes. Supports PostgreSQL, MySQL, MariaDB, MSSQL, MongoDB, Rest API, Docker, K8s, and more πŸš€
https://budibase.com

Workflows - Server Side #579

Closed shogunpurple closed 4 years ago

shogunpurple commented 4 years ago

When we initially created the workflow feature, we wanted it to be an all-encompassing solution for client and server actions. This included allowing the user to perform actions based on triggers in the UI, as well as on the server side - such as record CRUD.

We want to make workflows entirely server side based, meaning that workflows can be triggered via HTTP, but only perform actions handled by the backend.

Workflows are managed by an orchestrator, which is essentially a class that iterates over a workflow definition and performs the relevant actions in it. You can see the orchestrator that was used for the client side apps in

packages/client/src/api/workflow

We need to:

mike12345567 commented 4 years ago

Did some further research into the last point, "Run the workflow blocks in a different process" - this is going to be very tricky client side. I've put an implementation in place that uses the worker thread API in Node via a worker farm, but this will not work at all in the builder.

The reason being that LevelDB cannot be accessed by multiple processes at the same time, see here.

This is something that we could remove altogether and simply go with a single instance for now, or we could continue to use the multi-process implementation when running in prod and fall back to single-threaded execution when running in the builder.

One downside to this is that it is much harder to test that it is working correctly, as we would need to run a real CouchDB instance (which I don't believe the tests do currently?)

Would love your input on this @shogunpurple - whether you think we should simply remove this idea for now or whether we should work on testing it properly.

mjashanks commented 4 years ago

Apologies in advance, as I don't have all the background. Feel free to ignore me, I couldn't help chipping in to this exciting problem :)

It sounds like you want to

  1. Run the same workflow code in production and locally
  2. Run in single process locally
  3. Pass off to another process in production

Could this be achieved using some RPC library - e.g. gRPC or ZeroMQ? Locally, we just send and listen in the same process.

e.g.

  1. POST /api/workflow/123 { someData ... }
  2. RPC Client > Send ( { someData } ) to WORKFLOW_SERVER
  3. RPC Server > Receive { someData } > Process Workflow

On Local...

  1. Environment variable: WORKFLOW_SERVER = 127.0.0.1:5000
  2. On App Startup, RPC Server > Listen ( "5000" )

On Production

  1. Environment variable: WORKFLOW_SERVER = 10.0.0.1:5000 (remote IP)
  2. On App Startup... Do not start RPC Server
  3. Start RPC Server on 10.0.0.1:5000

I am pretty sure that this can be achieved with ZeroMQ, but I've never used gRPC. ZeroMQ depends on node-gyp though, which is always an annoying dependency to have.
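Just to make the idea concrete, here's a rough sketch of what the two sides could look like with the zeromq npm package (untested, and WORKFLOW_SERVER / the function names are just placeholders):

```js
// Rough sketch using the zeromq npm package (v6 promise API) - untested.
const zmq = require("zeromq");

// Workflow server: binds to WORKFLOW_SERVER and processes whatever arrives.
async function startWorkflowServer() {
  const receiver = new zmq.Pull();
  await receiver.bind(`tcp://${process.env.WORKFLOW_SERVER}`);
  for await (const [msg] of receiver) {
    const { workflowId, someData } = JSON.parse(msg.toString());
    // run the workflow orchestrator here
  }
}

// API process: sends the trigger payload, regardless of whether the server
// lives in the same process (local) or on another box (production).
const sender = new zmq.Push();
sender.connect(`tcp://${process.env.WORKFLOW_SERVER}`);

async function triggerWorkflow(workflowId, someData) {
  await sender.send(JSON.stringify({ workflowId, someData }));
}
```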

mike12345567 commented 4 years ago

Those are some very interesting options, and I'm glad to have more input on this. I think whatever mechanism gets decided on here might end up being used for other parts of the system that need to operate in the same manner, so it probably needs some thought put into it!

First I'll just confirm what the goal is (at least in my understanding so far):

  1. Workflows can be triggered internally to the process, or externally via the API, so the orchestration and steps of the workflow should ideally be handled outside of the API process, so that the main thread of the server application isn't locked up handling internal processes.
  2. Ideally, locally I would like to run it in the same manner that it executes in production, but as PouchDB can only handle a single connection at a time, the main server thread must perform any database actions required. For simplicity's sake, at this point I think it is easier to just run the workflows as part of the main thread.
  3. Ideally, in production I would split out workflows into their own service: any time a trigger is fired, the trigger information is placed on a queue so that an external service can take over and carry out the actions. For the actions that exist today this isn't particularly critical, but if we have any beefy actions in the future it may be useful.

One idea I originally had to handle this is using Bull - an atomic queue backed by Redis for fast and efficient service orchestration. This would be easy to self-host, and I expect in the future we will likely have a need for a Redis cluster anyway, so it might generally be a nice addition.

Using it would, however, mean Redis needs to be running locally for the builder - I couldn't find an easy way to make this work (originally I played around with stubbing out Redis when running locally), so I built a basic in-memory replica of the Bull API so that messages are queued locally within the server application.
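For a rough idea of the shape of that, the stand-in only needs to mirror the small slice of Bull's API we actually call (add / process) - something along these lines (illustrative only, not the exact code):

```js
// Illustrative in-memory stand-in exposing the subset of Bull's API we use.
class InMemoryQueue {
  constructor(name) {
    this.name = name;
    this.handler = null;
    this.pending = [];
  }

  // mirrors Queue#process(fn)
  process(fn) {
    this.handler = fn;
    // flush anything queued before the handler was registered
    this.pending.splice(0).forEach(job => this.handler(job));
  }

  // mirrors Queue#add(data) - wraps the payload the same way Bull does (job.data)
  add(data) {
    const job = { data };
    if (this.handler) {
      // defer so the caller isn't blocked by workflow execution
      setImmediate(() => this.handler(job));
    } else {
      this.pending.push(job);
    }
  }
}

// In production this could be a real Bull queue backed by Redis instead:
// const Queue = require("bull");
// const queue = new Queue("workflow", process.env.REDIS_URL);
```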

To split out the workflow processing I made use of worker threads, through the worker-farm library, which maintains a pool of workers that JSON can be passed to for processing. Each time a message is lifted off the in-memory queue it is passed to the worker farm, which runs an orchestrator and executes the steps. However, this doesn't work in the builder, as it fails when it attempts to connect to Pouch. Pouch can state the type of DB it is currently talking to via its preferred adapter, so I'm using that to detect whether it is LevelDB under the hood or CouchDB, and deciding which method to use based on that.
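Roughly, the hand-off looks like this (sketched from memory - the worker-farm calls are real API, but the Orchestrator / queue / db names are just placeholders):

```js
// main process: hand each queued message to a pool of workers (sketch only)
const workerFarm = require("worker-farm");
const runWorkflow = workerFarm(require.resolve("./workflowWorker"));

// crude environment check - a PouchDB instance exposes the adapter it resolved to
const usingCouch = db.adapter === "http"; // "leveldb" means the builder's local store

queue.process(job => {
  if (usingCouch) {
    // production: the payload is serialised to JSON and run in a separate worker
    runWorkflow(job.data, err => err && console.error("workflow failed", err));
  } else {
    // builder: run the orchestrator inline to avoid multi-process LevelDB access
    new Orchestrator(job.data).execute();
  }
});
```

```js
// workflowWorker.js - executed inside a worker from the farm
module.exports = (workflowData, callback) => {
  // iterate the workflow definition and perform each step here
  callback(null, { executed: true });
};
```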

I think the way I am currently choosing whether or not to spin up multiple threads to execute on isn't great. I'm definitely going to have a look at ZeroMQ to see if I could use it in the way you describe: in production, spin up a completely separate child process running the ZeroMQ server and have the main process push messages to it, instead of using the worker-farm library - it might be a fair bit safer and more efficient to handle it that way. Having to use node-gyp is a bit of a pain, although it is an electron app, so it's not the end of the world!

One benefit I can see of going down the Bull route in the future is that any available process in any instance of the cluster would be able to handle the processing of a workflow, or any other task similar to a workflow in nature. I'm not sure this would be as easy to achieve with ZeroMQ, as we would need all instances to subscribe to each other - although still possible, I believe!

I've not used gRPC myself either - I'll give it a look as well, but I have to admit I don't know as much about it as I do ZeroMQ!

Sorry for the wall of text, but I just wanted to capture all the different bits of research I've done around this during the week, plus get some other eyes on it as well (hopefully it's all somewhat interesting!)

shogunpurple commented 4 years ago

Some really good discussion here - it is indeed an exciting problem!

My 2 cents - the underlying and most pressing issue here is that LevelDB does not allow concurrent access by multiple processes. This is true for both reader and writer processes. If we can solve that problem, it allows us to avoid coding around LevelDB at all and to use a similar job queueing API regardless of environment.

It appears that there are a few potential solutions to this fundamental issue we could look at initially:

There's also a PouchDB adapter for rocksdb, which should just be a drop-in replacement for what we have now. If using rocksdb would solve our issue, this might be a good first solution to try out.
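If I'm reading the PouchDB docs right, swapping in a leveldown-compatible store is just a constructor option, so trying rocksdb could be as small as this (untested sketch):

```js
// Untested sketch - PouchDB's Node adapter accepts a custom leveldown-compatible
// store via the `db` option, and the rocksdb package implements that interface.
const PouchDB = require("pouchdb");

const db = new PouchDB("budibase", {
  db: require("rocksdb"),
});
```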

Failing these, we may be stuck with the aforementioned LevelDB limitations. In that case, we might have to look at ZeroMQ, nanomsg, or whatever other messaging library works in this scenario.

The nicest outcome here is that we can use the same stack for job management locally and in prod, with configuration changes only. It seems like ZeroMQ is a more node/electron friendly way of going about this. Given that it is written in C++, installing the library will install all the required first-class node bindings for running the server as well as spawning the workers.

Another benefit of this approach is that we have total control over it. We can constantly tweak and refine our queuing mechanism, even splitting it out into a completely separate service in production if required. What we need to ascertain before making a decision here is:

  1. The cost of ignoring multi threading in the local builder and using a worker farm in prod. Is it worth deviating from this?
  2. The engineering effort to implement ZeroMQ or similar both locally and in production. Some considerations include the potential impact to the size of docker images - what happens when you build and deploy the latest docker image to production? Fault tolerance is another - what happens when the node server dies? Do we lose jobs and messages? Does ZeroMQ have job expiry, retries and nice things we may get out of the box with Redis/Bull?

To conclude - a quick note on this:

Pouch can state the type of DB it is currently talking to as part of its preferred adapter so I'm using that to detect if it is LevelDB under the hood or CouchDB to make the decision about what method to use.

We can probably just control this with an env variable that we set differently in prod. You can update those in the server Dockerfile πŸ‘
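i.e. something along these lines - the variable name here is just an example, not one that exists today:

```js
// Sketch - illustrative env variable, set differently in the server Dockerfile
const inProduction = process.env.BUDIBASE_ENVIRONMENT === "PRODUCTION";
const useWorkerFarm = inProduction; // instead of inspecting the PouchDB adapter
```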

mike12345567 commented 4 years ago

Thanks for all the input on this guys, really enjoying the discussion :)

rocksdb is Facebook's more fully featured and robust fork of LevelDB, allowing support for multiple reader processes with some config. See this excerpt from the FAQ below.

I was actually thinking about this too - whether there is a different adapter we could use locally for Pouch to get the workflows running in a separate process without hitting this issue. Sadly we need read/write access from both threads, as the main thread will be using read/write for the API, and workflows are capable of taking write actions on the DB.

One way to get around this that I considered was passing back any database write operations as and when they are needed. Something like IPC would be needed for this, but it would be possible to implement - the downside is that to some extent it defeats the purpose of having a secondary thread, as some actions have to hop back to the main thread to be performed. I also see that as an opportunity for some real headache bugs to be introduced.
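For the record, a rough sketch of what that hop back to the main thread might look like with worker_threads messaging (names invented, and it shows exactly the kind of back-and-forth I'd rather avoid):

```js
// inside the workflow worker: never touch the DB directly, ask the main thread to do it
const { parentPort } = require("worker_threads");

function saveRecord(record) {
  return new Promise(resolve => {
    parentPort.once("message", resolve); // wait for the main thread's response
    parentPort.postMessage({ type: "DB_WRITE", record });
  });
}

// on the main thread: the only process allowed to open LevelDB
const { Worker } = require("worker_threads");
const worker = new Worker("./workflowWorker.js");
worker.on("message", async msg => {
  if (msg.type === "DB_WRITE") {
    worker.postMessage(await db.put(msg.record)); // db = the main PouchDB instance
  }
});
```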

The way I have implemented it now, with a worker thread only used in production, was the quickest solution that I felt would provide the most value to start with. I'm trying to be careful not to prematurely over-optimize this part of the system, but past experience tells me that running background processes on the main thread of API servers can lead to some really nasty scenarios. My thinking was also that, since all apps that go into production will be performing their workflows on the same pool of server applications, there is a real danger that someone could deploy a particularly intense workflow - at least if the processing is always kept to a separate thread, we know there is no risk of random API instance slowdowns.

The cost of ignoring multi threading in the local builder and using a worker farm in prod. Is it worth deviating from this?

I think you're right @shogunpurple that we shouldn't over-engineer the local experience; to be honest I only have one real concern about running differently locally versus in production, and that is that it's harder to test, as most test cases are probably going to revolve around the experience in the builder. The builder test cases will cover the majority of the code that is executed when deployed to production, but this would be (as far as I understand it) the first main deviation in process between builder and production. There are of course ways to get around this - something I wanted to discuss was whether there are any test cases currently that operate against an actual CouchDB rather than Pouch.

The engineering effort to implement ZeroMQ or similar both locally and in production. Some considerations include the potential impact to the size of docker images - what happens when you build and deploy the latest docker image to production? Fault tolerance is another - what happens when the node server dies? Do we lose jobs and messages? Does zeroMQ have job expiry, retries and nice things we may get out of the box with Redis/bull?

I think personally I need to do a bit more research into ZeroMQ, as I'm not sure I fully understand how it solves some of the problems that Bull or any other queuing technology can solve - you've no central store of knowledge, per se. It's a messaging library rather than a queuing technology, but it would give us the ability to build a distributed queue between instances (no longer requiring a separate Redis instance). This could be a complex approach, however: the engineering effort of this vs spinning up a Redis cluster and letting it do all that work is quite a difference in effort.

We can probably just control this with an env variable that we set differently in prod. You can update those in the server Dockerfile

Also, yes - that was actually on my list for Monday, to ask whether there is any way today to differentiate between running in the builder and running in production so that I could use that instead, as I don't think the current method of checking the DB adapter is particularly safe. Adding a production ENV is definitely a better way of going about it!

shogunpurple commented 4 years ago

@mike12345567 there are no test cases that currently run against a real CouchDB locally. That's not to say that we can't do that - we would have to create a separate suite of tests that can be run against a real CouchDB instance, in CI perhaps.
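As a rough shape for that suite, a hypothetical test gated on CI providing a real CouchDB URL could look something like this (the env var name is invented here):

```js
// Hypothetical integration test that only runs when a real CouchDB URL is provided in CI
const PouchDB = require("pouchdb");

const couchUrl = process.env.COUCH_DB_URL; // e.g. http://localhost:5984 from a CI service
const maybeDescribe = couchUrl ? describe : describe.skip;

maybeDescribe("workflow orchestration against CouchDB", () => {
  it("writes records produced by a workflow run", async () => {
    const db = new PouchDB(`${couchUrl}/workflow-test`);
    await db.put({ _id: "record:1", name: "created by workflow" });
    const doc = await db.get("record:1");
    expect(doc.name).toBe("created by workflow");
  });
});
```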

It seems like the first step here is to test our assumptions around running background workflow jobs in production and understanding the limitations of doing so. At this point, the solution may be to control a user's number of workflows and how often they can be executed (all things we will be doing for real on the free tier, anyway).