crodrigoquero / C4Imaging

Workflow system based on Windows 10 / Linux services.

Background

About Workflows and Business Process Automation

A WorkFlow is a kind of program: a sequence of steps or phases that is repeated regularly in a business environment and in which people and/or software agents participate. There are several types of workflow, which I will discuss later.

Illustration: The Animated Social Media Workflow by Shinsaku Iwatachi

Another, more formal, definition of a WorkFlow is a coordinated sequence of tasks in which different processing entities participate. These processing entities can sit at different layers or levels of the system architecture, so they can be background workers, APIs, etc.

Each WorkFlow task runs in response to a business event managed by the WorkFlow. These events can be triggered by humans or by other processing entities, and they circulate, or rather flow, as messages handled by a messaging engine (a message broker).

APIs and/or Microservices Are Not Enough

REST APIs provide only endpoints for users; they are not intended to carry out long-running processes.

The HTTP requests that we send to and receive from an API must be short (in time), a few seconds at most; otherwise we are faced with a huge nonsense: what is the point of launching an HTTP request and having to wait minutes, even hours, for that request to end?

So we shouldn't try to get an API to do all the work. In these cases you have to resort to background workers. In other words: the API will pass all the heavy lifting to a background worker; when the worker receives the job, it returns a message of the type "WORK_UNIT_STARTED" (or something similar) to the API. When the job is finished, the workflow will issue another message or notification to inform the API of this circumstance, of the type "WORK_UNIT_PROCESS_COMPLETED", or "WORK_UNIT_PROCESS_FAILED" if an error has occurred during the process.

By the way, to implement this message flow we will need a message broker like RabbitMQ. Thus, both the API and the background worker must be subscribed to that message broker.
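
As a minimal sketch of this pattern, here is how a background worker might publish such a status message with the RabbitMQ .NET client (this assumes the RabbitMQ.Client v6 package; the queue name and message shape are hypothetical, not something this project defines yet):

```csharp
using System.Text;
using RabbitMQ.Client;

class StatusPublisherSketch
{
    static void Main()
    {
        var factory = new ConnectionFactory { HostName = "localhost" };
        using var connection = factory.CreateConnection();
        using var channel = connection.CreateModel();

        // Both the API and the background worker must use the same queue.
        channel.QueueDeclare(queue: "workflow-status", durable: true,
                             exclusive: false, autoDelete: false, arguments: null);

        // Hypothetical status message, published when the worker picks up a job.
        var body = Encoding.UTF8.GetBytes("{\"type\":\"WORK_UNIT_STARTED\",\"workUnitId\":\"42\"}");
        channel.BasicPublish(exchange: "", routingKey: "workflow-status",
                             basicProperties: null, body: body);
    }
}
```

The API side would consume from the same queue (for example with an EventingBasicConsumer) and update the work unit's status accordingly.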

In my opinion, if you are interested in microservices architecture, you must first master the programming of operating system services or workers. Microservices and workers must work together and must communicate with each other through a message broker, as I have said before.

Many think only of APIs when they refer to microservices, but the truth is that a microservice can be several things: an API, an operating system service (or daemon in Linux), a console application, etc.

Typical Workflow Processes

Existing Solutions

To articulate the business process in a workflow environment, there are some interesting proposals out there:

Problem with existing solutions

Most of them are great. But the problem is that you have to choose one... and to be able to choose properly, you may have to test all of them, and to do that you are going to need to learn about them. The learning curve will be high, so the whole process will eventually take some time.

If you're in a rush, which happens very often in software development, and if you are an experienced coder, you will probably build yourself a solution of your own.

WorkFlows are a crucial part of the business process, so you have to be careful about which dependencies you are going to establish in the company codebase. You will probably want to partner with a company that can assure you of long-term support. Many small independent developers are brilliant, but they can't guarantee that.

Please have your say on this matter in this discussion.

About this project

This project implements a WorkFlow system that can execute business processes of various types and purposes, and is mainly based on services / background workers for Windows 10 or Linux. It is developed with Visual Studio 2019 using the C# programming language. It is expected that its final version will allow end users to fully define any business process (WorkFlow) without having to resort to the development team.

In the current state of the project, each workflow is made up of an indeterminate number of services, which are responsible for executing each phase or state of the workflow to which they belong. These elements are called "states" or "workflow states" in the system domain vocabulary.

To facilitate communication between the calling application and the workflow, each workflow has its own API. There are other software components involved at that level, which are described in the "System Architecture" section of this document.

An example workflow is included that categorizes images in order to illustrate its operation with a practical case. Over time, other workflow states of general utility will be added that can be reused in a multitude of workflows.

This project is about Data Workflows

The type of WorkFlow that this project implements is a Data WorkFlow (see the "Typical Workflow Processes" section for more details). The implementation of other workflow types will be carried out through other projects (also based on Windows services).

When to use a Data Workflow

To explain when you might want to consider using a data workflow to carry out some sort of process, let me show you some real-life examples (use cases):

  1. A user needs to parse a large set of PDF files to extract data from them. Once each file is parsed, the extracted data needs to be inserted in a database, in order to provide end users with full-text search capabilities in a certain online document management facility.
  2. A list of files needs to be physically organized or categorized in directories every day. Once the process is completed, the system can deduce any file's category hierarchy by getting its full path and splitting it by the "\" character (see the sketch after this list).
  3. A company needs to periodically process very large files for whatever reason.
  4. Some company data scientists need to implement a machine learning pipeline. Machine learning pipelines consist of multiple sequential steps that do everything from data extraction and preprocessing to model training and deployment.
  5. A company needs to create a web crawler to find some particular data on the internet and, once done, proceed to index that data and/or perform certain actions.
  6. You want to produce image files with statistics and graphs based on datasets, and after that be able to create presentation files based on the aforementioned image files.
  7. You need to encode video from a series of image sets.
  8. An energy supplier company needs to calculate the monthly charges based on each customer's service plan and usage. To get the usage for every single user, the company needs to access several external data providers in order to get the energy meter readings, for example.
  9. A recruitment team needs to categorize all the received CVs by different criteria: applicants' skill sets, applicants' professional fields, etc.
  10. A government agency needs to validate and categorize sets of user documents like driving licenses, university credentials, and other similar paperwork. The system needs to use a TensorFlow model to recognize all of these documents.
  11. You need to manipulate Microsoft Word documents and you don't want to instantiate MS Office on the server every time you need to open one of those documents (you can, for instance, open an MS Word document using the Aspose.Words C# library inside your Workflow node to do such a thing).
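
For instance, the category-deduction idea in use case 2 boils down to a path split. A minimal sketch (the directory layout shown is hypothetical):

```csharp
using System;
using System.IO;

class CategoryFromPathSketch
{
    static void Main()
    {
        // Hypothetical file already filed into category directories by a workflow.
        var fullPath = @"C:\Output\Invoices\2021\March\invoice-001.pdf";

        // Each directory level corresponds to one level of the category hierarchy.
        var hierarchy = Path.GetDirectoryName(fullPath)
                            .Split(Path.DirectorySeparatorChar);

        Console.WriteLine(string.Join(" > ", hierarchy)); // C: > Output > Invoices > 2021 > March
    }
}
```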

Conclusion

So, when to use a data workflow? Basically, when you need to process data. Data processing is a huge field which includes validation, classification, sorting, calculation, organization and transformation of data. A data workflow is not interactive, i.e. it doesn't need human intervention along its processing cycle. If a process needs human intervention, let's say some kind of validation, then you need to implement a different workflow model.

A data workflow is almost like a black box. However, a data workflow can report its processing progress and final state.

Use a data workflow whenever you have to implement a process that has two or more steps (states). Otherwise, just implement a single Windows service (or state).

Project Domain Vocabulary

REMARK: In these early stages of the project's development, I'm not going to implement all of these concepts, but it is good to start getting familiar with them now. Any contribution or amendment to the conceptual layer of this project will be appreciated.

System Architecture

In this project, each workflow is composed of a series of services, which in the domain of this system are called WorkFlow States. The functionality of each one of these states is very concrete, humble and isolated, so a workflow state is reusable; its particular functionality can be useful in another WorkFlow.

Each WorkFlow state can be moved from one given location within the WorkFlow chain to another. This is possible because all of the WorkFlow states have exactly the same structure and behavior. A workflow state that currently occupies the eighth place in the chain can be moved to the first place, and everything will continue to work perfectly without the need to make changes to any workflow state's code.

Each WorkFlow state is nothing more and nothing less than a Windows 10 service (or a Linux daemon). Visual Studio 2019 offers a project template to create these services in .NET Core 3.x (the Worker Service template).
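
For orientation, this is roughly the shape that template produces; a minimal sketch (the class name is illustrative, and UseWindowsService assumes the Microsoft.Extensions.Hosting.WindowsServices package):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

public class WorkflowStateWorker : BackgroundService
{
    private readonly ILogger<WorkflowStateWorker> _logger;

    public WorkflowStateWorker(ILogger<WorkflowStateWorker> logger) => _logger = logger;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            // A workflow state would scan its input directory here (see "Basic Operation").
            _logger.LogInformation("Workflow state alive at {time}", DateTimeOffset.Now);
            await Task.Delay(1000, stoppingToken);
        }
    }
}

public class Program
{
    public static void Main(string[] args) =>
        Host.CreateDefaultBuilder(args)
            .UseWindowsService() // run as a Windows service; UseSystemd() for a Linux daemon
            .ConfigureServices(services => services.AddHostedService<WorkflowStateWorker>())
            .Build()
            .Run();
}
```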

The Visual Studio Solution projects

  1. C4ImagingNetCore.Notifier.Logging: Abstraction layer to decouple the logging provider
  2. C4ImagingNetCore: [removal imminent] All the project functionality will be moved to the "Helpers" project.
  3. C4ImagingNetCore.BackEnd: [removal imminent] All the project functionality will be moved to the "Helpers" project.
  4. C4ImagingNetCore.Helpers: Contains a few helper classes and enums.
  5. C4ImagingNetCore.UI: [removal imminent] Console app for manual testing in the early stages of the project
  6. C4ImagingNetCore.Workflow.Srv: The first WorkFlow State I've created.

REMARKS: A better, domain-based project naming convention must be applied. This list will be updated periodically.

System Components Summary

Below you can see the system components list and the current development status of each of them:

REMARKS: This list will be updated periodically, until the backend architecture is fully defined and stable. Please keep in mind that, because the system's building process is still at the architecture design stage, all these components are still subject to analysis and appraisal. Some of them can potentially be removed, renamed or redefined, and new components can be added.

Basic Operation

The workflow states work with files; that is, they accept files as input and produce files as output. Additionally, the files in the output can be grouped or categorized in directories. That is all. Both the input and output files can be images, MS Word documents, JSON documents, XML documents, datasets, etc., or a combination of all of them, depending on the use case.

Thus, since the workflow states work only with files, they monitor one or more directories. In particular, each service monitors the output directory or directories of its predecessor in the WorkFlow chain. The first workflow state monitors the workflow's inbox directory, which always has the same name, i.e. "InBox" (when the first service is executed for the first time, it creates a directory that has the same name as the service assembly, and within that directory a subdirectory called "InBox" is created).
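
A minimal sketch of that monitoring, assuming the directory convention just described (FileSystemWatcher is one obvious approach; a polling loop inside the worker would work just as well):

```csharp
using System;
using System.IO;
using System.Reflection;

class InBoxWatcherSketch
{
    static void Main()
    {
        // Per the convention above: <service assembly name>\InBox, created on first run.
        var assemblyName = Assembly.GetExecutingAssembly().GetName().Name;
        var inBox = Path.Combine(AppContext.BaseDirectory, assemblyName, "InBox");
        Directory.CreateDirectory(inBox);

        // Pick up files that were already waiting (see "System characteristics", point 1).
        foreach (var file in Directory.EnumerateFiles(inBox))
            Console.WriteLine($"Pending file: {file}");

        using var watcher = new FileSystemWatcher(inBox) { EnableRaisingEvents = true };
        watcher.Created += (sender, e) => Console.WriteLine($"New file: {e.FullPath}");

        Console.WriteLine($"Watching {inBox}; press Enter to stop.");
        Console.ReadLine();
    }
}
```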

As the WorkFlow execution progresses, these files move from one directory to another; that is, they disappear from the output directory of the previous WorkFlow state and appear in the output directory of the next WorkFlow state. In future versions of the system, this will not always be entirely true, since the system will grow in complexity and the output of a service may be redirected to another WorkFlow if certain conditions are met. In such cases, there will be a new component between services that will be in charge of making certain decisions and redirecting traffic accordingly (transitions).

The workflow state output directories can reference categories or assertions in their names.

System characteristics

  1. Workflow States are fault tolerant: If the system crashes, the execution can be continued from the precise point where the system was stopped. This is so because the first thing each service does is look for files already present in its input directory and process them before continuing its normal execution.

  2. Workflow States have a configurable startup: States can be configured at install time in order to modify their operation, by passing different arguments in the startup command line args array. For example, a given service can be instructed to process only certain types of files by passing the list of allowed file extensions as a parameter/argument (see the sketch after this list). You can also change the default input and output directories.

  3. Workflow States are movable: Potentially, Workflow States can occupy any position within the WorkFlow's chain. The position that a service occupies does not affect its internal operation; it is not necessary to make any changes to its internal structure if the service needs to be moved to a position different from its current one within the WorkFlow. This assertion will hold as long as the order of a given state does not negatively affect the final product of the entire workflow, or compromise its efficiency, accuracy or precision, which can occur in some cases (for example, complex calculations, code generation and database operations).

  4. Workflow States can be useful by themselves: The general-purpose WorkFlow states created in this project can be useful by themselves, without necessarily being integrated into a WorkFlow. Not only are they interesting in a microservices / monolithic architecture, but they can also be useful on a desktop computer that has Windows 10 Home Edition installed.

  5. Workflow States can be configured to change their behaviour: They can be configured to work with single files, groups of files (in zip format) or with work orders. The default input and output directories can be configured too, in order to facilitate linking to other WorkFlows. The accepted file types in the service's InBox can also be configured.

  6. Workflow States can be instantiated multiple times: You can create multiple instances of the same workflow state by passing different startup parameters to each state instance, which gives you the additional benefit of being able to easily compartmentalize the original state process into several separate and independent threads executed by those new instances. For example, if I have a state that processes image files of the types bmp, jpg, png, gif, etc., I could create an instance that processes only jpg files, which are the most numerous on my system. The only thing I have to do is pass just that file extension as a startup parameter to my new instance.
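
As an illustration of points 2, 5 and 6, here is a sketch of what such a startup configuration could look like using the CommandLineParser NuGet package; the option names are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using CommandLine;

// Hypothetical startup options for a workflow state instance.
public class StateOptions
{
    [Option("extensions", Separator = ',',
            HelpText = "File extensions this instance is allowed to process.")]
    public IEnumerable<string> Extensions { get; set; }

    [Option("inbox", HelpText = "Overrides the default input directory.")]
    public string InBox { get; set; }
}

public class Program
{
    // e.g. MyState.exe --extensions jpg --inbox C:\Workflows\Images\InBox
    public static void Main(string[] args) =>
        Parser.Default.ParseArguments<StateOptions>(args)
              .WithParsed(opts => Console.WriteLine(
                  $"Processing [{string.Join(", ", opts.Extensions ?? Array.Empty<string>())}] " +
                  $"from {opts.InBox ?? "<default InBox>"}"));
}
```

Running a second instance with --extensions jpg and a third with --extensions png is all it takes to split one state's workload across independent processes.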

Operational Premises

  1. A backup copy of the input files must be made: Some workflows can make modifications to the input files. So, as a norm, and in order to be able to reverse or cancel the process, backup copies of the input files must be made before the service's processing begins, if the service is going to make changes to those files.

  2. Each workflow must have its own API: In the scenario that I am trying to describe, each workflow is made up of a series of background workers. The workflow has an API that allows applications to interact with it. That API does more than serve as the workflow entry point; it also manages other things. Perhaps the API could implement an endpoint called "ProcessStatus" that accepts the token of a process as a parameter (see the sketch after this list). I will describe the WorkFlow API structure later. So yes, to allow the calling application to interact with a given workflow, and in order to simplify things, each workflow must have an API in front of it. Otherwise, the calling application would have to know a lot of internal details about every single workflow state (or service). Through encapsulation, all those details remain hidden from the calling application.

  3. No database engine at the workflow node level: At the workflow node level there is no database engine directly involved. WorkFlow states in this type of WorkFlow (a data WorkFlow) don't need a database for their internal functioning, but they can use a database indirectly during processing in order to get the data necessary to complete a certain task. Workflows as a whole, however, can use a database engine.

  4. Cannot install two services with the same name: No comments. Windows will not let you do such a thing.

  5. Services are isolated workers: There is no message broker at the service level; services can't communicate with each other, or with any other system. They don't have external dependencies to carry out their main (and unique) task.

  6. WorkFlow States do just one thing and do it well: The WorkFlow States (services) only process files to perform a concrete task on them, and keep an operational log during their lives. They are not responsible for launching notifications of any kind; that is work for other components of the system, which are responsible for monitoring, configuring and managing the services of a certain WorkFlow. Using classic workflow vocabulary, we can say that workflow states are responsible for one state: their own.

  7. Each WorkFlow must have its own message queue: Such a queue must be managed by the appropriate workflow components (i.e. the WorkFlow controller or the WorkFlow observer).

  8. Every WorkFlow state can be installed independently: Every WorkFlow state must have its own independent installer. This policy allows any user to install any WorkFlow state locally. The installation process must be extremely simple.

  9. States must not call external executable files: All the state's processing must run in the state application's thread pool.
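
To make premise 2 a little more concrete, here is a sketch of the kind of endpoint a workflow API could expose, as an ASP.NET Core controller; the "ProcessStatus" route and the status model are entirely hypothetical:

```csharp
using Microsoft.AspNetCore.Mvc;

// Hypothetical per-workflow API surface; none of these names exist in the project yet.
[ApiController]
[Route("api/[controller]")]
public class ProcessStatusController : ControllerBase
{
    // GET api/processstatus/{token}
    [HttpGet("{token}")]
    public ActionResult<WorkOrderStatus> Get(string token)
    {
        // A real implementation would look this up through the workflow's
        // own components (controller / observer) rather than hard-coding it.
        return new WorkOrderStatus { Token = token, State = "WORK_UNIT_STARTED" };
    }
}

public class WorkOrderStatus
{
    public string Token { get; set; }
    public string State { get; set; } // e.g. WORK_UNIT_PROCESS_COMPLETED / _FAILED
}
```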

Some Deployment Configurations

The system components can be deployed partially or totally, in different ways, to fulfill the needs of different users and business scenarios.

Below you can see some deployment possibilities.

Example

Let's say we have 4 workflow states:

In the first two cases, the user drags and drops a set of images onto the workflow inbox directory. In the last case, this action will be performed by the workflow API.

Below I'm going to show several deployments with these states and their outputs.

Deployment Diagram for 1st Sample Configuration (single-state version)

And here is state 1 output (images categorization by aspect ratio):

[Image: state 1 output]

As you can see, the output is quite simple: it just categorizes images by aspect ratio. Things are going to get more promising when we combine some of these states together. Let's keep going.

Deployment Diagram for 1st Sample Configuration (multi-state version)

Next we can see a similarly basic deployment, but this time with a few more states, which all together are going to do something a little bit more interesting:

[UML Deployment Diagram for 1st Sample Configuration]

And here is state 4 output:

[Image: state 4 output]

So, the workflow will conclude that your "Susan.jpg" file is an IMAX image, which was taken in Spain in 2019, and that it has a resolution of 4000 x 3000 pixels. Great. And of course, we can change the order of the workflow states, and we will get a different categorization if needed. For instance, we may want to categorize first by year, and then by location, leaving the image's technical details to the lowest levels of the category hierarchy.

Deployment Diagram for 3rd Sample Configuration (full-backend version)

The output of this last deployment will be identical to the previous one, i.e. the files are going to be organized in the same way, but the difference here is the presence of the API. With the API involved, we are at the workflow level; we are not going to interact with any state directly any more. Through the API we can interact with the entire workflow, by sending work orders or files to the workflow inbox. The API will also return workflow execution progress data for every workflow state, and the final workflow result. The API can also return the workflow output in JSON format if needed.

For more information about this topic, please have a look at this issue.

Final Goals

  1. The final system must be able to fulfill all the possible configurations (see the "Some Deployment Configurations" section for more details), and be efficient in all of the possible use cases and scenarios.
  2. A series of general-purpose and always-useful WorkFlow States will be constructed (on the main branch of the project), for demonstration purposes and to enrich the system. The list of those must grow over time.
  3. Web applications and REST APIs must be reduced to a series of CRUD operations; i.e., in the final and ideal state of the system, no direct communication with any message broker like RabbitMQ must be allowed at the web app or REST API level. These components can only interact with the workflow engine by sending work orders and by requesting work order statuses and their results. This strategy will eliminate the existing coupling between the mentioned levels and the behavior / business rules of the system. In other words: where things must really happen is at the workflow level.
  4. All the WorkFlow States must have their own releases on GitHub, so the end user can install them independently.
  5. Kubernetes scripts and Docker images must be available to carry out any deployment mentioned in point #1.
  6. Some Kubernetes scripts and Docker images can be automatically generated by the corresponding [undefined] tool.

Next Steps

Perhaps in the middle term there will be a single API for all WorkFlows. How will this be done? Well, a new abstraction layer will be added so that a WorkFlow becomes a data structure (probably using the JSON format). Thus, a single main API can handle multiple WorkFlows.
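
Purely as a speculative sketch of what such a data structure might look like (every field name here is hypothetical, and only C4ImagingNetCore.Workflow.Srv is a real project; the second state is invented for illustration):

```json
{
  "workflowName": "ImageCategorization",
  "inbox": "C:\\Workflows\\ImageCategorization\\InBox",
  "states": [
    { "order": 1, "assembly": "C4ImagingNetCore.Workflow.Srv", "args": [ "--extensions", "jpg,png" ] },
    { "order": 2, "assembly": "CategorizeByYear.Srv", "args": [] }
  ]
}
```

A single main API could parse such a document, install or address the listed states, and route files and work orders accordingly.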

That circumstance leads us to the possibility that it will be the user who defines the contents of such a structure through the aforementioned API, which is the same as saying that users can define their own WorkFlows without any further intervention from the software development team... Great.

Whichever direction the architecture takes, what is clear now is that some general-purpose WorkFlow States can be developed right now. Please have a look at this discussion if you believe you have a great idea regarding workflow states that you want to turn into something concrete.

Do not forget to read the contributing file to learn more about how you can contribute to the project.
