This feature will allow the orchestrator to run in distributed mode, with several instances of the orchestrator service running in a redundant, hot-hot configuration.
The existing job cache API is already designed for distributed operation. It uses a ticket system: operations such as adding, updating and removing jobs from the cache require a ticket, and the cache ensures that only a single ticket can be granted per job at any time.
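To make the contract concrete, here is a minimal sketch of what a ticket-based cache API could look like. This is illustrative only; the interface, method and type names are assumptions for the example and do not reflect the actual TRAC code.

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative sketch of a ticket-based job cache contract (hypothetical names)
public interface IJobCache<TValue> {

    // Grant a ticket for one job key; the cache guarantees at most one
    // open ticket per job at any time, so concurrent callers are serialized
    CacheTicket openTicket(String jobKey, Duration duration);

    // Mutating operations are only accepted with a valid, unexpired ticket
    void addEntry(CacheTicket ticket, TValue value);
    void updateEntry(CacheTicket ticket, TValue value);
    void removeEntry(CacheTicket ticket);

    // Tickets are released explicitly once the operation completes
    void closeTicket(CacheTicket ticket);

    // A ticket records which job it covers and when the grant expires,
    // so a crashed instance cannot hold a job indefinitely
    record CacheTicket(String jobKey, Instant expiry) {}
}
```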
The current implementation uses an in-process cache based on a simple Java concurrent map. To allow for running multiple orchestrator processes, a second implementation of the job cache is needed that will use a SQL database to coordinate locks, cache entries and tickets. This can re-use a lot of the foundation logic from the metadata service to handle a selection of common SQL dialects, so the job cache will support all the same dialects available for the metadata service.
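As an illustration of how a JDBC implementation could enforce the one-ticket-per-job rule, the sketch below grants a ticket by inserting a row into a hypothetical job_tickets table with a unique constraint on job_key, so at most one instance can hold an unexpired ticket for any job. The table and column names are assumptions for the example.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.time.Duration;
import java.time.Instant;

public class JdbcTicketGrant {

    // Attempt to grant a ticket for a job key; returns false if another
    // orchestrator instance already holds an unexpired ticket for the job.
    // Assumes a job_tickets table with a unique constraint on job_key.
    public static boolean tryGrantTicket(Connection conn, String jobKey, Duration duration)
            throws SQLException {

        var expiry = Instant.now().plus(duration);

        // Clear any expired ticket first, so a crashed process
        // cannot hold a job forever
        try (var cleanup = conn.prepareStatement(
                "delete from job_tickets where job_key = ? and expiry < ?")) {
            cleanup.setString(1, jobKey);
            cleanup.setTimestamp(2, Timestamp.from(Instant.now()));
            cleanup.executeUpdate();
        }

        try (var insert = conn.prepareStatement(
                "insert into job_tickets (job_key, expiry) values (?, ?)")) {
            insert.setString(1, jobKey);
            insert.setTimestamp(2, Timestamp.from(expiry));
            insert.executeUpdate();
            return true;
        }
        catch (SQLException e) {
            // A failed insert here indicates a duplicate key, i.e. another
            // instance holds the ticket (a production version would inspect
            // the SQLState to distinguish this from other errors)
            return false;
        }
    }
}
```

Relying on a plain unique constraint keeps the coordination logic portable across dialects, since it avoids dialect-specific locking statements.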
As a first step, the job cache interface must be exposed as a plugin that can be configured in the TRAC platform config file using the LOCAL protocol, with an explicit contract for the cache API. Then the JDBC implementation can be added.
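A rough sketch of protocol-keyed plugin selection is shown below, reusing the IJobCache sketch above. The factory and registration mechanism are hypothetical and stand in for the real TRAC plugin framework; they only illustrate how the LOCAL and JDBC implementations could sit behind one explicit contract.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.function.Function;

// Hypothetical factory illustrating protocol-based selection of a job
// cache plugin, matching the protocol declared in the platform config file
public class JobCacheFactory {

    // Implementations register themselves against a protocol name,
    // e.g. "LOCAL" for the in-process cache, "JDBC" for the SQL cache
    private static final Map<String, Function<Properties, IJobCache<?>>> REGISTRY = new HashMap<>();

    public static void register(String protocol, Function<Properties, IJobCache<?>> factory) {
        REGISTRY.put(protocol.toUpperCase(), factory);
    }

    // Create the cache implementation named in the platform config,
    // passing through any protocol-specific properties
    public static IJobCache<?> createCache(String protocol, Properties config) {
        var factory = REGISTRY.get(protocol.toUpperCase());
        if (factory == null)
            throw new IllegalArgumentException("No job cache plugin for protocol: " + protocol);
        return factory.apply(config);
    }
}
```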
Other implementations using e.g. Hazelcast or other in-memory distributed technologies are possible; however, a SQL implementation will still easily give sub-100 ms latencies, which is more than sufficient. A SQL implementation also meets the core principles of simplicity and reducing technology dependencies.