Summary

Currently, the test integration framework has the capabilities to upload and test a network to a cloud environment (specifically GCP) but in addition to this, we want to add functionality to deploy and test a network on a local user machine. The cloud integration uses Terraform and Helm to deploy to a GKE environment to deploy the network and specified nodes. While this method works for cloud environments, we would like a more lightweight solution to run locally. Thus, we have chosen Docker Swarm as the tool of choice for container orchestration.

Docker swarm can configure a "swarm" on a local machine to deploy and manage containers on that specific swarm. Docker Swarm takes as input a docker-compose file in which all container information is specified and will handle the deployment of all containers on that local swarm. When we want to run a network test locally, we can create a swarm and have all the containers deploy via a docker-compose.json file that is built from specified network configurations. Docker swarm also gives us the ability to get logs of all the containers running in an aggregated way, meaning we do not have to query individual containers for their logs. This gives us a way to apply event filters to specific nodes (block producers, snark workers, seed nodes, etc), and check for test success/failure in a portable way.

Requirements

The new local testing framework should be run on a user's local system using Docker as its main engine to create a network and spawn nodes. This feature will be built on top of the existing Test Executive which runs our cloud integration tests. By implementing the interface specified in src/lib/integration_test_lib/intf.ml, we will have an abstract way to specify different testing engines when running the Test Executive.

The specific interface to implement would be:

(** The signature of integration test engines. An integration test engine
   *  provides the core functionality for deploying, monitoring, and
   *  interacting with networks.
   *)
  module type S = sig
    (* unique name identifying the engine (used in test executive cli) *)
    val name : string

    module Network_config : Network_config_intf

    module Network : Network_intf

    module Network_manager :
      Network_manager_intf
      with module Network_config := Network_config
       and module Network := Network

    module Log_engine : Log_engine_intf with module Network := Network
  end
end

To implement this interface, a new subdirectory will be created in src/lib named integration_local_engine to hold all implementation details for the local engine.

The new local testing engine must implement all existing features which include:

Starting/Stopping nodes dynamically
Sending GraphQL quries to running nodes
Streaming event logs from nodes for further processing
Spawning nodes based on a test configuration

Additionally, the test engine should take a Docker image as input in the CLI.

An example command of using the local testing framework could look like this:

$ test_executive local send-payment --mina-image codaprotocol/coda-daemon-puppeteered:  
1.1.5-compatible --debug | tee test.log | logproc -i inline -f '!(.level in \["Spam", "Debug"\])'

Note that this is very similar to the current command of calling the cloud testing framework.

Detailed Design

Orchestration:

To handle container orchestration, we will be utilizing Docker Swarm to spawn and manage containers. Docker Swarm lets us create a cluster and run containers on a cluster to manage availability. We have opted for Docker Swarm instead of other orchestration tools like Kubernetes due to Docker being much easier to run on a local machine while still giving us much of the same benefits. Kubernetes is more complex and is somewhat overkill for what we are trying to achieve with the local testing framework. Both Docker Swarm and Kubernetes can handle container orchestration but the complexity of dealing with Kubernetes does not give much payoff. Additionally, if we want community members to also use this tool, setting up Kubernetes on end-user systems would be even more of a hassle.

Docker Swarm takes a docker-compose file in which it will generate the desired network state. A cluster can be defined in Docker Swarm by issuing docker swarm init which creates the environment in which all containers will be orchestrated on. In the context of our system, we do not need to take advantage of different machines to run these containers on, rather we will run all containers on the local system. Thus, the end result of the swarm will be all containers running locally while Docker Swarm provides availability and other resource management options.

Creating a docker-compose file for local instead of terraform on cloud

In the current cloud architecture, we launch a given network with Terraform. We specify a Network_config.t data structure which holds all necessary information about creating the network and then it is transformed into a Terraform file like so:

type terraform_config =
    { k8s_context: string
    ; cluster_name: string
    ; cluster_region: string
    ; aws_route53_zone_id: string
    ; testnet_name: string
    ; deploy_graphql_ingress: bool
    ; coda_image: string
    ; coda_agent_image: string
    ; coda_bots_image: string
    ; coda_points_image: string
    ; coda_archive_image: string
          (* this field needs to be sent as a string to terraform, even though it's a json encoded value *)
    ; runtime_config: Yojson.Safe.t
          [@to_yojson fun j -> `String (Yojson.Safe.to_string j)]
    ; block_producer_configs: block_producer_config list
    ; log_precomputed_blocks: bool
    ; archive_node_count: int
    ; mina_archive_schema: string
    ; snark_worker_replicas: int
    ; snark_worker_fee: string
    ; snark_worker_public_key: string }
  [@@deriving to_yojson]

type t =
{ coda_automation_location: string
; debug_arg: bool
; keypairs: Network_keypair.t list
; constants: Test_config.constants
; terraform: terraform_config }
[@@deriving to_yojson]

https://github.com/MinaProtocol/mina/blob/67cc4205cc95138cf729a2f14b57b754f9e9204e/src/lib/integration_test_cloud_engine/coda_automation.ml#L35

We launch the network after all configuration has been applied by running terraform apply

We can leverage some of this existing work by specifying a config for Docker Swarm instead. Docker Compose can use a docker-compose file (which can be specified as a .json file https://docs.docker.com/compose/faq/#can-i-use-json-instead-of-yaml-for-my-compose-file) to launch containers on a given swarm environment. The interface can look mostly the same while cutting out a lot of the specific information needed by Terraform.

type docker_compose_config =
    { 
    ; coda_image: string
    ; coda_agent_image: string
    ; coda_bots_image: string
    ; coda_points_image: string
    ; coda_archive_image: string
    ; runtime_config: Yojson.Safe.t
          [@to_yojson fun j -> `String (Yojson.Safe.to_string j)]
    ; block_producer_configs: block_producer_config list
    ; log_precomputed_blocks: bool
    ; archive_node_count: int
    ; mina_archive_schema: string
    ; snark_worker_replicas: int
    ; snark_worker_fee: string
    ; snark_worker_public_key: string }
  [@@deriving to_yojson]

type t =
{ coda_automation_location: string
; debug_arg: bool
; keypairs: Network_keypair.t list
; constants: Test_config.constants
; docker_compose: docker_compose_config }
[@@deriving to_yojson]

By taking a Network_config.t struct, we can transform the data structure into a corresponding docker-compose file that specifies all containers to run as well as any other configurations. After computing the corresponding docker-compose file, we can simply call docker stack deploy -c local-docker-compose.json testnet_name

The resulting docker-compose.json file can have a service for each type of node that we want to spawn. Services in Docker Swarm are similar to pods in Kubernetes as they will schedule containers to nodes to run specified tasks.

A very generic example format of what the docker-compose.json could look as follows:

{
    "version":"3",
    "services":{
        "block-producer":{
            "image":"codaprotocol/coda-daemon-puppeteered",
            "entrypoint":"/mina-entrypoint.sh",
            "networks":[
                "mina_local_test_network"
            ],
            "deploy":{
                "replicas":2,
                "restart_policy":{
                    "condition":"on-failure"
                }
            }
        },
        "seed-node":{
            "image":"codaprotocol/coda-daemon-puppeteered",
            "entrypoint":"/mina-entrypoint.sh",
            "networks":[
                "mina_local_test_network"
            ],
            "deploy":{
                "replicas":1,
                "restart_policy":{
                    "condition":"on-failure"
                }
            }
        },
        "snark-worker":{
            "image":"codaprotocol/coda-daemon-puppeteered",
            "entrypoint":"/mina-entrypoint.sh",
            "networks":[
                "mina_local_test_network"
            ],
            "deploy":{
                "replicas":3,
                "restart_policy":{
                    "condition":"on-failure"
                }
            }
        }
    },
    "networks":{
        "mina_local_test_framework":null
    }
}

Logging:

Docker Swarm aggregates all logs from containers based on the running services. This makes it easy for us to parse out all logs on a container level without specifying specific containers.

The following is an example of the logs aggregated by Docker Swarm with 2 containers running the ping command.

$ docker service create --name ping --replicas 2 alpine ping 8.8.8.8

$ docker service logs ping
ping.2.odlt7ajje64e@node1    | PING 8.8.8.8 (8.8.8.8): 56 data bytes
...
ping.1.egjtdoz7tvkt@node1    | PING 8.8.8.8 (8.8.8.8): 56 data bytes
...

For our use case, we can specify different node types to be different services. For example, in our docker-compose configuration, we could specify a service for seed nodes, block producers, and snark workers and parse out the logs individually for each service. We can additionally do further computation on the logs to parse out which container is emitting these logs for a more granular level.

These logs can be polled on an interval and processed by a filter as they come in.

Interface To Develop:

The current logging for the cloud framework is done by creating a Google Stackdriver subscription and issuing poll requests for logs while doing some pre-defined filtering.

An example of this is shown below:

let rec pull_subscription_in_background ~logger ~network ~event_writer
    ~subscription =
  if not (Pipe.is_closed event_writer) then (
    [%log spam] "Pulling StackDriver subscription" ;
    let%bind log_entries =
      Deferred.map (Subscription.pull ~logger subscription) ~f:Or_error.ok_exn
    in
    if List.length log_entries > 0 then
      [%log spam] "Parsing events from $n logs"
        ~metadata:[("n", `Int (List.length log_entries))]
    else [%log spam] "No logs were pulled" ;
    let%bind () =
      Deferred.List.iter ~how:`Sequential log_entries ~f:(fun log_entry ->
          log_entry
          |> parse_event_from_log_entry ~network
          |> Or_error.ok_exn
          |> Pipe.write_without_pushback_if_open event_writer ;
          Deferred.unit )
    in
    let%bind () = after (Time.Span.of_ms 10000.0) in
    pull_subscription_in_background ~logger ~network ~event_writer
      ~subscription )
  else Deferred.unit

https://github.com/MinaProtocol/mina/blob/67cc4205cc95138cf729a2f14b57b754f9e9204e/src/lib/integration_test_cloud_engine/stack_driver_log_engine.ml#L269

A similar interface can be written for Docker-Swarm instead. By defining a Service.pull function with a given logger, we can leverage a lot of the work already done by modifying parts of the code where the log formats diverge. All logs can be specified to an output stream, such as stdout or a specified file by the user on their local system.

Work Breakdown/Prio

The following will be a work breakdown of what needs to be done to see this feature to completion:

Implement the Network_Config interface to accept a network configuration and create a corresponding docker-compose.json file.
Implement the Network_manager interface to take a corresponding docker-compose.json file and create a local swarm with the specified container configuration
Implement functionality to simply log all container logs into a single stream (stdout or a file, maybe this can be specified in startup?)
Implement filter on event functionality
Ensure that current integration test specs are able to run on the local framework with success

Unresolved Questions

Is compiling a docker-compose file the right approach for scheduling the containers? The nice thing about using a docker-compose file is that all network management should be automatic.
Is using a different service for each type of node the best effective approach? Would it be better to launch all nodes under the same service in the docker-compose file?
Is polling each service and then aggregating those logs the best approach? Would it be better to do filtering before aggregating?
Does this plan capture the overall direction we want the local testing framework to go?

MinaProtocol / mina

[RFC] Local Testing Framework #8890