apple / swift-distributed-actors

Peer-to-peer cluster implementation for Swift Distributed Actors
https://apple.github.io/swift-distributed-actors/
Apache License 2.0

[WIP] New multi-node infrastructure for integration tests #1055

Closed ktoso closed 2 years ago

ktoso commented 2 years ago

This replaces our previous "bunch of shell scripts" integration tests.

Resolves https://github.com/apple/swift-distributed-actors/issues/900

I found a bug while working on this, so this PR will also fix https://github.com/apple/swift-distributed-actors/issues/1054.

Ignore the ad-hoc JSON coders here; they were only added to debug the issue.

This introduces a new way to write multi-node tests, which can span actual processes and automatically join a cluster. We can aggressively KILL those processes and assert on the outputs of such clusters.
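
To make the "kill a process, then assert on its output" idea concrete, here is a minimal sketch using only Foundation's `Process`. It is not the MultiNodeTestKit implementation: `runAndKill` is a hypothetical helper, and the `/bin/sh` command merely stands in for a test node process.

```swift
import Foundation

/// Hypothetical helper (not part of MultiNodeTestKit): starts a child process,
/// kills it with SIGKILL after `delay` seconds, and returns what it printed
/// to stdout together with the reason it terminated.
func runAndKill(
    _ executable: String,
    _ arguments: [String],
    after delay: TimeInterval
) throws -> (output: String, reason: Process.TerminationReason) {
    let process = Process()
    process.executableURL = URL(fileURLWithPath: executable)
    process.arguments = arguments

    // Capture the child's stdout so we can assert on it afterwards.
    let stdout = Pipe()
    process.standardOutput = stdout

    try process.run()
    Thread.sleep(forTimeInterval: delay)

    // SIGKILL cannot be caught, so the child gets no chance to shut down cleanly.
    kill(process.processIdentifier, SIGKILL)
    process.waitUntilExit()

    let data = stdout.fileHandleForReading.readDataToEndOfFile()
    return (String(decoding: data, as: UTF8.self), process.terminationReason)
}

// Stand-in "node": prints a line, then replaces the shell with a long sleep
// (`exec` keeps it a single process, so the kill hits the right PID).
let result = try runAndKill("/bin/sh", ["-c", "echo node-started; exec sleep 30"], after: 1.0)
print(result.output, result.reason == .uncaughtSignal)
```

Because each node is a real OS process, the harness can check both what the node logged before dying and that it actually died from the signal rather than exiting on its own.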

We will also easily be able to deploy tests written using this infra to multiple actual physical nodes or docker containers -- similar to what Akka's multi-jvm tests did back in the day. This will let us verify behavior on real networks.

It is also great for reproducers -- we can replicate behavior exactly, without having to do the weird "make sure we resolve as remote" and other dances.

The screenshot is just an FYI of what the output looks like -- speaking for myself, I can't get complicated things solved without such reliable test infra, so I'm more than happy it is back!

Screenshot 2022-08-09 at 18 07 52

Tests are run via `swift package --disable-sandbox multi-node -c debug test` (or just `swift package --disable-sandbox multi-node test` to run in `-c release` mode). The plugin automatically compiles and runs tests in individual processes.


This is what an example test case looks like:


import DistributedActors
import MultiNodeTestKit

public final class ClusterCrashMultiNodeTests: MultiNodeTestSuite {
    public init() {}

    /// Spawns two nodes: first and second, and forms a cluster with them.
    ///
    /// ## Default execution
    /// Unlike normal unit tests, each node is spawned in a separate process,
    /// allowing us to kill nodes harshly by killing entire processes.
    ///
    /// It also eliminates the possibility of "cheating" and a node peeking
    /// at shared state, since the nodes are properly isolated as if in a real cluster.
    ///
    /// ## Distributed execution
    /// To execute the same test across different physical nodes pass a list of
    /// nodes to use when running the test, e.g.
    ///
    /// ```
    /// swift package multi-node test --deploy 192.168.0.101:22,192.168.0.102:22,192.168.0.103:22 // TODO
    /// ```
    ///
    /// This evenly spreads the test nodes across the passed physical worker nodes.
    /// The actual network is used, it remains possible to kill off nodes, and logs
    /// from all nodes are gathered automatically upon test failures.
    public enum Nodes: String, MultiNodeNodes {
        case first
        case second
    }

    public static func configureMultiNodeTest(settings: inout MultiNodeTestSettings) {
        settings.initialJoinTimeout = .seconds(5)
        settings.dumpNodeLogs = .always

        settings.installPrettyLogger = false
    }

    public static func configureActorSystem(settings: inout ClusterSystemSettings) {
        settings.logging.logLevel = .debug
    }

    public let testCrashSecondNode = MultiNodeTest(ClusterCrashMultiNodeTests.self) { multiNode in
        // A checkPoint suspends until all nodes have reached it, and then all nodes resume execution.
        try await multiNode.checkPoint("initial")

        // We can execute code only on a specific node:
        try await multiNode.on(.second) { second in
            try second.shutdown()
            return
        }

        try await multiNode.runOn(.first) { first in
            try await first.cluster.waitFor(multiNode[.second], .down, within: .seconds(10))
        }
    }
}
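
The checkpoint semantics described in the comment above ("suspends until all nodes have reached it, and then all nodes resume") are those of a barrier. The following is only an in-process sketch of that idea, with concurrent tasks standing in for nodes; the real `multiNode.checkPoint` coordinates separate processes, and `CheckPointBarrier`, `EventLog`, and `runDemo` are hypothetical names introduced for illustration.

```swift
import Foundation

/// Hypothetical in-process stand-in for a multi-node checkpoint.
final class CheckPointBarrier: @unchecked Sendable {
    private let lock = NSLock()
    private let parties: Int
    private var waiting: [CheckedContinuation<Void, Never>] = []

    init(parties: Int) { self.parties = parties }

    /// Suspends the caller until `parties` callers have arrived, then resumes all of them.
    func arrive() async {
        await withCheckedContinuation { (continuation: CheckedContinuation<Void, Never>) in
            lock.lock()
            waiting.append(continuation)
            // The last arrival takes everyone waiting (including itself) out of the list.
            let ready = waiting.count == parties ? waiting : []
            if !ready.isEmpty { waiting.removeAll() }
            lock.unlock()
            for waiter in ready { waiter.resume() }
        }
    }
}

/// Collects events from concurrently running "nodes" in arrival order.
actor EventLog {
    private(set) var events: [String] = []
    func record(_ event: String) { events.append(event) }
}

func runDemo() async -> [String] {
    let barrier = CheckPointBarrier(parties: 2)
    let log = EventLog()
    await withTaskGroup(of: Void.self) { group in
        for node in ["first", "second"] {
            group.addTask {
                await log.record("\(node): before")
                await barrier.arrive() // plays the role of checkPoint("initial")
                await log.record("\(node): after")
            }
        }
        await group.waitForAll()
    }
    return await log.events
}

let events = await runDemo()
print(events)
```

The order of the two "before" entries (and of the two "after" entries) is not deterministic, but no "after" entry can appear until both "before" entries have been recorded, which is the guarantee a test relies on when it places a checkpoint before killing a node.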
yim-lee commented 2 years ago

Added CI pipeline to run integration tests.

@swift-server-bot test this please

ktoso commented 2 years ago

Uncovering bugs in the receptionist rewrite, where ordering wasn't quite right anymore, resulting in test hangs (and bad receptionist ordering bugs in the op-log).

Mostly stable locally but still stabilizing tests while here...