Agoric / agoric-sdk

monorepo for the Agoric Javascript smart contract platform
Apache License 2.0
323 stars 204 forks source link

split the golang and Node.js parts of cosmic-swingset ? #5287

Open warner opened 2 years ago

warner commented 2 years ago

What is the Problem Being Solved?

In our chain architecture, each validator runs N+1 processes: one xsnap worker process for each vat, plus a single "hub" process to drive them all. This "hub" process hosts four components:

In ancient times, @michaelfig performed some deep magic to create this composite Go+Node executable. It depends upon some features of Node that might go away in future versions.

Recently, we've seen some strange crashes (#5031) in the loadgen task, in which this hub process experiences segfaults or double-free errors. We don't know if this is a Node.js assertion triggering, or something in the Go runtime.

We think it might be easier to diagnose problems like this if we could separate the hub into two separate programs: one written entirely in Go, the other written entirely in Node.js.

Description of the Design

We'd move cosmic-swingset (which would import Swingset) into its own process. The agd binary would be written in Go. When agd is launched, it would open some pipes and then spawn a copy of Node.js, pointed at the cosmic-swingset entry point .js file. All further communication between the Go side and the Node side would happen over those pipes, much like how the kernel talks to its workers over pipes.

Performance Considerations

This will introduce some new overhead between the two processes. Some examples of interactions that would be somewhat slower are:

We don't think those paths are heavily travelled right now, but making them slower might discourage us from taking fuller advantage of them. For example, it would help #3769 tremendously if swingset could simply store all of its state in the cosmos-sdk IAVL tree. Our first prototype actually did just that, but we found it was too slow. We've probably learned some tricks that would make it better now (we haven't tried the experimen recently), but that would certainly put the Go<->Node boundary on the hot path (worse because we already have a worker<->kernel hop on many of those DB requests).

Security Considerations

In general, separating software into smaller pieces is better for security, because the edges can perform validation on their inputs, and compromise of one component doesn't automatically mean compromise of the other. But, realistically, the kernel is such a deeply-trusted part of our system, that we wouldn't win much security by separating it from Node.js . A compromise of either side is fatal.

With the kernel/Node.js code running in its own process, we might be able to impose a #2836 -style seccomp jail upon it. If the kernel process became compromised, the attacker could control all of the vats, but could not take over the validator's account, which might protect other security and/or logging tools (including the supervisor that could re-launch the process).

Deployment Considerations

More processes are slightly more complicated. If one dies, we should make sure the other dies too. Admins will see more processes in their task listings, making it harder to know which one they should kill. However, most of this should be hidden in the primary agd process, and it's only a small increase over the status quo (where operators see several dozen worker processes already).

Test Plan

Probably just the existing integration tests in packages/cosmic-swingset, although it would be great to have an automated test of the two processes terminating if/when they see their peer exit.

mhofman commented 2 years ago

One impact is the loadgen which currently detects vat worker process by finding children of the chain process it starts. This will require updating the loadgen to support finding the kernel process as a child of the agd process first.

Tartuffo commented 2 years ago

Could control via a flag for the node binary and a flag for simagd (e.g. default is both in the same process, -split means we are running in split mode). Not a lot of changes on the Go side, since it talks JSON - serialize string and write to socket instead of calling function.

mhofman commented 2 years ago

Another argument for splitting would be signal handling. While looking into remediations for #5419, we realized that Node likely conflicts with go's signal handlers, and loses. We'd like to add a handler in Node to prevent committing a block after it has received a SIGTERM.

JimLarson commented 1 year ago

Notes from "Split Brain Design" document

Motivation

Current Architecture

Proposed New Architecture

Development plan

Notes on JSON-RPCv1

MVP for the actual split

Example Traffic

Vbank balance update

notification (agd → agvm):

{
  “id”: null,
  “method”: “vbank.BalanceUpdate”,
  “params”: [{
    “nonce”: 123,
    “updates”: [{
      “address”: “agoric1abc123”,
      “denom”: “uist”,
      “amount”: “654321”
    }]
  }]
}

Vbank balance query

request (agvm → agd):

{
  “id”: 123,
  “method”: “vbank.GetBalance”,
  “params”: [{“address”: “agoric1abc123”, “denom”: “uist”}]
}

response (agd → agvm):

{“id”: 123, “result”: “654321”, “error”: null}
mhofman commented 1 year ago

One issue that may potentially be fixed by this split is https://github.com/Agoric/agoric-sdk/issues/8047

aj-agoric commented 2 months ago

As discussed in the project management meeting for OrchestrationI will remove this from the orchestration delivery timeline and project