estuary / flow

🌊 Continuously synchronize the systems where your data lives, to the systems where you _want_ it to live, with Estuary Flow. 🌊
https://estuary.dev

Use Deno for derivation runtime #760

Closed. psFried closed this issue 1 year ago.

psFried commented 1 year ago

The goal is to use Deno instead of NodeJS for executing transform lambdas. The primary motivation for this is security and isolation. The current NodeJS implementation is difficult to sandbox, and Deno has a much better story around that.

psFried commented 1 year ago

I've done some research and playing around with Deno and deno_core and wanted to share some notes on how I see the major pieces fitting together.

Build-time operations

In addition to executing the lambdas at runtime, there are a number of other things we need to do during builds.

Currently, those operations are handled by a mix of npm and tsc. Deno provides all of them in a single subcommand, deno bundle, which emits a single Javascript file with all dependencies inlined. That output would be stored in the build database with a DENO_MODULE type in the resources table. Integrating Deno as a library gives a much lower-level interface where all of those operations are separate. I'd propose that, at least for starting out, we use the CLI for build-time operations: the Deno CLI would be added to our docker image, and builds would shell out to deno bundle to produce a fully inlined JS file for each derivation.
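
For illustration only, here is roughly what a generated per-derivation entrypoint might look like before bundling. The file name, the export shape, and the trivial lambda are all made up for the sake of the sketch rather than an actual Flow layout; the build would then run something like deno bundle derivation_main.ts derivation_bundle.js against it.

// derivation_main.ts: a made-up, generated entrypoint for a single derivation.
// The build would shell out to something like
//   deno bundle derivation_main.ts derivation_bundle.js
// and store the inlined output as a DENO_MODULE resource. In a real build this
// file would import the user's transform modules; a trivial lambda is defined
// inline here just so the sketch stands on its own.
type Document = Record<string, unknown>;

function publishMyTransform(source: Document): Document[] {
  // Illustrative publish lambda: pass the source document through unchanged.
  return [source];
}

// Expose lambdas under well-known names so a generated driver can find them.
export const lambdas: Record<string, (doc: Document) => Document[]> = {
  "myTransform/publish": publishMyTransform,
};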

Changing the scope of the bundle

Our current NPM modules are scoped to the build. That is, there is a single npm module that includes the code for all the derivations within a given build. As part of our move to Deno, we should change the scope of those so that we end up with a separate module (and a separate V8 isolate) for each derivation. Having a separate isolate per derivation allows for better isolation, as the name implies, and also allows multiple derivations from the same build to run in parallel on the same reactor.

Runtime integration

I think the basic form of the integration would be an Extension (in the deno_core sense) that we implement. This extension would provide ops that allow JS code to read documents from the runtime and send results back to it. We would then have an automatically generated main.ts module that drives the execution of user-provided transform functions using those primitives. Something like this extremely hand-wavy illustrative example:

// Hand-wavy driver loop: pull the next source document from the runtime,
// dispatch it to the matching user-defined lambda, and send the result back.
while (true) {
  const next_doc = Flow.getNextSourceDocument();
  const userFunction = lookupUserFunction(next_doc.source_meta);
  const result = userFunction(next_doc.document);
  Flow.sendResult(result);
}
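
To make the Flow global used above slightly more concrete, here's a minimal sketch of how it might be backed by ops registered from our extension. The op names and the exact binding surface (Deno.core.ops below, versus opAsync, depending on the deno_core version) are assumptions, not a real API:

// Minimal sketch; the op names and document shape below are hypothetical.
type SourceDoc = { source_meta: string; document: unknown };

// deno_core exposes registered ops to JS under Deno.core.ops (exact surface
// varies by version); cast through `any` since these particular ops are made up.
const ops = (globalThis as any).Deno.core.ops;

const Flow = {
  // Ask the runtime for the next source document to transform.
  getNextSourceDocument(): SourceDoc {
    return ops.op_flow_next_source_doc() as SourceDoc;
  },
  // Hand a transform result back to the runtime.
  sendResult(result: unknown): void {
    ops.op_flow_send_result(result);
  },
};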

A particular detail I'd like to point out is the lookupUserFunction function. We need some way to associate source documents with specific update/publish lambdas. There are lots of different ways we could do that, and I'm not suggesting any particular implementation just yet (and I don't know what exactly source_meta would be). But some options that I see there are:

Some other things to note about runtime integration:

psFried commented 1 year ago

In terms of the migration, I think it makes sense to have the flowctl-go api commands start to build both Deno and NPM modules, and shove them both into the build database. This allows us to choose between the two runtimes at runtime :grin: We could then roll out the change in the runtime by either adding a command line flag or env variable, which would be pretty easy to roll back. We could even do this on a per-shard basis if we want to be extra careful.

In terms of the nitty gritty, I imagine that the choice of whether to use Deno or NPM would be communicated to the derive pipeline when it's configured. The derive pipeline would then create either HTTP or Deno invocations as appropriate. This is admittedly a little weird, but it allows the decision to be deferred until runtime, which I think is desirable. Once all derivations are switched over and happy, we can then make Deno the default and remove support for Node altogether.

jgraettinger commented 1 year ago

Great write up, thank you. Your notes and next steps all make sense to me.

We could potentially also use a separate "realm" ("context" in V8 terminology) for each lambda within an isolate, to guard against users who may try to update some global JS state in one lambda invocation and then read it in another.

If we did this, then we'd presumptively have a main.ts per lambda and thus not have a lookup step, correct?

We may have our hands forced somewhat by the architecture of Pipeline, which does model invocations of update and publish lambdas as fully independent and concurrent executions. Having an async function / generator thing per lambda may be simpler to integrate and reduce coupling, but :shrug: we'll see. The existing Block-centric processing model may also make it desirable to go straight to vectorized execution.

psFried commented 1 year ago

> Having an async function / generator thing per lambda may be simpler to integrate and reduce coupling, but :shrug: we'll see.

Yeah, that's my thought, too.

And also agreed that the Blocks seem to afford passing data in equivalent chunks.

psFried commented 1 year ago

> If we did this, then we'd presumptively have a main.ts per lambda and thus not have a lookup step, correct?

Yeah, I think that's roughly how it would work. I'm not sure if we'd still need a single "main" main.js that ties them all together, though. It'll take some testing and research to figure out the proper way to drive multiple async scripts simultaneously, but I think the naive approach that I'd try first would be to simply skip the call to load_main_module and instead try driving each as a separate "side module".
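
To sketch what driving each lambda as its own independent async loop could look like from the JS side, here's a toy version; the function signatures and the null end-of-stream convention are made up purely for illustration:

// Toy sketch: one independent async driver loop per lambda, interleaved by the
// isolate's event loop. Names and the end-of-stream convention are hypothetical.
type Lambda = (doc: unknown) => Promise<unknown[]>;

async function driveLambda(
  lambda: Lambda,
  nextDoc: () => Promise<unknown | null>, // resolves null once the source is drained
  sendResults: (results: unknown[]) => Promise<void>,
): Promise<void> {
  for (let doc = await nextDoc(); doc !== null; doc = await nextDoc()) {
    await sendResults(await lambda(doc));
  }
}

// Each lambda gets its own loop, so there's no lookup step at dispatch time:
// await Promise.all(loops.map((l) => driveLambda(l.lambda, l.nextDoc, l.sendResults)));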

psFried commented 1 year ago

I had a conversation with Joseph the other day, where he brought up WebAssembly for derivations. WASM is pretty neat, and we've always felt that it seems like a good fit for Flow derivations (and even custom reduction functions). But it didn't really feel like a good "first" runtime for derivations, for a few reasons. The ecosystem was very new, and the tooling very immature. And the languages that seemed most appealing (Python, Typescript, SQL, etc.) didn't really have a good way to compile to WASM. We really wanted a statically typed language so that we could do end-to-end type checking of data pipelines, and so Typescript was a pretty solid choice. And Deno gives us a great way to have sandboxed execution. But WASM still seems pretty appealing for a few reasons:

Basically, there are good reasons to support both Typescript and WASM, but Typescript won out, which was probably the right call. But wouldn't it be cool if we could compile Typescript to WASM? Then we could get the benefits of both, without having to support two separate runtimes in the derive pipeline. It turns out that Assemblyscript (also referred to as 'AS') has made a lot of progress in the last few years, and maybe it could allow us to use WASM instead of Deno. I'm honestly a little skeptical, but it seems worth exploring, so these are my notes on the subject.

Assemblyscript language differences

Assemblyscript is a different language from Typescript, even though they have a ton of overlap.
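
A few concrete examples of the sort of differences I mean, based on my (possibly outdated) understanding of AS. These are all valid Typescript but, as far as I know, would not compile as AS:

// Valid Typescript that (as I understand it) would not fly in Assemblyscript.

// 1. `any` and union types: AS wants concrete, statically known types everywhere.
function coerce(input: any): string | number {
  return typeof input === "string" ? input : 0;
}

// 2. Dynamic JSON: AS has no built-in JSON.parse producing untyped objects, so
//    documents would need declared classes or a third-party JSON library.
const doc = JSON.parse('{"a": 1}');

// 3. Numbers: AS works with WebAssembly's i32/i64/f32/f64 types rather than a
//    single catch-all `number`.
const count: number = 1;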

WASM Runtime integration

Using AS implies using WASM, so there are a lot of questions about how the runtime would integrate with Flow.
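
Just to make the shape of the problem concrete, here's one purely hypothetical host-side sketch (written in Typescript for familiarity, though in practice the host would be the Flow runtime rather than a script) of invoking a WASM transform by copying a JSON document into linear memory. The exported names, the pointer/length calling convention, and the artifact path are all assumptions:

// Purely hypothetical sketch; exported names (alloc, transform), the calling
// convention, and the artifact path are assumptions, not a proposed interface.
const wasmBytes = await Deno.readFile("./derivation.wasm");
const { instance } = await WebAssembly.instantiate(wasmBytes, {});
const { memory, alloc, transform } = instance.exports as {
  memory: WebAssembly.Memory;
  alloc: (len: number) => number; // assumed exported allocator
  transform: (ptr: number, len: number) => number; // assumed entry point
};

// Copy a JSON-encoded source document into the module's linear memory...
const input = new TextEncoder().encode(JSON.stringify({ a: 1 }));
const ptr = alloc(input.length);
new Uint8Array(memory.buffer, ptr, input.length).set(input);

// ...and invoke the transform, which would hand results back through some
// agreed-upon convention (e.g. a pointer to a length-prefixed output buffer).
const resultPtr = transform(ptr, input.length);
console.log("result buffer at", resultPtr);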

Other thoughts

I think that Assemblyscript might actually be a better language for transforms than Typescript, at least in some respects. But there are still a lot of unknowns there, and there are also still reasons to support Deno as an execution environment. In the end, I think Deno is probably still the best next step for us, and it certainly doesn't preclude supporting a standalone WASM runtime in the future.

dhammika commented 1 year ago

Nice write-up! Can you give some pointers on how Node is invoked in the current setup? https://github.com/estuary/flow/blob/2680cfae863e563db5efce8c5040d5c4084d564c/go/flow/js_worker.go#L23-L62 With Deno, it sounds like we're thinking more of an FFI-style, tightly coupled, in-process invocation model.

psFried commented 1 year ago

This no longer seems warranted to put on our roadmap.