estuary / flow

🌊 Continuously synchronize the systems where your data lives, to the systems where you _want_ it to live, with Estuary Flow. 🌊
https://estuary.dev

Use Deno for derivation runtime #760

Closed. psFried closed this issue 1 year ago.

psFried commented 1 year ago

The goal is to use Deno instead of NodeJS for executing transform lambdas. The primary motivation for this is security and isolation. The current NodeJS implementation is difficult to sandbox, and Deno has a much better story around that.

psFried commented 1 year ago

I've done some research and playing around with Deno and deno_core and wanted to share some notes on how I see the major pieces fitting together.

Build-time operations

In addition to executing the lambdas at runtime, there are a number of other things we need to do during builds.

Currently, those operations are handled by a mix of npm and tsc. Deno provides all of them in a single subcommand, deno bundle, which emits a single Javascript file with all dependencies inlined. That output would be stored in the build database with a DENO_MODULE type in the resources table. Integrating Deno as a library gives a much lower-level interface where all of those operations are separate. I'd propose that, at least for starting out, we use the CLI for build-time operations: the Deno CLI would be added to our docker image, and builds would shell out to deno bundle to produce a fully inlined JS file for each derivation.
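
For illustration only, here is roughly what a generated per-derivation entrypoint might look like before bundling. The file name, the export shape, and the trivial lambda are all made up for the sake of the sketch rather than an actual Flow layout; the build would then run something like deno bundle derivation_main.ts derivation_bundle.js against it.

// derivation_main.ts: a made-up, generated entrypoint for a single derivation.
// The build would shell out to something like
//   deno bundle derivation_main.ts derivation_bundle.js
// and store the inlined output as a DENO_MODULE resource. In a real build this
// file would import the user's transform modules; a trivial lambda is defined
// inline here just so the sketch stands on its own.
type Document = Record<string, unknown>;

function publishMyTransform(source: Document): Document[] {
  // Illustrative publish lambda: pass the source document through unchanged.
  return [source];
}

// Expose lambdas under well-known names so a generated driver can find them.
export const lambdas: Record<string, (doc: Document) => Document[]> = {
  "myTransform/publish": publishMyTransform,
};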

Changing the scope of the bundle

Our current NPM modules are scoped to the build. That is, there is a single npm module that includes the code for all the derivations within a given build. As part of our move to Deno, we should change the scope of those so that we end up with a separate module (and a separate V8 isolate) for each derivation. Having a separate isolate per derivation allows for better isolation, as the name implies, and also allows multiple derivations from the same build to run in parallel on the same reactor.

Runtime integration

I think the basic form of the integration would be an Extension (in the deno_core sense) that we implement. This extension would provide ops that allow JS code to read documents from the runtime and send results back to it. We would then have an automatically generated main.ts module that drives the execution of user-provided transform functions using those primitives. Something like this extremely hand-wavy illustrative example:

// Hand-wavy driver loop: pull the next source document from the runtime,
// dispatch it to the matching user-defined lambda, and send the result back.
while (true) {
  const next_doc = Flow.getNextSourceDocument();
  const userFunction = lookupUserFunction(next_doc.source_meta);
  const result = userFunction(next_doc.document);
  Flow.sendResult(result);
}
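
To make the Flow global used above slightly more concrete, here's a minimal sketch of how it might be backed by ops registered from our extension. The op names and the exact binding surface (Deno.core.ops below, versus opAsync, depending on the deno_core version) are assumptions, not a real API:

// Minimal sketch; the op names and document shape below are hypothetical.
type SourceDoc = { source_meta: string; document: unknown };

// deno_core exposes registered ops to JS under Deno.core.ops (exact surface
// varies by version); cast through `any` since these particular ops are made up.
const ops = (globalThis as any).Deno.core.ops;

const Flow = {
  // Ask the runtime for the next source document to transform.
  getNextSourceDocument(): SourceDoc {
    return ops.op_flow_next_source_doc() as SourceDoc;
  },
  // Hand a transform result back to the runtime.
  sendResult(result: unknown): void {
    ops.op_flow_send_result(result);
  },
};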

A particular detail I'd like to point out is the lookupUserFunction function. We need some way to associate source documents with specific update/publish lambdas. There are lots of different ways we could do that, and I'm not suggesting any particular implementation just yet (and I don't know what exactly source_meta would be). But some options that I see there are:

Some other things to note about runtime integration:

psFried commented 1 year ago

In terms of the migration, I think it makes sense to have the flowctl-go api commands start to build both Deno and NPM modules, and shove them both into the build database. This allows us to choose between the two runtimes at runtime :grin: We could then roll out the change in the runtime by either adding a command line flag or env variable, which would be pretty easy to roll back. We could even do this on a per-shard basis if we want to be extra careful.

In terms of the nitty gritty, I imagine that the choice of whether to use Deno or NPM would be communicated to the derive pipeline when it's configured. The derive pipeline would then create either HTTP or Deno invocations as appropriate. This is admittedly a little weird, but it allows the decision to be deferred until runtime, which I think is desirable. Once all derivations are switched over and happy, we can then make Deno the default and remove support for Node altogether.

jgraettinger commented 1 year ago

Great write up, thank you. Your notes and next steps all make sense to me.

We could potentially also use a separate "realm" ("context" in V8 terminology) for each lambda within an isolate, to guard against users who may try to update some global JS state in one lambda invocation and then read it in another.

If we did this, then we'd presumptively have a main.ts per lambda and thus not have a lookup step, correct?

We may have our hands forced somewhat by the architecture of Pipeline, which does model invocations of update and publish lambdas as fully independent and concurrent executions. Having an async function / generator thing per lambda may be simpler to integrate and reduce coupling, but :shrug: we'll see. The existing Block-centric processing model may also make it desirable to go straight to vectorized execution.

psFried commented 1 year ago

> Having an async function / generator thing per lambda may be simpler to integrate and reduce coupling, but :shrug: we'll see.

Yeah, that's my thought, too.

And also agreed that the Blocks seem to afford passing data in equivalent chunks.

psFried commented 1 year ago

> If we did this, then we'd presumptively have a main.ts per lambda and thus not have a lookup step, correct?

Yeah, I think that's roughly how it would work. I'm not sure if we'd still need a single "main" main.js that ties them all together, though. It'll take some testing and research to figure out the proper way to drive multiple async scripts simultaneously, but I think the naive approach that I'd try first would be to simply skip the call to load_main_module and instead try driving each as a separate "side module".
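
To sketch what driving each lambda as its own independent async loop could look like from the JS side, here's a toy version; the function signatures and the null end-of-stream convention are made up purely for illustration:

// Toy sketch: one independent async driver loop per lambda, interleaved by the
// isolate's event loop. Names and the end-of-stream convention are hypothetical.
type Lambda = (doc: unknown) => Promise<unknown[]>;

async function driveLambda(
  lambda: Lambda,
  nextDoc: () => Promise<unknown | null>, // resolves null once the source is drained
  sendResults: (results: unknown[]) => Promise<void>,
): Promise<void> {
  for (let doc = await nextDoc(); doc !== null; doc = await nextDoc()) {
    await sendResults(await lambda(doc));
  }
}

// Each lambda gets its own loop, so there's no lookup step at dispatch time:
// await Promise.all(loops.map((l) => driveLambda(l.lambda, l.nextDoc, l.sendResults)));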

psFried commented 1 year ago

I had a conversation with Joseph the other day, where he brought up WebAssembly for derivations. WASM is pretty neat, and we've always felt that it seems like a good fit for Flow derivations (and even custom reduction functions). But it didn't really feel like a good "first" runtime for derivations, for a few reasons. The ecosystem was very new, and the tooling very immature. And the languages that seemed most appealing (Python, Typescript, SQL, etc.) didn't really have a good way to compile to WASM. We really wanted a statically typed language so that we could do end-to-end type checking of data pipelines, and so Typescript was a pretty solid choice. And Deno gives us a great way to have sandboxed execution. But WASM still seems pretty appealing for a few reasons:

Basically, there are good reasons to support both Typescript and WASM, but Typescript won out, which was probably the right call. But wouldn't it be cool if we could compile Typescript to WASM? Then we could get the benefits of both, without having to support two separate runtimes in the derive pipeline. It turns out that Assemblyscript (also referred to as 'AS') has made a lot of progress in the last few years, and maybe it could allow us to use WASM instead of Deno. I'm honestly a little skeptical, but it seems worth exploring, so these are my notes on the subject.

Assemblyscript language differences

Assemblyscript is a different language from Typescript, even though they have a ton of overlap.
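
A few concrete examples of the sort of differences I mean, based on my (possibly outdated) understanding of AS. These are all valid Typescript but, as far as I know, would not compile as AS:

// Valid Typescript that (as I understand it) would not fly in Assemblyscript.

// 1. `any` and union types: AS wants concrete, statically known types everywhere.
function coerce(input: any): string | number {
  return typeof input === "string" ? input : 0;
}

// 2. Dynamic JSON: AS has no built-in JSON.parse producing untyped objects, so
//    documents would need declared classes or a third-party JSON library.
const doc = JSON.parse('{"a": 1}');

// 3. Numbers: AS works with WebAssembly's i32/i64/f32/f64 types rather than a
//    single catch-all `number`.
const count: number = 1;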

WASM Runtime integration

Using AS implies using WASM, so there are a lot of questions about how the runtime would integrate with Flow.
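
Just to make the shape of the problem concrete, here's one purely hypothetical host-side sketch (written in Typescript for familiarity, though in practice the host would be the Flow runtime rather than a script) of invoking a WASM transform by copying a JSON document into linear memory. The exported names, the pointer/length calling convention, and the artifact path are all assumptions:

// Purely hypothetical sketch; exported names (alloc, transform), the calling
// convention, and the artifact path are assumptions, not a proposed interface.
const wasmBytes = await Deno.readFile("./derivation.wasm");
const { instance } = await WebAssembly.instantiate(wasmBytes, {});
const { memory, alloc, transform } = instance.exports as {
  memory: WebAssembly.Memory;
  alloc: (len: number) => number; // assumed exported allocator
  transform: (ptr: number, len: number) => number; // assumed entry point
};

// Copy a JSON-encoded source document into the module's linear memory...
const input = new TextEncoder().encode(JSON.stringify({ a: 1 }));
const ptr = alloc(input.length);
new Uint8Array(memory.buffer, ptr, input.length).set(input);

// ...and invoke the transform, which would hand results back through some
// agreed-upon convention (e.g. a pointer to a length-prefixed output buffer).
const resultPtr = transform(ptr, input.length);
console.log("result buffer at", resultPtr);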

Other thoughts

I think that Assemblyscript might actually be a better language for transforms than Typescript, at least in some respects. But there are still a lot of unknowns there, and there are also still reasons to support Deno as an execution environment. In the end, I think Deno is probably still the best next step for us, and it certainly doesn't preclude supporting a standalone WASM runtime in the future.

dhammika commented 1 year ago

Nice write-up! Can you give some pointers on how Node is invoked in the current setup? https://github.com/estuary/flow/blob/2680cfae863e563db5efce8c5040d5c4084d564c/go/flow/js_worker.go#L23-L62 With Deno, it sounds like we're thinking more of an FFI-style, tightly coupled, in-process invocation model.

psFried commented 1 year ago

This no longer seems warranted to put on our roadmap.