Closed · psFried closed this issue 1 year ago
I've done some research and playing around with Deno and deno_core
and wanted to share some notes on how I see the major pieces fitting together.
In addition to executing the lambdas at runtime, there are a number of other things that we need to do during builds. Currently, those operations are handled by a mix of `npm` and `tsc`. Deno provides all of those operations in a single subcommand called `bundle`, which emits a single JavaScript file with all dependencies inlined. This would be stored in the build database with a `DENO_MODULE` type in the `resources` table. Integrating Deno as a library gives a much lower-level interface where all those operations are separate. I'd propose that, at least for starting out, we use the CLI for build-time operations. The Deno CLI would be added to our docker image, and the builds would shell out to `deno bundle` to produce a fully inlined JS file for each derivation.
Our current NPM modules are scoped to the build. That is, there is a single npm module that includes the code for all the derivations within a given build. As part of our move to Deno, we should change the scope of those so that we end up with a separate module (and a separate V8 isolate) for each derivation. Having a separate isolate per derivation allows for better isolation, as the name implies, and also allows multiple derivations from the same build to run in parallel on the same reactor.
The `deno_core` library is much more low-level than the CLI. Even things like `console.log` need to be explicitly configured. Native functionality is exposed to JS through `op`s, which are registered at runtime creation; each `op` is then exposed as a JS function that client code can call. I think the basic form of the integration would be an `Extension` that we implement. This extension would provide `op`s that allow JS code to read in documents from the runtime and send results back to the runtime. We would then have an automatically generated `main.ts` module that drives the execution of user-provided transform functions using those primitives. Something like this extremely hand-wavy illustrative example:
```typescript
while (true) {
  // Backed by an op: pull the next source document from the Rust runtime.
  const next_doc = Flow.getNextSourceDocument();
  // Resolve the user-provided update/publish lambda for this document.
  const userFunction = lookupUserFunction(next_doc.source_meta);
  const result = userFunction(next_doc.document);
  // Backed by an op: send the result back to the Rust runtime.
  Flow.sendResult(result);
}
```
A particular detail I'd like to point out is the `lookupUserFunction` function. We need some way to associate source documents with specific update/publish lambdas. There are lots of different ways we could do that, and I'm not suggesting any particular implementation just yet (and I don't know what exactly `source_meta` is). But some options that I see there are:

- `getNextSourceDocument`
- `next_doc`
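To make the lookup idea a bit more concrete, here's one hypothetical shape it could take. This assumes `source_meta` carries a transform name, which is purely an assumption on my part; the registry, `registerTransform`, and `lookupUserFunction` bodies below are illustrative sketches, not a proposed implementation.

```typescript
// Hypothetical sketch only: resolving a user lambda by transform name.
// Assumes source_meta carries a `transform` field; nothing here is settled.
type UserFunction = (doc: unknown) => unknown;

// Registry of user-provided update/publish lambdas, keyed by transform name.
const transforms = new Map<string, UserFunction>();

function registerTransform(name: string, fn: UserFunction): void {
  transforms.set(name, fn);
}

function lookupUserFunction(sourceMeta: { transform: string }): UserFunction {
  const fn = transforms.get(sourceMeta.transform);
  if (fn === undefined) {
    throw new Error(`no lambda registered for transform '${sourceMeta.transform}'`);
  }
  return fn;
}
```

The generated `main.ts` would register each user function at startup, so the driver loop only ever does a map lookup per document.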
Some other things to note about runtime integration:

- There is some overhead to each `op` call (though I suspect it's tiny). I say this only to point out that there could be some benefit to batching documents as they transit the Rust<=>V8 boundary, and thus the actual API might change a bit.
- We could provide `console.log` and `console.error` implementations that directly forward to our log publisher.
- The `main.js` has to be generated; this could be done at runtime instead of build time, though I don't necessarily think that's what we should do.

In terms of the migration, I think it makes sense to have the `flowctl-go api` commands start to build both Deno and NPM modules, and shove them both into the build database. This allows us to choose between the two runtimes at runtime :grin: We could then roll out the change in the runtime by adding either a command line flag or an env variable, which would be pretty easy to roll back. We could even do this on a per-shard basis if we want to be extra careful.
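Circling back to the batching note above: a batched variant of the driver loop might look something like the sketch below. The `FlowHost` interface and its two batched ops are made-up names standing in for whatever the extension actually exposes.

```typescript
// Hand-wavy sketch of a batched driver loop: amortizes per-op overhead by
// moving arrays of documents across the Rust<=>V8 boundary, rather than one
// document per op call. All names here are hypothetical.
interface SourceDocument {
  source_meta: { transform: string };
  document: unknown;
}

interface FlowHost {
  // Would be backed by a Rust op; returns up to `max` documents, empty when done.
  getNextSourceDocuments(max: number): SourceDocument[];
  // Would be backed by a Rust op; sends a whole batch of results back.
  sendResults(results: unknown[]): void;
}

function driveBatched(
  flow: FlowHost,
  lookup: (meta: { transform: string }) => (doc: unknown) => unknown,
): void {
  for (;;) {
    const batch = flow.getNextSourceDocuments(128);
    if (batch.length === 0) break; // no more input
    // One op round-trip per batch rather than per document.
    flow.sendResults(batch.map((d) => lookup(d.source_meta)(d.document)));
  }
}
```

The batch size of 128 is arbitrary; the right number would fall out of measuring the actual op-call overhead.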
In terms of the nitty gritty, I imagine that the choice of whether to use Deno or NPM would be communicated to the derive pipeline when it's configured. The derive pipeline would then create either HTTP or Deno invocations as appropriate. This is admittedly a little weird, but it allows the decision to be deferred until runtime, which I think is desirable. Once all derivations are switched over and happy, we can then make Deno the default and remove support for Node altogether.
Great write up, thank you. Your notes and next steps all make sense to me.
We could potentially also use a separate "realm" ("context" in V8 terminology) for each lambda within an isolate, to guard against users who may try to update some global JS state in one lambda invocation and then read it in another.
If we did this, then we'd presumably have a `main.ts` per lambda and thus not have a lookup step, correct?
We may have our hands forced some by the architecture of `Pipeline`, which does model invocations of update and publish lambdas as fully independent and concurrent executions. Having an async function / generator thing per lambda may be simpler to integrate and reduce coupling, but :shrug: we'll see. The existing `Block`-centric processing model may also make it desirable to go straight to vectorized execution.
> Having an async function / generator thing per lambda may be simpler to integrate and reduce coupling, but :shrug: we'll see.
Yeah, that's my thought, too.
And also agreed that the `Block`s seem to afford passing data in equivalent chunks.
> If we did this, then we'd presumably have a `main.ts` per lambda and thus not have a lookup step, correct?
Yeah, I think that's roughly how it would work. I'm not sure if we'd still need a single "main" `main.js` that ties them all together, though. It'll take some testing and research to figure out the proper way to drive multiple async scripts simultaneously, but the naive approach that I'd try first would be to simply skip the call to `load_main_module` and instead try driving each as a separate "side module".
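As a purely illustrative sketch of that naive approach (all names below are placeholders, not a real `deno_core` API): each lambda gets its own async loop, and the host drives every loop concurrently instead of loading one main module.

```typescript
// Very rough sketch of the "side module per lambda" idea. runLambdaLoop and
// its parameters are hypothetical stand-ins for whatever a per-lambda side
// module would actually export.
type Lambda = (doc: unknown) => Promise<unknown>;

async function runLambdaLoop(
  name: string,
  lambda: Lambda,
  input: AsyncIterable<unknown>,
  emit: (name: string, result: unknown) => void,
): Promise<void> {
  // Pull documents for this one lambda until its input is exhausted.
  for await (const doc of input) {
    emit(name, await lambda(doc));
  }
}

// The host would then drive all per-lambda loops to completion concurrently,
// e.g. `await Promise.all(loops)`, rather than calling load_main_module.
```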
I had a conversation with Joseph the other day, where he brought up WebAssembly for derivations. WASM is pretty neat, and we've always felt that it seems like a good fit for Flow derivations (and even custom reduction functions). But it didn't really feel like a good "first" runtime for derivations, for a few reasons. The ecosystem was very new, and the tooling very immature. And the languages that seemed most appealing (Python, Typescript, SQL, etc.) didn't really have a good way to compile to WASM. We really wanted a statically typed language so that we could do end-to-end type checking of data pipelines, and so Typescript was a pretty solid choice. And Deno gives us a great way to have sandboxed execution. But WASM still seems pretty appealing for a few reasons:
Basically, there are good reasons to support both Typescript and WASM, but Typescript won out, which was probably the right call. But wouldn't it be cool if we could compile Typescript to WASM? Then we could get the benefits of both, without having to support two separate runtimes in the derive pipeline. It turns out that AssemblyScript (also referred to as 'AS') has made a lot of progress in the last few years, and maybe it could allow us to use WASM instead of Deno. I'm honestly a little skeptical, but it seems worth exploring, so these are my notes on the subject.
AssemblyScript is a different language than Typescript, even though they have a ton of overlap:

- AS doesn't support the `any` keyword. At a minimum, our Typescript generation would need to be updated to accommodate this, since it needs to do something when the JSON schema allows multiple types.
- The `as-json` package, which seems to be the go-to JSON library, doesn't have support for dynamic JSON values.
- There are differences around the `number` type.

Using AS implies using WASM, so there are a lot of questions about how the runtime would integrate with Flow.
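To illustrate the `any`/union-type point: a JSON schema that allows multiple types maps naturally to a Typescript union, which has no direct AssemblyScript equivalent. The example below is illustrative only, not actual generated output.

```typescript
// Illustrative only — not actual Flow-generated code. A schema like
// {"type": ["string", "integer"]} maps to a Typescript union, which
// AssemblyScript cannot express (no union types, no `any`).
type StringOrInt = string | number;

function describe(value: StringOrInt): string {
  // A runtime typeof check narrows the union — another pattern AS lacks.
  return typeof value === "string" ? `string: ${value}` : `integer: ${value}`;
}
```

Any AS-based codegen would have to represent such schemas differently, e.g. with a tagged wrapper class per allowed type.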
I think that AssemblyScript might actually be a better language for transforms than Typescript, at least in some respects. But there are still a lot of unknowns there, and also still reasons to support Deno as an execution environment. In the end, I think Deno is probably still the best next step for us, and it certainly doesn't preclude supporting a standalone WASM runtime in the future.
Nice write up! Can you give some pointers on how Node is invoked in the current setup? https://github.com/estuary/flow/blob/2680cfae863e563db5efce8c5040d5c4084d564c/go/flow/js_worker.go#L23-L62 With Deno, it sounds more like we're thinking of an FFI-style, tightly coupled, in-process invocation model.
This no longer seems warranted to put on our roadmap.
The goal is to use Deno instead of NodeJS for executing transform lambdas. The primary motivation for this is security and isolation. The current NodeJS implementation is difficult to sandbox, and Deno has a much better story around that.