estuary / flow

🌊 Continuously synchronize the systems where your data lives, to the systems where you _want_ it to live, with Estuary Flow. 🌊
https://estuary.dev

User code interfaces meta-issue #84

Closed psFried closed 3 years ago

psFried commented 3 years ago

We have several ongoing discussions that involve specifications of interfaces between Flow and code that is (at least potentially) user-provided. I'd like to consolidate the discussion into a single issue. The use cases so far are Materializations, Captures (possibly including parsers?), and of course Update and Publish Lambdas. There have been prior discussions in #78, #82, #70, and on Slack.

It would be nice (though not required) if we could use the same mechanisms for all these interactions between Flow and what I'll broadly classify as "user code". But update and publish lambdas are by far the most important, so we should focus primarily on those in the design.

The goals are:

So far, we've uncovered several main aspects of this problem domain.

Packaging and Running

The first thing is that Flow needs to define what user code is concretely. Flow needs to be able to control the lifecycle of user code and communicate with it to invoke functions in a language-agnostic way. I've been thinking about this as mostly decoupled from the concerns of an interface definition language like OpenAPI or gRPC. Regardless of which IDL we use, we'll still need a way to move around bundles of user code and start and stop them.

Containers are a popular solution for this. They also have the benefit that many developers are already familiar with the basics of how to build them. However, running untrusted containers in a multi-tenant Kubernetes cluster is, at best, extremely difficult to get right. Additionally, the user code within the container will have to be in charge of setting up some sort of server to listen on a socket. That's not a problem by itself, but it may require that we generate some code beyond just function stubs to configure the server and start listening.

WebAssembly is another packaging option that's worth considering because it makes it easy to run untrusted code securely. Wasm is still really new, though, and so is the tooling for building Wasm modules, which could make things seem more difficult to users who already know how to build Docker images. And not all languages can compile to Wasm, so this may come with some limitations.

In either case, an important consideration is the proliferation of separate user code packages. Say for example we decided to build a separate package for every lambda of every derivation. The overhead of all those separate language runtimes and libraries loaded could be enormous. Anecdotally, running /usr/bin/time -v nodejs -e 'console.log("what a waste")' shows a max memory usage of ~34MB, and that's without loading a single library. And heaven help us if anyone tries to write in a JVM language. So it's pretty clear that we won't want to run separate packages for every single lambda. But what's the right relationship between Flow concepts and packages of user code? One package per Collection? Per flow.yaml file? Should we make it flexible and let users decide? I have a few ideas here, but nothing well formed enough to be worth sharing yet.

Generating Stubs

The second thing is that we need to be able to generate code so that all the user has to do is open the right file and start typing in their lambda code. This is clearly related to some of the concerns about packaging and running, but there's a lot that's separate. The main idea here is to use a language-agnostic interface definition language like protobufs or OpenAPI to generate server stubs for each individual language. The output would of course be the type definitions, along with function stubs like:

function publishFromFoo(source: MyType, register: MyRegister, previous: MyRegister) { 
    // TODO: write your code here
}
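To make the stub above more concrete, here is a minimal sketch of what the generated types and a filled-in lambda might look like. The fields of `MyType` and `MyRegister` and the shape of the return value are assumptions made up for illustration; nothing in this issue defines them.

```typescript
// Hypothetical generated types. In practice these would be derived from the
// collection and register schemas rather than written by hand.
interface MyType {
  userId: string;
  amount: number;
}

interface MyRegister {
  total: number;
}

// Hypothetical output document shape, also assumed for this sketch.
interface Published {
  userId: string;
  total: number;
  delta: number;
}

// The generated stub, with the TODO replaced by user code. Returning an array
// of documents is an assumption of this sketch.
function publishFromFoo(
  source: MyType,
  register: MyRegister,
  previous: MyRegister,
): Published[] {
  return [
    {
      userId: source.userId,
      total: register.total,
      delta: register.total - previous.total,
    },
  ];
}

const published = publishFromFoo(
  { userId: "u1", amount: 5 },
  { total: 15 },
  { total: 10 },
);
console.log(published);
```

The point is only that the user edits the function body while the types and signature come from the generator.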

We may also generate additional boilerplate code for setting up the server and listening on a unix socket, or for building the user code into a runnable artifact. This stuff could also be separate, though, since it's typically only needed once, whereas types and interfaces are likely to evolve more rapidly as the collections do.
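As a rough sketch of that server boilerplate, assuming a Node.js runtime: the code below keeps a registry of lambdas and serves them over HTTP on a Unix domain socket. The route layout (`POST /<lambdaName>` with a JSON array of arguments), the `lambdas` registry, and the `dispatch` and `startServer` names are all hypothetical and are not Flow's actual protocol.

```typescript
import * as http from "node:http";

// A lambda takes JSON-decoded arguments and returns a JSON-encodable value.
type Lambda = (...args: unknown[]) => unknown;

// Hypothetical registry mapping lambda names to user functions.
const lambdas: Record<string, Lambda> = {
  publishFromFoo: (source, register, previous) => [{ source, register, previous }],
};

// Look up a lambda by name and invoke it with the decoded arguments.
function dispatch(name: string, args: unknown[]): unknown {
  const fn = lambdas[name];
  if (fn === undefined) {
    throw new Error(`unknown lambda: ${name}`);
  }
  return fn(...args);
}

// Serve POST /<lambdaName>, where the body is a JSON array of arguments.
function startServer(socketPath: string): http.Server {
  const server = http.createServer((req, res) => {
    let body = "";
    req.setEncoding("utf8");
    req.on("data", (chunk: string) => {
      body += chunk;
    });
    req.on("end", () => {
      try {
        const name = (req.url ?? "/").slice(1);
        const args = JSON.parse(body) as unknown[];
        const result = JSON.stringify(dispatch(name, args));
        res.writeHead(200, { "content-type": "application/json" });
        res.end(result);
      } catch (err) {
        // Bad JSON, unknown lambda, or a lambda that threw.
        res.writeHead(400, { "content-type": "text/plain" });
        res.end(String(err));
      }
    });
  });
  // Passing a filesystem path makes Node listen on a Unix domain socket.
  server.listen(socketPath);
  return server;
}
```

Calling `startServer("/tmp/flow-lambdas.sock")` (an example path) would start listening. Because only the `lambdas` registry changes as collections evolve, the server setup could be generated once and left alone.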

Workflow

While writing this, I've become convinced that deciding on specific IDL and packaging formats isn't really getting at the heart of the issue. Both those issues fail to address the more basic concern of the overall workflow. Say a user starts with just a flow.yaml file and ends with a fully running catalog. What steps happen along the way? One possible sequence is:

It's admittedly looking a little rough, especially the bit of hand waving in the build step. Neither OpenAPI nor protobufs will generate code for building Docker images for you, so we'll have to figure something else out for that. Code generation could possibly be broken down into multiple steps: first, a one-time "project" generator that gives you a directory with build code included, and second, a "types" generator that generates types and interfaces and is re-run whenever the flow.yaml is modified. There are a few other considerations and constraints that are worth mentioning:

First steps

In any case, there are a lot of unresolved questions related to the workflow, which may have implications for packaging and stub generation. We need to answer those in order to fully realize the dream of a maximally flexible and developer-friendly design for interactions between Flow and user code, but I don't think we need to do it all right away. A good first step might be to start with just the nodejs runtime and typescript generator we already have, and experiment with the overall workflow using those. Once we feel comfortable with the workflow for simple examples, a good next step might be to figure out how to deal with imported catalogs and multiple packages of user code. Neither of those things requires any changes to how we run or invoke user code. Once we get those issues sorted out, we can likely make much more informed decisions on the rest of this.

We just talked about this on a VC, and I'll summarize here. Basically, we had agreement that making any major changes to packaging and code generation right now would be putting the cart before the horse. We can keep this discussion open to accumulate our learnings, until such time as we're ready to take on that work. Until then, we'd continue to only support the existing nodejs runtime with typescript, and make some incremental changes to the workflow to expose type/interface generation as a separate step, and experiment with that.
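For illustration, a standalone type-generation step could look roughly like the sketch below, which renders a small subset of JSON Schema as a TypeScript type alias. This is not the existing generator: the `Schema` subset and the `renderType` and `generateType` helpers are assumptions made for this example.

```typescript
// The minimal subset of JSON Schema this sketch understands.
type Schema =
  | { type: "string" | "number" | "integer" | "boolean" }
  | { type: "array"; items: Schema }
  | { type: "object"; properties: Record<string, Schema>; required?: string[] };

// Render a schema as a TypeScript type expression.
function renderType(schema: Schema): string {
  switch (schema.type) {
    case "string":
    case "boolean":
      return schema.type;
    case "number":
    case "integer":
      return "number";
    case "array":
      return `Array<${renderType(schema.items)}>`;
    case "object": {
      const required = new Set(schema.required ?? []);
      // Properties not listed in `required` become optional fields.
      const fields = Object.entries(schema.properties).map(
        ([name, prop]) => `${name}${required.has(name) ? "" : "?"}: ${renderType(prop)};`,
      );
      return `{ ${fields.join(" ")} }`;
    }
  }
}

// Emit a named, exported type alias for a document schema.
function generateType(name: string, schema: Schema): string {
  return `export type ${name} = ${renderType(schema)};`;
}

const generated = generateType("MyType", {
  type: "object",
  properties: { userId: { type: "string" }, amount: { type: "number" } },
  required: ["userId"],
});
console.log(generated);
// prints: export type MyType = { userId: string; amount?: number; };
```

Since a step like this only reads schemas and writes type definitions, it can be re-run whenever the flow.yaml changes without touching user-written lambda bodies.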

jgraettinger commented 3 years ago

A lot has changed and solidified in terms of development ergonomics and workflow since this issue was opened. Phil, are there outstanding actionable bits from this? Otherwise we can keep talking, but I'd like to close the issue.

psFried commented 3 years ago

Nothing actionable here now that stub generation is taken care of. Closing.