User code interfaces meta-issue

We have several ongoing discussions that involve specifications of interfaces between Flow and code that is (at least potentially) user-provided. I'd like to consolidate the discussion into a single issue. The use cases so far are Materializations, Captures (possibly including parsers?), and of course Update and Publish Lambdas. There have been prior discussions in #78, #82, #70, and on Slack.

It would be nice (though not required) if we could use the same mechanisms for all these interactions between Flow and what I'll broadly classify as "user code". But update and publish lambdas are by far the most important, so we should focus primarily on those in the design.

The goals are:

Allow users to write their code in a variety of popular languages, and have the same level of support for all of them.
Allow users to continue using whatever build tools they're already familiar with. Building a flow catalog should not impose undue restrictions on how users build their code.
Invoking their code has got to be fast. We can probably tolerate going through a unix socket, but we can't require that every call has to cross an actual network.
We need to be very developer friendly, especially when it comes to update/publish lambdas. This means minimizing the amount of ceremony required to create a new lambda and wire it in.
Secure and Isolated. This is of course critical for any sort of multi-tenant cluster, but it's also not something we can ignore for single-tenant clusters, since we wouldn't want one misbehaving lambda to screw up an unrelated process that happens to be running on the same machine.

So far, we've uncovered several main aspects of this problem domain.

Packaging and Running

The first thing is that Flow needs to define what user code is concretely. Flow needs to be able to control the lifecycle of user code and communicate with it to invoke functions in a language-agnostic way. I've been thinking about this as mostly decoupled from the concerns of an interface definition language like openapi or grpc. Regardless of which IDL we use, we'll still need a way to move around bundles of user code and start and stop them.

Containers are a popular solution for this. They also have the benefit that many developers are already familiar with the basics of how to build them. However, running untrusted containers in a multi-tenant kubernetes cluster is, at best, extremely difficult to get right. Additionally, the user code within the container will have to be in charge of setting up some sort of server to listen on a socket. That's not a problem by itself, but it may require that we generate some code beyond just function stubs to configure the server and start listening.

WebAssembly is another packaging option that's worth considering because it makes it easy to run untrusted code securely. Wasm is still really new, though, and so is the tooling for building wasm modules, which could make things seem more difficult to users who already know how to build docker images. And not all languages can compile to wasm, so this may come with some limitations.

In either case, an important consideration is the proliferation of separate user code packages. Say for example we decided to build a separate package for every lambda of every derivation. The overhead of all those separate language runtimes and libraries loaded could be enormous. Anecdotally, running /usr/bin/time -v nodejs -e 'console.log("what a waste")' shows a max memory usage of ~34MB, and that's without loading a single library. And heaven help us if anyone tries to write in a JVM language. So it's pretty clear that we won't want to run separate packages for every single lambda. But what's the right relationship between Flow concepts and packages of user code? One package per Collection? Per flow.yaml file? Should we make it flexible and let users decide? I have a few ideas here, but nothing well formed enough to be worth sharing yet.
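To make the trade-off concrete, here is a minimal sketch of the alternative to one-package-per-lambda: a single process that hosts many lambdas behind one dispatch table, so the runtime overhead is paid once. Everything here is hypothetical (the `LambdaHost` class, the `Doc` and `Lambda` types, and the `orders/...` names are invented for illustration), and it says nothing about what the right grouping of lambdas into packages should be.

```typescript
// A lambda takes a JSON-like document and returns zero or more documents.
// This shape is an assumption made for the sketch, not Flow's actual interface.
type Doc = Record<string, unknown>;
type Lambda = (source: Doc) => Doc[];

// Hypothetical host: one process, one runtime, many registered lambdas.
class LambdaHost {
  private lambdas = new Map<string, Lambda>();

  // Register a lambda under a unique name, e.g. "<derivation>/<lambda>".
  register(name: string, fn: Lambda): void {
    if (this.lambdas.has(name)) {
      throw new Error(`duplicate lambda: ${name}`);
    }
    this.lambdas.set(name, fn);
  }

  // Look up a lambda by name and invoke it with the given document.
  invoke(name: string, source: Doc): Doc[] {
    const fn = this.lambdas.get(name);
    if (fn === undefined) {
      throw new Error(`unknown lambda: ${name}`);
    }
    return fn(source);
  }

  // Number of lambdas sharing this process.
  size(): number {
    return this.lambdas.size;
  }
}

// Two lambdas sharing a single runtime instead of two separate packages.
const host = new LambdaHost();
host.register("orders/publishFromFoo", (doc) => [{ ...doc, seen: true }]);
host.register("orders/updateFromFoo", () => []);
```

With something like this, the unit of packaging (per Collection, per flow.yaml, or user-defined) only determines which lambdas get registered into the same host.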

Generating Stubs

The second thing is that we need to be able to generate code so that all the user has to do is open the right file and start typing in their lambda code. This is clearly related to some of the concerns about packaging and running, but there's a lot that's separate. The main idea here is to use a language agnostic interface definition language like protobufs or openapi to generate server stubs for each individual language. The output being of course the type definitions, along with function stubs like:

function publishFromFoo(source: MyType, register: MyRegister, previous: MyRegister) { 
    // TODO: write your code here
}

We may also generate additional boilerplate code for setting up the server and listening on a unix socket, or for building the user code into a runnable artifact. This stuff could also be separate, though, since it's typically only needed once, whereas types and interfaces are likely to evolve more rapidly as the collections do.
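As a rough illustration of how the two kinds of generated code might fit together, here is a sketch with the stub above filled in, plus the sort of one-time server boilerplate that could listen on a unix socket. The field names on `MyType` and `MyRegister`, the `Published` return type, the `serve` helper, and the JSON request shape are all assumptions for the example; the stub in this issue doesn't specify any of them.

```typescript
import * as http from "node:http";

// Hypothetical generated types; the field names are invented for illustration.
interface MyType { userId: string; amount: number; }
interface MyRegister { total: number; }
interface Published { userId: string; runningTotal: number; }

// The stub, filled in by the user. Returning a list of documents to publish
// is an assumption; the original stub has no declared return type.
function publishFromFoo(source: MyType, register: MyRegister, previous: MyRegister): Published[] {
  // Only publish when the register actually changed.
  if (register.total === previous.total) {
    return [];
  }
  return [{ userId: source.userId, runningTotal: register.total }];
}

// One-time boilerplate: accept JSON invocations over a unix domain socket.
// The request body shape ({ source, register, previous }) is hypothetical.
function serve(socketPath: string): http.Server {
  const server = http.createServer((req, res) => {
    const chunks: Buffer[] = [];
    req.on("data", (chunk: Buffer) => chunks.push(chunk));
    req.on("end", () => {
      try {
        const body = JSON.parse(Buffer.concat(chunks).toString("utf8"));
        const out = publishFromFoo(body.source, body.register, body.previous);
        res.writeHead(200, { "Content-Type": "application/json" });
        res.end(JSON.stringify(out));
      } catch (err) {
        res.writeHead(400, { "Content-Type": "text/plain" });
        res.end(String(err));
      }
    });
  });
  // Passing a filesystem path (rather than a port) makes Node listen on a unix socket.
  server.listen(socketPath);
  return server;
}
```

Only `publishFromFoo` and the types would need regenerating as collections evolve; something like `serve` would be written (or generated) once per project.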

Workflow

While writing this, I've become convinced that deciding on specific IDL and packaging formats isn't really getting at the heart of the issue. Both those issues fail to address the more basic concern of the overall workflow. Say a user starts with just a flow.yaml file and ends with a fully running catalog. What steps happen along the way? One possible sequence is:

Write flow.yaml
Run flowctl build to create catalog.
Run flowctl show openapi > myOpenApi.yaml to generate openapi yaml.
Run openapi-generator-cli generate -i myOpenApi.yaml --generator-name nodejs-express-server -o myLambdaDir to turn openapi yaml into stubs in project.
Fill in generated lambda stubs with implementations
Run some TBD build command to build your docker image, wasm module, whatever (waves hands)
Run flowctl apply to take your fully packaged code and start running it.

It's admittedly looking a little rough, especially the hand waving in the build step. Neither OpenAPI nor protobufs will generate code for building docker images for you, so we'll have to figure something else out for that. Code generation could possibly be broken down into multiple steps: first a one-time "project" generator that gives you a directory with build code included, and then a "types" generator that generates types and interfaces and is re-run whenever the flow.yaml is modified. There are a few other considerations and constraints that are worth mentioning:

Considering imported catalogs and how this works across organizational boundaries.
Considering evolution of collections and user code.
Minimizing the complexity for trivially simple lambdas and examples/demos.
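To make the "types" generator idea from the workflow discussion a bit more concrete, here is a toy sketch of a step that could be re-run whenever flow.yaml changes. It assumes collection schemas are available as a very small subset of JSON Schema (flat objects with scalar properties); `ObjectSchema` and `toInterface` are hypothetical names, and this is not a description of the existing typescript generator.

```typescript
// A deliberately tiny subset of JSON Schema: flat objects with scalar properties.
type ScalarType = "string" | "number" | "integer" | "boolean";

interface ObjectSchema {
  properties: Record<string, { type: ScalarType }>;
  required?: string[];
}

// Render a TypeScript interface declaration for the given schema.
// Properties not listed in `required` are emitted as optional.
function toInterface(name: string, schema: ObjectSchema): string {
  const required = new Set(schema.required ?? []);
  const lines = Object.entries(schema.properties).map(([key, prop]) => {
    // JSON Schema distinguishes integers; TypeScript only has `number`.
    const tsType = prop.type === "integer" ? "number" : prop.type;
    const optional = required.has(key) ? "" : "?";
    return `  ${key}${optional}: ${tsType};`;
  });
  return [`export interface ${name} {`, ...lines, "}"].join("\n");
}

// Example: regenerate the source type for a lambda after a schema change.
const generated = toInterface("MyType", {
  properties: { id: { type: "integer" }, note: { type: "string" } },
  required: ["id"],
});
```

Because the output depends only on the schema, a step like this can safely overwrite its own output file on every run, while the one-time "project" output (build scripts, server setup) stays under the user's control.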

First steps

In any case, there are a lot of unresolved questions related to the workflow, which may have implications for packaging and stub generation. We need to answer those in order to fully realize the dream of a maximally flexible and developer friendly design for interactions between Flow and user code, but I don't think we need to do it all right away. A good first step might be to start with just the nodejs runtime and typescript generator we already have, and experiment with the overall workflow using those. Once we feel comfortable with the workflow for simple examples, a good next step might be to figure out how to deal with imported catalogs and multiple packages of user code. Neither of those things requires any changes to how we run or invoke user code. Once we get those issues sorted out, we can likely make much more informed decisions on the rest of this.

We just talked about this on a VC, and I'll summarize here. Basically, we had agreement that making any major changes to packaging and code generation right now would be putting the cart before the horse. We can keep this discussion open to accumulate our learnings, until such time as we're ready to take on that work. Until then, we'd continue to only support the existing nodejs runtime with typescript, and make some incremental changes to the workflow to expose type/interface generation as a separate step, and experiment with that.

estuary / flow

User code interfaces meta-issue #84