denoland / deploy_feedback

For reporting issues with Deno Deploy
https://deno.com/deploy

Durable Objects aka Coordinators #88

Open vwkd opened 3 years ago

vwkd commented 3 years ago

What?

Currently, it’s not possible to store data in a Worker (pending name #105). #110 will allow storing data with eventual consistency. Often it’s desirable to store data in a Worker with strong consistency, as for a read-write database. This is part of an even broader use case: coordination. For example, broadcasting a user's message from one Worker instance to all users on all other instances, as in a real-time chat or collaborative document editing. Currently, Worker instances cannot coordinate. Note that sending data directly between the instances (e.g. BroadcastChannel) wouldn’t work, because for simultaneous requests to different instances the resulting order of requests could diverge between the instances.

For coordination, there needs to be a single instance that is the sole point of synchronization. Currently, the only way is to leave the edge and use a third-party backend such as Fauna. But what if we could do all of this while staying on the edge? Enter "Durable Objects"! Uh oh... Sounds like random words pulled out of a hat! For the rest of this proposal I'll use the name "Coordinator". It could have been anything that acts as a central hub, like "Distributor", "Manager", etc.

A Coordinator is a FaaS just like a Worker. The difference is that a Coordinator creates only a single instance where a Worker creates multiple, and this single Coordinator instance is accessible only by other Worker / Coordinator instances from the same account, not from the open Internet. Notice that a Coordinator and a Worker differ only in how they are deployed: a Coordinator is just a Worker that’s deployed differently! (A Coordinator isn’t a “Stateful Worker” since a Worker can hold state as well (pending #110) [^statefulworker].)

Think of the edge like a world-wide call center. Many employees sit around the world, and each employee answers the phone calls from the closest users. A person can talk to only one other person at a time, although they may have multiple calls on hold. Each employee has their own unique order of calls. There is a single manager for the whole call center. Only the employees can call the manager; the users cannot call the manager directly. Now the employees decide to call the manager after each call from a user, so the manager receives all the calls. The manager can now report its order of the calls back to each employee, so that all employees agree on this one order of calls. Employees and managers are exactly the same: they are people, they are in an office, they talk on the phone, they speak the same language, etc. The only difference is how many there are and to whom they can talk. (employees = multiple Worker instances, manager = single Coordinator instance, people = FaaS).
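The call-center picture can be sketched as a toy in-memory model. This is only an illustration, not Deno Deploy code: `createCoordinator` and `submit` are hypothetical names, and the "instances" are plain arrays.

```javascript
// Toy model of the ordering problem and of the single point of synchronization.
// All names here are hypothetical and exist only for illustration.

function createCoordinator() {
  const log = []; // the one authoritative order of messages
  return {
    // Called by Worker instances; returns the position assigned to the message.
    submit(message) {
      log.push(message);
      return log.length - 1;
    },
    // Every instance reads back the same order.
    order() {
      return [...log];
    },
  };
}

// Without coordination: each instance sees its own user's message first
// and the peer's broadcast second, so the local orders differ.
const instanceA = ["a", "b"];
const instanceB = ["b", "a"];
const diverged = instanceA.join() !== instanceB.join();

// With coordination: both instances forward to the Coordinator
// and adopt the order it reports back.
const coordinator = createCoordinator();
coordinator.submit("a"); // forwarded by instance A
coordinator.submit("b"); // forwarded by instance B
const viewA = coordinator.order();
const viewB = coordinator.order();

console.log({ diverged, agreed: viewA.join() === viewB.join() });
```

Whichever message reaches the Coordinator first wins the earlier position. The point is only that there is exactly one such order.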

Coordinators make it possible to use a FaaS for coordination and strongly consistent storage, for which a separate server or service was needed until now. Coordinators are the missing piece in a serverless platform [^missingpiece]. I hope Deno Deploy can provide them as part of a modern browser-compatible serverless platform.

How?

A Coordinator is a Worker that’s deployed differently (single location, not publicly accessible). Therefore, the runtime can be identical and no new runtime APIs are necessary. Same storage API #110. Everything's the same! A Coordinator can be written just like a Worker, the only difference being how the project is deployed in the CLI / UI.

This might seem obvious to you by now, but it wasn’t obvious to CF. They made one fatal design decision which is the single origin of all pain points with Durable Objects. Instead of using a separate project for a Durable Object, just like for another Worker, they overload a Worker project to also define a Durable Object. Now one project creates two different deployments. The fact that this doesn’t make logical sense - the code defining the Worker runs in multiple instances while the code defining the Durable Object runs in a single other instance - is the smallest issue. Because there aren't two separate projects, classes and configuration files are now needed to identify the Coordinator code within the Worker code. New runtime APIs are needed to create and access the Coordinator from the Worker. To make the inconsistent consistent, a Worker must be a (Node) ES module (.mjs file extension…) that exports an object. The distinction between Coordinator and Worker made its way into the code. All of this is unnecessary if you think of a Coordinator as a Worker that’s deployed differently. It took me some time to understand Durable Objects. CF didn't do a particularly great job of explaining them. By now I actually believe they themselves can't see the forest for the trees. The name “Durable Objects” reflects this.

Here’s how Deno Deploy can do it. A Coordinator project is created in the UI much like a Worker project. It is deployed only to a single location instead of multiple. It has a URL <project>.deno.dev just like a Worker. But this URL doesn't work on the public Web, i.e. the Deno Deploy server doesn't serve anything there. This URL works only from within other instances using the Fetch API. For this, the runtime intercepts all fetch requests and routes any request to a Coordinator internally. It looks like fetch requests to a Coordinator go out to the open Internet, but they never leave the Deno Deploy world [^sveltekit]. No new API necessary! The same old Fetch API! Just like two Workers can talk to each other. No config file, no bindings! Note that a config file doesn’t increase security since whoever writes the code can also write a config file [^config].

This makes fetch requests to a Coordinator (Worker-to-Coordinator, Coordinator-to-Coordinator) fast since they (necessarily) stay in the Deno Deploy world. Meanwhile, fetch requests to a Worker (Coordinator-to-Worker, Worker-to-Worker) are slow because they go out to the open Internet just to come back to the Deno Deploy world. It would make sense to route fetch requests to a Worker internally as well, even though the Internet would route them successfully. This would also make the interception simpler, since the runtime no longer needs to distinguish between fetch requests to a Coordinator and to a Worker. It can route any fetch request to *.deno.dev internally. Custom domains might complicate this slightly, since they also require a lookup in a domain table that the instances keep in sync. CF similarly routes fetch requests internally for Coordinator-to-Worker as part of their runtime API, and also for Worker-to-Worker as part of yet another new Service Bindings API [^servicebindings] since they don’t intercept the Fetch API in a Worker.
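The routing decision described above could look roughly like the following sketch. It assumes the runtime sees the request URL before anything leaves the instance. `resolveInternalProject` and the `customDomains` table (with its `example.com` entry) are hypothetical stand-ins, not an existing Deno Deploy API.

```javascript
// Hypothetical sketch of the decision an intercepted fetch would make:
// "does this URL belong to a project inside the platform, and if so, which one?"

// Stand-in for the domain table that instances keep in sync (assumed contents).
const customDomains = new Map([["example.com", "afraid-sheep-21"]]);

const PLATFORM_SUFFIX = ".deno.dev";

function resolveInternalProject(input) {
  const { hostname } = new URL(input);
  if (hostname.endsWith(PLATFORM_SUFFIX)) {
    // The project name is the label directly left of "deno.dev".
    const labels = hostname.slice(0, -PLATFORM_SUFFIX.length).split(".");
    return labels[labels.length - 1];
  }
  // Custom domains need a table lookup.
  // `undefined` means: not ours, send the request to the open Internet.
  return customDomains.get(hostname);
}

console.log(resolveInternalProject("https://difficult-butterfly-42.deno.dev/increment"));
console.log(resolveInternalProject("https://example.org/"));
```

With a single rule for *.deno.dev, the same function covers requests to Workers and to Coordinators, which is the simplification argued for above.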

There is one use case of Coordinators that I’ve glossed over until now to keep things simple: using a separate Coordinator for a small logical chunk of data [^logicalchunk]. For example, for rate limiting an IP there can be one Coordinator per IP that keeps track of the number of requests from that IP, for collaborative document editing there can be one Coordinator per document, for user management there can be one Coordinator per account, etc. For this, it's necessary to create many duplicate Coordinators that run identical code. A duplicate Coordinator could be created by manually creating a separate project in advance which just happens to use the identical code from the same Git repository, just like a duplicate Worker can be created. But this won’t scale to many. There needs to be a way to create a duplicate Coordinator on-the-fly at runtime without having to create it manually in advance. Likely this is where the object-oriented programmers at CF thought they needed to use classes, since that's how they duplicate code. This also explains the name [^naming]. But there don’t need to be classes if the entire code is duplicated! Just like a Worker - whose entire code is duplicated across many instances - doesn't need to use classes either.

Here’s how Deno Deploy can do it. For a Coordinator project, one default Coordinator with internal URL <project>.deno.dev is created on deployment. During runtime of a Worker / Coordinator instance, on the very first fetch request to a subdomain <instance>.<project>.deno.dev , a duplicate Coordinator is created with that internal URL on-the-fly. Notice that the path can’t be used since that’s the Coordinator’s / Worker’s own API. The identifier <instance> is the named ID chosen by the programmer. No classes, no IDs. No new API necessary! The same old URLs. CF doesn't have a default Coordinator because they are limited by using the runtime API and so can only offer duplicate Coordinators.
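As a rough sketch of how a Worker might address a duplicate Coordinator under this scheme, the helper below builds the `<instance>.<project>.deno.dev` URL from some runtime data such as an IP address. `coordinatorUrl` is a hypothetical helper, and the way it turns arbitrary strings into a DNS label is an assumption of this sketch, not part of the proposal.

```javascript
// Hypothetical helper: build the internal URL of a duplicate Coordinator.
function coordinatorUrl(project, instance, pathname = "/") {
  // A hostname label may contain only letters, digits and hyphens,
  // so other characters (e.g. the dots of an IPv4 address) are replaced.
  const label = instance.toLowerCase().replace(/[^a-z0-9-]/g, "-");
  return new URL(pathname, `https://${label}.${project}.deno.dev/`);
}

// One Coordinator per client IP, e.g. for rate limiting.
const url = coordinatorUrl("difficult-butterfly-42", "203.0.113.7", "/increment");
console.log(url.href);

// Inside a Worker the request would then simply be: await fetch(url)
```

Note that this naive replacement maps different inputs such as "a.b" and "a-b" to the same label. A real scheme would need an encoding without such collisions.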

Details

More details on this design.

Internal routing

For a request to a Coordinator, the runtime intercepts it and routes it within the Deno Deploy world to the Coordinator instance. The runtime could keep a table of only those Coordinator identifiers whose instances it runs itself, which it can then hook up locally to the requests. But if an identifier isn't in the table, it would need to ask every other location around the globe until it finds the one that runs the instance (if any), which isn't performant. To avoid this, the table should likely contain the identifiers and locations of all Coordinators everywhere, so that the runtime can send a request directly to the right location. This table is synchronised between the locations whenever a Coordinator is created / deleted (creating a Coordinator project in the UI adds the default Coordinator entry, creating a duplicate Coordinator on-the-fly adds a duplicate Coordinator entry, deleting a duplicate Coordinator within a Coordinator project in the UI deletes that duplicate Coordinator entry, and deleting a whole Coordinator project deletes the entries for the default and all duplicate Coordinators). Also, CF uses a direct link without touching the slow open Internet thanks to having their own network, but this is likely impossible for Deno Deploy.
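The four table operations listed above can be modelled with a plain map from identifier to location. This is a toy model under stated assumptions: `createRoutingTable`, the identifier format (`<instance>.<project>`) and the location codes are hypothetical, and the synchronisation between locations is not modelled at all.

```javascript
// Toy model of the routing table: Coordinator identifier -> location.
function createRoutingTable() {
  const entries = new Map();
  return {
    // Project creation adds the default entry; an on-the-fly duplicate adds its own entry.
    add(identifier, location) {
      entries.set(identifier, location);
    },
    // Deleting one duplicate Coordinator removes only its entry.
    remove(identifier) {
      entries.delete(identifier);
    },
    // Deleting a project removes the default entry and all duplicates below it.
    removeProject(project) {
      for (const identifier of [...entries.keys()]) {
        if (identifier === project || identifier.endsWith("." + project)) {
          entries.delete(identifier);
        }
      }
    },
    // `undefined` means no such Coordinator exists anywhere.
    lookup(identifier) {
      return entries.get(identifier);
    },
  };
}

const table = createRoutingTable();
table.add("difficult-butterfly-42", "fra");        // default Coordinator
table.add("room-1.difficult-butterfly-42", "sfo"); // duplicate Coordinator
table.add("other-project", "nrt");                 // unrelated project

console.log(table.lookup("room-1.difficult-butterfly-42"));
```

Because every location holds the full table, a lookup is local and the request can go straight to the right location.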

Knowing at upload time which Coordinators a Worker/Coordinator can access might seem valuable. For this, the platform either needs to parse out all URLs from fetch calls in the code or force the programmer to declare the URLs in a config file that it can parse more easily. Apart from not guaranteeing that the code paths actually make the fetch call, this would only work with default Coordinators, since the identifier for a duplicate Coordinator is likely created dynamically at runtime from some data, e.g. IP address, document ID, user ID, etc. Also, the runtime now has two tables to maintain - the entire table of all Coordinators it knows and a reduced table that’s different for each Worker/Coordinator - which increases complexity. Searching the reduced table will likely not give enough of a performance gain over searching the entire table to warrant this.

First request performance penalty of duplicate Coordinator

For the first request to a duplicate Coordinator, there is a performance penalty because the duplicate Coordinator doesn't exist yet and needs to be created. The runtime can’t just create it after checking that it has no entry in its synchronised table, because at the same time another request to the same duplicate Coordinator (same identifier) might arrive at another location, which would create it there as well. Therefore, the runtime needs to do a global lookup to all other locations before it can create the duplicate Coordinator. This is a one-time performance penalty paid for using a dynamically created duplicate Coordinator instead of the default Coordinator created ahead of time. There is no way around this without giving up the benefits of Coordinators. A Coordinator's usefulness comes exactly from being used by multiple Worker / Coordinator instances as a coordination point. The lookup can be avoided only if a duplicate Coordinator is used by only a single Worker / Coordinator instance. This gives up all the benefits of using a Coordinator in the first place, since everything that can be done in the duplicate Coordinator can be done in the single Worker / Coordinator itself. Furthermore, the whole design above has to be given up and runtime APIs are required. This is because the runtime needs to be sure that the identifier of the duplicate Coordinator is unique so it can create it without worrying about other locations. But the unique identifier can’t come from the developer's code in the Worker / Coordinator, like the URL in the fetch call, since any developer-written string could be known to other instances. Despite the limited use case, CF offers to create a unique ID as part of its new runtime API. Note that the default Coordinator doesn't pay this first-request performance penalty since it is already created when the project is deployed. Since CF doesn't have a default Coordinator, they always pay the first-request performance penalty!
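The race described above comes down to a "first claim wins" step that must complete before any location creates the instance. The sketch below models only the outcome of that step with a single shared map. How such a claim would be made atomic across locations is not specified in this proposal, and `createGlobalRegistry` and `claim` are hypothetical names.

```javascript
// Toy model: a globally agreed claim that decides where a duplicate Coordinator runs.
function createGlobalRegistry() {
  const owners = new Map(); // identifier -> location that runs the instance
  return {
    // Returns the owning location. Only the first claim for an identifier takes effect,
    // later claims just learn who the owner is.
    claim(identifier, location) {
      if (!owners.has(identifier)) {
        owners.set(identifier, location);
      }
      return owners.get(identifier);
    },
  };
}

const registry = createGlobalRegistry();

// Two locations see a first request for the same identifier at about the same time.
const ownerSeenByFra = registry.claim("room-1.difficult-butterfly-42", "fra");
const ownerSeenBySfo = registry.claim("room-1.difficult-butterfly-42", "sfo");

// Both agree on one owner, so only one instance gets created.
console.log(ownerSeenByFra, ownerSeenBySfo);
```

The cost of reaching this agreement across locations is the one-time penalty. The default Coordinator avoids it because its owner is fixed at deployment.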

Request ordering

For requests to a Coordinator, CF guarantees that two subsequent requests from the same instance remain subsequent [^ordering]. This is different from requests that go out to the open Internet, where no ordering can be guaranteed. The programmer now needs to be aware that requests to Coordinators work slightly differently from requests to an endpoint on the open Internet. It also adds to the complexity of the runtime, which now has to implement ordering logic. I think it’s best to make no ordering guarantees.

Isolate

A Coordinator instance would run in its own isolate just like a Worker instance. Instance equals isolate. CF instead runs Durable Object instances in the same isolate as the Worker instance of the Worker project where they're used [^isolate] [^isolate2]. This allows them to scale the number of Durable Objects by three orders of magnitude since the overhead is KB instead of MB [^isolate3] [^isolate4] [^isolate5]. But this fundamentally requires that Durable Objects are part of a Worker project instead of existing on their own. This opens the whole can of worms. Apart from new runtime APIs, this is also subject to sharing global state, sharing the same isolate RAM limit (which makes the RAM limit of an individual instance unpredictable), etc. The implementation needs additional logic to spin instances up and down within an isolate. Maybe you notice this is reinventing the wheel... It's like worse isolates within isolates. Decreasing isolate overhead is already one of the central optimization points for Workers. Instead of doubling down on that, for example by switching to WASM isolates like Fastly does with Compute@Edge, CF created yet another optimization point which, being nested within the other, will always be capped by it. It’s an example of increasing complexity instead of reducing it.

Moving location

(This might be a good thing to skip at the beginning and come back in the future after the initial design stands.)

A Coordinator instance could be automatically moved to the optimal location depending on where most of the traffic from the Worker instances comes from. (CF does this. They call it “auto-migration”, though don’t confuse it with the annoying “migrations” developers have to declare to identify updated Durable Object code within the overloaded Worker code.)

Deployment settings of duplicate Coordinator

(This might be a good thing to skip at the beginning and come back in the future after the initial design stands.)

For the default Coordinator any deployment settings are specified in the UI on project creation just like for a Worker. For a duplicate Coordinator the deployment settings that are used when first creating it need to be somehow specified in the request.

The deployment settings of a duplicate Coordinator could be specified in the URL using more subdomains, e.g. <setting3>.<setting2>.<setting1>.<instance>.<project>.deno.dev. But the fixed URL string requires setting all preceding settings and keeping a specific order. A better option would be to use a custom HTTP header, e.g. X-Deno-Deploy. See #127 for a similar argument in Workers.
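To illustrate the header option, here is a minimal sketch. The proposal names only the header `X-Deno-Deploy`. The `key=value; key=value` encoding, the `region` key, its value `eu` and the helper names are assumptions made up for this example.

```javascript
// Hypothetical encoding: "key=value" pairs separated by ";" in one X-Deno-Deploy header.
function encodeSettings(settings) {
  return Object.entries(settings)
    .map(([key, value]) => `${key}=${value}`)
    .join("; ");
}

// What the platform side might do when it creates the duplicate Coordinator.
function decodeSettings(headerValue) {
  const settings = {};
  for (const pair of headerValue.split(";")) {
    const [key, value] = pair.trim().split("=");
    if (key && value !== undefined) settings[key] = value;
  }
  return settings;
}

// Settings are optional and unordered, unlike positional subdomains.
const init = { headers: { "X-Deno-Deploy": encodeSettings({ region: "eu" }) } };

// A Worker would pass `init` as the second argument of fetch, e.g.
// fetch("https://room-1.difficult-butterfly-42.deno.dev/", init)
console.log(init.headers["X-Deno-Deploy"]);
```

Since the settings only matter when the duplicate Coordinator is first created, the platform could ignore the header on all later requests.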

For example, one deployment setting might be to specify the region the Coordinator instance is deployed to for regulatory compliance. (CF does this). See #93 for the equivalent for Workers. When creating a duplicate Coordinator, a sensible default for the region where the instance is created if no region is specified could be near the Worker / Coordinator instance that first requests it. (CF does this.)

Why?

More reasons for this design over CF's. See https://github.com/denoland/deploy_feedback/issues/88#issuecomment-939365326 for a code example.

Future

With Coordinators there are now two products. A multi-instance public FaaS (Worker) and a single-instance private FaaS (Coordinator). The differences are one vs. many instances and private vs. public access.

| instances / access | private | public |
| --- | --- | --- |
| one | Coordinator | |
| many | | Worker |

In the future, one may want to think about the two remaining combinations: a multiple-instance private FaaS ("Assistant") and a single-instance public FaaS ("Greeter"). An Assistant would make it possible to move parallelizable yet confidential business logic to the edge, which currently still needs a monolith because it must not be publicly accessible. A Greeter is like a traditional single server but without the need to manage anything below the application level. Implementing these should be much easier with the existing logic for Workers and Coordinators in place, as the existing features can be reused. You might want to think about better names though.

Further Reading

[^naming]: “But we also needed a name for the individual instances [of the class].”, i.e. the duplicated Coordinators. Source: https://news.ycombinator.com/item?id=24618459
[^config]: CF argues it increases security. "> how security works? [..] To send a message, you must configure the sending Worker with a "Durable Object Namespace Binding". [..] Without the binding, there's no way to talk to Durable Objects in that namespace.". Source: https://news.ycombinator.com/item?id=24617903
[^ordering]: CF offers ordering. "When sending messages to a Durable Object, two messages sent with the same stub will be delivered in order". Source: https://news.ycombinator.com/item?id=24617903
[^statefulworker]: CF thinks “Stateful Worker” makes sense. "["Workers State"] is actually a name we considered, and as a name on its own, I like it a lot. [..] [It] may in fact have been a better name!". Source: https://news.ycombinator.com/item?id=24618030
[^sveltekit]: SvelteKit uses a similar idea of intercepting fetch when fetching from an endpoint in the load function of a module script. The endpoint doesn't exist on the Web. Instead, the response is computed locally during server-side rendering and injected into the server-side rendered page. On the client the fetch(url) call looks like it goes to the network, but is just served the injected response. Unfortunately, their docs are very sparse in explaining this.
[^logicalchunk]: The developer still has to figure out what the right smallest chunk is in their case. For example, for a database, one Coordinator for what was previously one table may or may not make sense, depending on whether you need to join tables.
[^isolate]: “multiple objects may be hosted in the same isolate”. Source: https://news.ycombinator.com/item?id=24618562
[^isolate2]: "Objects [..] can be placed in the same isolate, if they are implemented by the same script.". Source: https://discord.com/channels/595317990191398933/773219443911819284/876619108504977439
[^isolate3]: “Multiple live objects may be hosted in the same isolate. We wanted the marginal overhead for each object to be measured in kilobytes, so that creating tons of them is just fine.” Source: https://twitter.com/KentonVarda/status/1310615343149314054
[^isolate4]: "an important design goal of durable objects is that it should scale to extremely large numbers of fine-grained objects. [..] If we said that each object gets its own isolate, that would unfortunately blow up this goal, since even though isolates are much cheaper than containers, they are still much more expensive than what we'd like to see for fine-grained objects.". Source: https://discord.com/channels/595317990191398933/773219443911819284/869608842789552228
[^isolate5]: "An isolate takes at least a few megabytes of RAM, but we want each durable object to have an overhead measured in kilobytes". Source: https://discord.com/channels/595317990191398933/773219443911819284/906619348519641099
[^nourl]: "Note that stub.fetch() has the same signature as the global fetch() function, except that the request is always sent to the object, regardless of the request's URL.". Source: https://developers.cloudflare.com/workers/learning/using-durable-objects#instantiating-and-communicating-with-a-durable-object
[^lifetime]: "A Durable Object remains active until all asynchronous [operations are completed even if they aren't awaited]. From a Workers perspective, this is similar to enqueuing tasks with FetchEvent.waitUntil." Source: https://developers.cloudflare.com/workers/runtime-apis/durable-objects#durable-object-lifespan
[^missingpiece]: "Durable Objects are the missing piece [..] that makes it possible for whole applications to run entirely on the edge, with no centralized "origin" server at all.". Source: https://blog.cloudflare.com/introducing-workers-durable-objects/
[^servicebindings]: “A service binding allows you to send HTTP requests to another service, without those requests going over the Internet.” Source: https://blog.cloudflare.com/introducing-worker-services/

vwkd commented 3 years ago

Here is an example using Coordinators. Notice how the code is identical, you can't even see if it's a Worker or Coordinator just from the code! Compare this to CF's Counter example using Durable Objects.

(Note: we assume a simplistic Deno.storage Storage API in a Worker / Coordinator (#110). The real API likely won't be this simple, but then we'd use some library that abstracts over it.)

Projects

1. Worker

Create project through UI, select deployment type "Worker".

Project name: afraid-sheep-21
Domain: https://afraid-sheep-21.deno.dev/
Git: https://github.com/username/my-worker/blob/HEAD/mod.js

import { serve } from "https://deno.land/std@0.114.0/http/server.ts";

serve(handleRequest);

async function handleRequest(request) {
    const pathname = new URL(request.url).pathname;
    // Use the default Coordinator rather than a duplicate Coordinator since only one is needed.
    // No first-request performance penalty since the default Coordinator was created at deployment.
    const url = new URL(pathname, "https://difficult-butterfly-42.deno.dev/");
    const res = await fetch(url);
    const count = await res.text();

    return new Response("Coordinator count: " + count);
}

2. Coordinator

Create project through UI, select deployment type "Coordinator".

Project name: difficult-butterfly-42
Domain: https://difficult-butterfly-42.deno.dev/ (doesn't exist on Web!)
Git: https://github.com/username/my-coordinator/blob/HEAD/mod.js

import { serve } from "https://deno.land/std@0.114.0/http/server.ts";

const stored = await Deno.storage.get("value");
let value = stored ?? 0; // fall back to 0 only when nothing has been stored yet

serve(handleRequest);

async function handleRequest(request) {
    const url = new URL(request.url);
    let currentValue = value;
    switch (url.pathname) {
        case "/increment":
            currentValue = ++value;
            await Deno.storage.put("value", value);
            break;
        case "/decrement":
            currentValue = --value;
            await Deno.storage.put("value", value);
            break;
        case "/":
            break;
        default:
            return new Response("Not found", { status: 404 });
    }

    // Convert the number explicitly so the response body is a string.
    return new Response(String(currentValue));
}
justinmchase commented 2 years ago

This is a great idea in general, I'd be excited to use it.

I will say, without having a clear solution in mind, that it would be great if the complexity of separating this into two different components could be solved, for example by just exporting functions and having the Coordinator role of the code be assumed, so that the server and request/response handling portions were automatic?

export function increment(ctx) {
  // ...
}

export function decrement(ctx) {
}

export default function fallback(ctx) {
}