cloudflare / workerd

The JavaScript / Wasm runtime that powers Cloudflare Workers
https://blog.cloudflare.com/workerd-open-source-workers-runtime/
Apache License 2.0

🐛 `blockConcurrencyWhile` doesn't guarantee response ordering #2714

Open threepointone opened 3 weeks ago

threepointone commented 3 weeks ago

Repro at https://github.com/threepointone/durable-ordering

The worker has a Durable Object whose fetch handler, inside a blockConcurrencyWhile call, sleeps for a random period between 0 and 400ms and then increments a counter. It persists the counter, then returns it.

```ts
// `sleep` is assumed to be the usual helper: a Promise that resolves
// after the given number of milliseconds via setTimeout.
const count = await this.ctx.blockConcurrencyWhile(async () => {
  await sleep(Math.random() * 400);
  const count = (await this.ctx.storage.get<number | null>("count")) || 0;
  await this.ctx.storage.put("count", count + 1);
  return count + 1;
});
return new Response(count.toString());
```

The test makes 40 requests to the worker, with a 50ms delay between each request, and then checks whether the counter values come back in the same order the requests were sent.
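The ordering check can be sketched as follows (a guess at the repro's logic, not the actual code from threepointone/durable-ordering): if responses arrive in the order the requests were sent, position i of the collected counter values should hold i + 1.

```javascript
// Hypothetical sketch of the repro's ordering check; the real test in the
// linked repo may differ in details.
function findMismatches(counters) {
  const indices = [];
  counters.forEach((value, i) => {
    if (value !== i + 1) indices.push(i);
  });
  return indices;
}

console.log(findMismatches([1, 2, 3, 4])); // in order → []
console.log(findMismatches([1, 2, 4, 3])); // reordered → [2, 3]
```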

If blockConcurrencyWhile actually blocks request concurrency, then the counter values should come back in order. However, we see that they don't. Inside the Durable Object itself, we seem to be processing requests in order, but the responses don't come back in order. What is the expected behaviour?

Example trial run:

```
Running test, this may take a few seconds...
Final counter array: [
  1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 18, 17, 16, 15, 14, 20, 22, 19, 24,
  23, 28, 30, 26, 33, 21, 31, 25, 35, 27, 38, 39, 32, 40, 29, 37, 34, 36
]
Mismatches found: 24 at indices: [
  13, 14, 16, 17, 18, 19, 20, 21, 23, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
  36, 37, 38, 39
]
```
threepointone commented 3 weeks ago

Internal thread: https://chat.google.com/room/AAAAXqgBsU4/IgkUycTlxSI/IgkUycTlxSI?cls=10

kentonv commented 3 weeks ago

We've never made any sort of guarantee about response ordering (and Cap'n Proto doesn't either). That said, I agree this is surprising. I think what's happening is that when one request completes a blockConcurrencyWhile(), and another request is waiting, we sometimes switch directly to the waiting request rather than finishing out the request that actually did the blockConcurrencyWhile(). This is not incorrect per se, but it'd probably be better if we continued in the calling request first, until it actually awaits something, before switching requests.
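The switch described here can be modeled with plain promises (a toy model in Node.js, not workerd's actual scheduler): when request A resolves the promise request B is waiting on and then awaits something itself, B's continuation was queued first, so B runs before A finishes its response.

```javascript
// Toy model (plain Node.js promises, not workerd internals): request A
// releases a "gate" that request B is awaiting. B's continuation is queued
// the moment the gate resolves, so once A yields, B runs before A's own
// post-release code.
const log = [];

let release;
const gate = new Promise((resolve) => { release = resolve; });

async function requestB() {
  await gate; // suspended until A releases
  log.push("B resumed");
}

async function requestA() {
  log.push("A releasing gate");
  release();               // queues B's continuation
  await Promise.resolve(); // A yields; B's continuation runs first
  log.push("A finished response");
}

async function main() {
  const b = requestB();
  await requestA();
  await b;
  console.log(log); // → ["A releasing gate", "B resumed", "A finished response"]
}
main();
```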

Frederik-Baetens commented 3 weeks ago

> we sometimes switch directly to the waiting request rather than finishing out the request that actually did the blockConcurrencyWhile().

Isn't this pretty much guaranteed to happen because the requests need to wait for what is sometimes upwards of 100ms for the output gate to lift? It doesn't seem feasible to me to wait for the output gate to lift before continuing with the processing of other requests.

kentonv commented 3 weeks ago

@Frederik-Baetens The output gate will release responses in the same order as they completed in JavaScript, so it doesn't have any effect here except to delay things.

Frederik-Baetens commented 3 weeks ago

Cool, I didn't know that. Do you think the write coalescing could contribute here? I thought that the output gate together with write coalescing could result in a few requests having their output gates lifted simultaneously (respecting completion order), after which a few ms of network jitter should be enough to affect response order?

kentonv commented 3 weeks ago

Oh... I just looked closer at the test and realized that the for loop doing 40 requests is actually in a separate node.js client process.

Yeah there's absolutely no expectation of consistent response ordering here since every request is on a different connection. Even if workers responded exactly in order, the network could arbitrarily delay one connection against another.

> how can we guarantee causal ordering on the client side?

What do you mean by "causal ordering" here? I don't think "causal ordering" is ever expected across overlapping requests?
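One way to avoid relying on ordering across overlapping requests is to not overlap them at all: have the client await each response before sending the next request. A minimal sketch, using a mock request function standing in for fetch (names and the mock are illustrative, not from the repro):

```javascript
// Sketch: serialize requests client-side so ordering never depends on the
// network or on server-side scheduling. `request` is a stand-in for fetch;
// the mock below resolves after a random delay, as a real server might.
async function fetchInOrder(request, n) {
  const results = [];
  for (let i = 0; i < n; i++) {
    // The next request is not sent until the previous response arrives.
    results.push(await request());
  }
  return results;
}

let counter = 0;
const mockRequest = () =>
  new Promise((resolve) =>
    setTimeout(() => resolve(++counter), Math.random() * 40));

let results;
fetchInOrder(mockRequest, 5).then((r) => {
  results = r;
  console.log(r); // → [1, 2, 3, 4, 5]
});
```

The trade-off is throughput: with no pipelining, total latency is the sum of the individual round trips.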

kentonv commented 3 weeks ago

> I thought that the output gate together with write coalescing could result in a few requests having their output gates lifted simultaneously (respecting completion order), after which a few ms of network jitter should be enough to affect response order?

It's true that write coalescing + output gates could cause responses to be sent in batches, and I guess I'm not sure whether these batches will end up retaining their original order.