WebAssembly / wasi-cloud-core

Elaboration of user scenarios to ensure we are evolving in the right direction or directions #5

Open · squillace opened this issue 1 year ago

squillace commented 1 year ago

I was chatting with @lukewagner and then some others, including @mikkelhegn, about use cases that have different semantics and whether that would have an impact on the evolution of the proposal. So I'm opening an issue here to start sketching out some of the use cases we at MS have heard from our customers. If this should live as a discussion or a PR instead, please let me know.

Mossaka commented 1 year ago

I'd prefer to collect user scenarios in the repo, perhaps in a file named design/UseCases.md. However, any rough sketches or thoughts are welcome in this issue.

squillace commented 1 year ago

I can think of at least three categories of behavior. What isn't clear to me is whether there's a substantial impact on the surface of the proposal, or merely on the behavior it implies.

  1. Superfast (< 1 second): mainly processing; the kind of serverless that executes in CDN compute.
  2. Generalized serverless: executes long enough to reach almost any service (even quite remote ones), but doesn't linger beyond 5-10 minutes. These run as function triggers in microservices scenarios without involving "reliability" or "durability" (words that imply very long-term waiting on other behavior) and usually involve some sort of serialization/deserialization.
  3. Durable or "long-running" serverless functions.

What I want to do is flesh out what each of these means a) for the surfaces we are defining (I know there's one for scenario 1, above, which is caching) and b) whether they require a surface change or merely an annotation/hint that the behavior for a component is different even though the APIs are the same.

@lukewagner I'm going to provoke you here: regarding scenario 1, I'd love your thoughts -- and those of anyone else. I am personally interested in scenarios 1 and 2.

There is also potentially a fourth scenario, because the above assume that laptop-sized-to-cloud-server-sized resources are available. In smaller edge compute scenarios, power and resources may be limited, which may mean that scenario 1 runs vastly slower on a small device or SoC but is still expected to run "as fast" as it can.

lukewagner commented 1 year ago

I think your scenarios 1-3 are all important ones that we should care about supporting. Incidentally, even in a CDN/Edge scenario, 3 is relevant (because of WebSockets and long-polling). In addition to these user scenarios, I'd like to suggest three possible "user requirements" that we should take into account (they are partial and overlap, so I don't know if they count as "user scenarios" or whether this is what you're asking for... but hey, I've been provoked! ;-):

First: while your "superfast" scenario 1 demands fast cold-start, scenarios 2 and 3 are also very much empowered by it. Even when the overall computation runs for minutes or months, the latency of the first response to the client may be part of a user-interactive loop, and then milliseconds matter again (especially once these services start getting chained, so that cold-starts compound). In such cases, cold-start can be the difference between "I can use this for a frontend HTTP API" and "I need to keep a warm, always-running container".

Based on this, a first possible user requirement is: "Needs fast cold-start".
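
To make the compounding concrete, here is a back-of-envelope sketch in Rust. The per-service figures (200 ms container-style cold start, 1 ms wasm-style cold start, 5 ms warm handling) are illustrative assumptions, not measurements:

```rust
// Illustrative only: the per-service numbers below are assumptions chosen
// to show how cold-start latency compounds down a call chain.
fn main() {
    let container_cold_ms = 200.0; // assumed container-style cold start
    let wasm_cold_ms = 1.0;        // assumed near-instant wasm cold start
    let warm_handle_ms = 5.0;      // assumed warm request-handling time

    for chain_depth in 1..=4 {
        // Worst case: every service in the chain is cold at once.
        let slow = chain_depth as f64 * (container_cold_ms + warm_handle_ms);
        let fast = chain_depth as f64 * (wasm_cold_ms + warm_handle_ms);
        println!(
            "chain depth {chain_depth}: worst-case first response \
             {slow} ms (slow cold start) vs {fast} ms (fast cold start)"
        );
    }
}
```

At a chain depth of 3 or 4, the slow-cold-start case blows well past typical interactive-latency budgets, which is exactly why fast cold-start matters even for the longer-running scenarios.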

Second: an important cross-cutting concern is whether the code is intended to run in a geo-distributed (e.g., multi-availability-zone or edge) manner. In that case it's really important for the developer to know which WASI operations have predictably low-millisecond latency and which require a global round trip, may take hundreds of milliseconds each, and thus must be done either carefully or not at all. If we're trying to define portable interfaces that give developers/operators more freedom to move code around, these developer latency expectations should be represented explicitly (e.g., implied by the WASI interface name), so that moving code around doesn't catastrophically tank performance in a way we only find out about in production.

Based on this, a second possible user requirement is "Geo-distributed, but still needs predictable performance".
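
One way to make those latency expectations explicit is to encode them in the interface names themselves. Here is a minimal Rust-flavored sketch of what guest-side bindings might look like; the trait and method names are hypothetical and do not come from any existing WASI proposal:

```rust
/// Hypothetical sketch: splitting a key-value surface by latency class so
/// that moving code between regions can't silently change its performance
/// profile. None of these names are from a real WASI interface.

/// Operations expected to complete in predictable low-millisecond time
/// (e.g., served from a cache or replica in the same availability zone).
trait ZoneLocalKv {
    fn get_local(&self, key: &str) -> Option<Vec<u8>>;
}

/// Operations that may require a global round trip and can take hundreds
/// of milliseconds each; a caller on the request path sees that cost
/// spelled out in the name.
trait GlobalKv {
    fn get_global(&self, key: &str) -> Result<Option<Vec<u8>>, KvError>;
    fn set_global(&mut self, key: &str, value: Vec<u8>) -> Result<(), KvError>;
}

/// Placeholder error type for the sketch.
#[derive(Debug)]
struct KvError(String);
```

The design point is that an operator auditing a component's imports could tell at a glance whether it depends on anything with global-round-trip latency.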

Third: while some use cases genuinely need very long, constantly-running computations, where it's just the platform time limit that needs to be relaxed, many long-running functions are mostly idle. You don't want to simply relax the time limit and keep the instance live in memory the whole time, as this will be expensive (to someone). Instead, I think the key technical requirement is to be able to dehydrate/rehydrate the instance during periods of idleness. It's tempting to say the platform should do this automagically, but I think that will end up being a pretty flaky/leaky abstraction in practice (consider: the total size of linear memory vs. the size of the actually-necessary persistent state; how live upgrades of long-running code work when the serialized linear memory needs to be compatible with the new code; what happens to resource handles). Thus, I think we'll need some sort of explicit WASI interface by which, during periods of idleness, the host asks the guest to produce a serializable blob of data to later give back, allowing the instance to be destroyed in the interim.

Based on this, a third possible user requirement is: "Long-running, but mostly-idle and needs to pay accordingly".
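
A minimal sketch of what that explicit dehydrate/rehydrate contract could look like from the guest side, written as a plain Rust trait; the names, signatures, and version-byte scheme are assumptions for illustration, not a proposed WASI interface:

```rust
/// Hypothetical contract: during idleness the host calls `dehydrate`,
/// destroys the instance, and later rebuilds it via `rehydrate`.
trait Hibernatable: Sized {
    /// Serialize only the state that must survive, not all of linear
    /// memory, so snapshots stay small and upgrade-tolerant.
    fn dehydrate(&self) -> Vec<u8>;

    /// Rebuild from a snapshot; returning an error lets upgraded code
    /// reject snapshots produced by an incompatible older version.
    fn rehydrate(snapshot: &[u8]) -> Result<Self, String>;
}

/// Toy guest state showing one answer to the live-upgrade concern:
/// a leading version byte in the snapshot format.
struct Counter {
    count: u64,
}

impl Hibernatable for Counter {
    fn dehydrate(&self) -> Vec<u8> {
        let mut buf = vec![1u8]; // snapshot format version
        buf.extend_from_slice(&self.count.to_le_bytes());
        buf
    }

    fn rehydrate(snapshot: &[u8]) -> Result<Self, String> {
        match snapshot.split_first() {
            Some((&1, rest)) if rest.len() == 8 => {
                let mut bytes = [0u8; 8];
                bytes.copy_from_slice(rest);
                Ok(Counter { count: u64::from_le_bytes(bytes) })
            }
            _ => Err("unsupported snapshot version".to_string()),
        }
    }
}
```

Because the guest chooses what goes in the blob, the snapshot can stay far smaller than linear memory, and the explicit version check gives upgraded code a defined way to handle stale state rather than silently misinterpreting it.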

Mossaka commented 8 months ago

Anyone else, feel free to drop comments here with user scenarios.

I will leave this issue open for another two weeks and then create a folder in the repo to list them.

thomastaylor312 commented 8 months ago

I think @lukewagner's comment covered most of my thoughts. I'll offer just a little bit more in my own words:

My last recommendation is that we keep these use cases scoped to the interfaces themselves. There are a lot of thorny and/or interesting technical problems underneath, but the most important thing is settling the common set of capabilities that a large chunk of applications need. Based on past experience, plus what platforms like Cloudflare, Fastly, Lambda, Azure Functions, etc. offer, those things are:

There could be other things in the future, like document-DB-style stuff, but those are probably best added once we get to round 2 and have more feedback from the community.

Anyway, hopefully that helps! Let me know if any of that didn't make sense or needs clarification.