fermyon / spin

Spin is the open source developer tool for building and running serverless applications powered by WebAssembly.
https://developer.fermyon.com/spin
Apache License 2.0

Improving observability of Spin #2293

Closed calebschoepp closed 3 months ago

calebschoepp commented 6 months ago

Observability is critical for a great developer experience. We should work to improve the observability of Spin, but that is a very vague statement. What exactly are we improving the observability of? Spin itself? Spin apps?

This issue is meant to act as a meta-issue that clarifies what we mean by "improving observability of Spin". It will provide a lay of the land by describing the different levels of observability within Spin that we want to improve. Other issues, SIPs, and PRs will be used to track the actual work of improving the observability and they can backlink to this meta-issue.

Before we dive in I want to note that OpenTelemetry has become the industry standard for observability data and is the standard we would want to conform to.

Types of observability in Spin

I propose that there are four types of observability in Spin that we want to enable. They exist on a spectrum from host-focused to guest-focused.

1) Runtime observability — observing the Spin runtime itself

Developers operating Spin in a production environment want observability into the state of the Spin process itself. This would include among other things:

Some notable non-requirements include:

2) Trigger observability — observing the requests made to Spin applications

Developers want observability into the requests that are made to their Spin application. This would include among other things:

3) Component observability — observing the interaction between composed components

Developers will create their Spin applications from a composition of components. Ideally we can automatically emit spans as the component composition graph is traversed and components are executed. This would include among other things:

This would require upstream modifications in Wasmtime.
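
As a rough illustration of the idea (not how Spin or Wasmtime implement it), the sketch below walks a hypothetical composition graph and records one span per component, linking each span to the span of the component that called it. The `Tracer` class, the `GRAPH` contents, and the span names are all assumptions made for this example; a real host would use an OpenTelemetry SDK rather than an in-memory list.

```python
import itertools
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: int
    name: str
    parent_id: Optional[int]
    start_ns: int
    end_ns: Optional[int] = None


class Tracer:
    """Toy in-memory tracer (hypothetical); stands in for a real telemetry SDK."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.spans = []

    def start(self, name, parent=None):
        parent_id = parent.span_id if parent else None
        span = Span(next(self._ids), name, parent_id, time.monotonic_ns())
        self.spans.append(span)
        return span

    def end(self, span):
        span.end_ns = time.monotonic_ns()


# Hypothetical composition graph: component -> components it calls into.
GRAPH = {
    "api": ["auth", "storage"],
    "auth": [],
    "storage": ["cache"],
    "cache": [],
}


def execute(component, tracer, parent=None):
    """Open a span for the component, run its dependencies, then close the span."""
    span = tracer.start(f"component {component}", parent)
    for dep in GRAPH[component]:
        execute(dep, tracer, span)
    tracer.end(span)
    return span


tracer = Tracer()
execute("api", tracer)
for s in tracer.spans:
    print(s.span_id, s.name, "parent:", s.parent_id)
```

The point of the sketch is that the application author writes no tracing code: the spans fall out of the traversal itself, which is why the host (and Wasmtime underneath it) is the natural place to emit them.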

4) Guest observability — observing the code within the guest module

Developers want to be able to instrument their own guest code. This allows them to emit telemetry with spans, metadata, and metrics unique to their own use case. We are reliant on the upstream WASI Observe proposal to make this happen. The upstream proposal has the clearest definition of requirements, but briefly for Spin to act as a host implementation we would require:

Other observability related things

Here are some other observability related things we might want to do to make the experience better in Spin.

Streamline the process of collecting and viewing the observability data

The four types of observability outlined in the above section all just emit telemetry and expect that there is a collector running somewhere to collect the data. It would be good to clearly document the process of running a collector for any users who don't already use a specific collector in their environment.

If we really wanted to streamline the experience, we could take this one step further and build this collector into Spin (or a plugin or an app like KV explorer).

Create an observability standard that other Spin runtimes can match

Spin is not the only Spin runtime. Observability should be implemented in Spin in such a way that other Spin runtimes can follow suit.

Prior art

calebschoepp commented 6 months ago

Here is an example of what a trace might look like when levels 2 through 4 are combined.

[Image: example trace combining levels 2 through 4]
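
For readers without access to the image, here is a minimal text sketch of how spans from levels 2 through 4 might nest in a single trace. Every span name below is hypothetical and chosen only to show the shape: a trigger span at the root, component spans beneath it, and guest-emitted spans at the leaves.

```python
# (span name, parent span name) pairs; all names are hypothetical.
SPANS = [
    ("http trigger: GET /checkout", None),                   # level 2: trigger
    ("component: checkout", "http trigger: GET /checkout"),  # level 3: component
    ("component: payments", "component: checkout"),          # level 3: component
    ("guest: validate_card", "component: payments"),         # level 4: guest code
    ("guest: charge", "component: payments"),                # level 4: guest code
]


def render(spans):
    """Return the spans as an indented tree, one line per span."""
    children = {}
    for name, parent in spans:
        children.setdefault(parent, []).append(name)

    lines = []

    def walk(name, depth):
        lines.append("  " * depth + name)
        for child in children.get(name, []):
            walk(child, depth + 1)

    for root in children.get(None, []):
        walk(root, 0)
    return lines


tree = render(SPANS)
print("\n".join(tree))
```
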
calebschoepp commented 6 months ago

Trigger observability seems like the most tractable and immediate problem, so I'm going to get started on a SIP for how we could implement it.

macolso commented 6 months ago

Question for my own understanding: is CPU / memory utilization considered a runtime or guest metric? For example, Azure Application Insights emits a metric called Process CPU, which shows how much of the total processor capacity is consumed by the process that is hosting your monitored app. I would consider Application Insights a tool for guest observability so this seems like a grey area.

calebschoepp commented 6 months ago

> Question for my own understanding: is CPU / memory utilization considered a runtime or guest metric? For example, Azure Application Insights emits a metric called Process CPU, which shows how much of the total processor capacity is consumed by the process that is hosting your monitored app. I would consider Application Insights a tool for guest observability so this seems like a grey area.

I suppose it could be considered both. We might want to emit CPU/memory utilization from the trigger observability, i.e. how much CPU/memory an invocation of an app used. This would be considered a guest metric. Someone could also use an agent on the node to collect the CPU/memory utilization of Spin itself, and this would be a runtime metric.

I'm not really sure if this answers your question though because your question seems specific to the semantics of App Insights which I'm not really familiar with.
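
To make the distinction concrete, here is a small sketch that measures the same quantity (CPU time) at both granularities inside one process. It is only an analogy: `handle_request` is a hypothetical stand-in for a guest invocation, and a real Spin host would need a different mechanism to attribute CPU to a Wasm invocation.

```python
import time


def handle_request(n):
    # Stand-in for a guest invocation: burn some CPU.
    return sum(i * i for i in range(n))


def measure_invocation(handler, *args):
    """Return (result, cpu_seconds) attributed to a single invocation."""
    start = time.process_time()
    result = handler(*args)
    return result, time.process_time() - start


baseline = time.process_time()

# Per-invocation view: the kind of number trigger observability could attach to a request.
per_invocation = []
for n in (10_000, 50_000):
    _, cpu = measure_invocation(handle_request, n)
    per_invocation.append(cpu)

# Process-wide view: the kind of number a node agent would report for the runtime.
process_cpu = time.process_time() - baseline

print("per-invocation CPU seconds:", per_invocation)
print("whole-process CPU seconds:", process_cpu)
```

The process-wide figure includes everything the per-invocation figures cover plus any host overhead between requests, which is one reason the same metric can reasonably show up at both levels.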

calebschoepp commented 6 months ago

@rylev had a good suggestion that we should make sure to clearly document our patterns around spans, e.g. how we name them, what metadata they carry, and when we emit them. That way the traces that get created can be more consistent and useful.

https://github.com/open-telemetry/opentelemetry-specification/blob/v1.26.0/specification/trace/api.md#span
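
One convention from the linked spec is that span names should identify a class of operations rather than an individual request, with the per-request detail carried in attributes. Below is a minimal sketch of what a documented pattern for HTTP trigger spans could look like. The helper `http_span` and the example route are hypothetical, and the attribute keys are borrowed from the OpenTelemetry HTTP semantic conventions as an assumption about what Spin might adopt, not a statement of what it does.

```python
def http_span(method, route_template, path):
    """Return (span_name, attributes) for an HTTP trigger request.

    The name uses the route template so that every request to the same
    route shares one low-cardinality span name; the concrete path is
    recorded as an attribute instead.
    """
    method = method.upper()
    name = f"{method} {route_template}"
    attributes = {
        "http.request.method": method,
        "http.route": route_template,
        "url.path": path,
    }
    return name, attributes


name, attrs = http_span("get", "/users/:id", "/users/42")
print(name)
print(attrs)
```

Writing the convention down as a single helper like this is one way to keep names and metadata consistent across triggers.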

calebschoepp commented 6 months ago

Seeing as this is a meta-issue tracking a lot of work, I'm wondering whether it should be marked as in progress. @vdice what do you think?

vdice commented 6 months ago

@calebschoepp 👍 Sounds good. Thanks!

agardnerIT commented 5 months ago

As someone with Observability experience and a CNCF ambassador, please LMK if I can assist here. I am happy to act in a vendor-neutral consultant role.

lann commented 5 months ago

@agardnerIT Thanks! The most recent work in progress is at https://github.com/fermyon/spin/pull/2398 if you are interested in following.

calebschoepp commented 3 months ago

This work is sufficiently far along that I'm closing this initial ticket.