Runtime stats for Wasmtime

acfoltzer commented 3 years ago

Feature

To use Wasmtime in production, we will need to gather statistics from the runtime to feed into monitoring systems.

We have inserted some metrics in a private fork of Lucet in order to support our current production use, and while this has proven useful, we would like to have first-class, open source support for such things in Wasmtime.

Benefit

The ability to monitor the runtime performance and load of Wasmtime is a requirement for it to be used in many production environments.

Implementation

I'll describe the stats we gather from Lucet. I'm less sure how best to fit them into the Wasmtime API, but I have described the kind of callback interface I would like to provide as a client of Wasmtime.

The ones in bold are the ones we've found most important for monitoring platform health and performance. The others would be nice to have, but are less critical. This list shouldn't rule out other opportunities for stat gathering; it is solely what we've found useful in Lucet.

Counters

Sometimes we just need to count how many times something has happened. For Lucet, this is a handful of internal error conditions that we are able to handle without presenting an error to the end user, but want to keep track of internally nonetheless. There is certainly room for more of these:

Number of retries needed on userfaultfd operations due to ENOENT errors
Number of EEXIST errors that the userfaultfd fault handler saw and tracked
Number of times the userfaultfd fault handler got a read event on its file descriptor but was not able to read an event

This would be a simple callback per event:

fn record_event(&self) {
    // Bump a counter
}
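
As a rough illustration of how an embedder might implement such a callback, here is a minimal sketch. The `EventCounter` trait, `AtomicCounter` type, and `retry_with_counter` helper are all hypothetical names invented for this example; none of them are part of the Wasmtime or Lucet APIs. The helper only stands in for runtime code that retries an operation on a transient error such as ENOENT.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical callback trait the runtime could invoke once per event.
/// This is an assumption for illustration, not an existing Wasmtime API.
pub trait EventCounter: Send + Sync {
    fn record_event(&self);
}

/// One possible embedder-side implementation: a lock-free counter.
#[derive(Default)]
pub struct AtomicCounter {
    count: AtomicU64,
}

impl AtomicCounter {
    /// Read the current value, e.g. when a monitoring system scrapes it.
    pub fn get(&self) -> u64 {
        self.count.load(Ordering::Relaxed)
    }
}

impl EventCounter for AtomicCounter {
    fn record_event(&self) {
        // Bump the counter; relaxed ordering is enough for a plain statistic.
        self.count.fetch_add(1, Ordering::Relaxed);
    }
}

/// Stand-in for runtime code that retries a fallible operation and
/// reports each retry through the callback.
pub fn retry_with_counter<T, E>(
    retries: &dyn EventCounter,
    max_attempts: u32,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt >= max_attempts => return Err(err),
            // Handled internally, but still counted.
            Err(_) => {
                retries.record_event();
                attempt += 1;
            }
        }
    }
}
```

Keeping the callback behind a trait object means the runtime never needs to know which monitoring system, if any, sits behind it.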

Gauges and timers

For measuring operations with distinct start and end points, we use gauges and timing histograms. A gauge is a number that, if incremented, usually has a corresponding decrement at some point in the future. A timer adds a timing component to this, so that the time between the beginning and end of an operation can be measured. These could be implemented with a callback that returns an RAII-style guard:

fn start_operation(&self) -> Guard {
    // Increment gauge
    // Create `Guard` with initial timestamp
}

struct Guard;

impl Guard {
    fn finish(self) {
        // The operation finished normally, so record timing information
    }
}

impl Drop for Guard {
    fn drop(&mut self) {
        // Decrement gauge
        // Optionally record timing information
    }
}

Decrementing the gauge in the Drop implementation provides some assurance that the gauge remains accurate even if an error occurs between the start and end of an operation. For the timing information, though, we do not necessarily want the timing of errors to be recorded, so having an explicit finish method lets us know that the operation was successful.
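
To make the guard pattern concrete, here is one way it could be fleshed out into compilable code. This is only a sketch under assumptions: the `OperationStats` type, its `Vec<Duration>` standing in for a real timing histogram, and the `instantiate` function are hypothetical and do not reflect how Lucet or Wasmtime actually store metrics.

```rust
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Hypothetical per-operation stats: a gauge of in-flight operations plus
/// recorded durations (a Vec stands in for a real timing histogram).
#[derive(Default)]
pub struct OperationStats {
    in_flight: AtomicI64,
    durations: Mutex<Vec<Duration>>,
}

impl OperationStats {
    /// Increment the gauge and hand back a guard carrying the start time.
    pub fn start_operation(&self) -> Guard<'_> {
        self.in_flight.fetch_add(1, Ordering::Relaxed);
        Guard { stats: self, start: Instant::now() }
    }

    pub fn in_flight(&self) -> i64 {
        self.in_flight.load(Ordering::Relaxed)
    }

    pub fn recorded(&self) -> usize {
        self.durations.lock().unwrap().len()
    }
}

pub struct Guard<'a> {
    stats: &'a OperationStats,
    start: Instant,
}

impl Guard<'_> {
    /// The operation finished normally, so record timing information.
    /// `self` is dropped at the end of this method, which decrements the gauge.
    pub fn finish(self) {
        let elapsed = self.start.elapsed();
        self.stats.durations.lock().unwrap().push(elapsed);
    }
}

impl Drop for Guard<'_> {
    fn drop(&mut self) {
        // Runs on every exit path, including early returns on error.
        self.stats.in_flight.fetch_sub(1, Ordering::Relaxed);
    }
}

/// Stand-in for a measured runtime operation.
pub fn instantiate(stats: &OperationStats, fail: bool) -> Result<(), &'static str> {
    let guard = stats.start_operation();
    if fail {
        // Guard is dropped here: gauge decremented, no timing recorded.
        return Err("instantiation failed");
    }
    guard.finish();
    Ok(())
}
```

In this sketch the error path leaves the duration list untouched while the gauge still returns to zero, which is the split between `finish` and `drop` described above.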

In Lucet we currently use gauges and timers to measure:

Evaluating a future on behalf of a Wasm program (similar to RFC 2)
Instantiating a module (setting up memory protections, copying in initial heap values)
Freeing an instance (resetting memory protections, freeing other resources)
Expanding a Wasm heap on behalf of an instance
Acquiring an instance slot from a memory region
Returning a freed instance slot to a region

Alternatives

Most of the stats we gather for production are taken from outside the boundaries of the Lucet runtime API. To the extent that these operations can be exposed as discrete steps that the library client can measure, we do not need to add invasive stats interfaces. The stats described here are the ones that, in Lucet, could only be exposed as discrete measurable operations through a significant API refactoring, which would potentially be undesirable for safety or ergonomics.

Instead of a callback-based approach, we could maintain stats internally within Wasmtime and let them be queried by the embedding application. This would put more of a maintenance and design burden on Wasmtime, however, and would limit the flexibility of the client's stat-gathering interfaces.
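
For comparison, the internally maintained alternative might look roughly like the following sketch. The `RuntimeStats` and `StatsSnapshot` types and their fields are hypothetical, chosen only to illustrate the shape of a query-based interface; Wasmtime does not expose them.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stats owned and updated by the runtime itself.
#[derive(Default)]
pub struct RuntimeStats {
    instantiations: AtomicU64,
    heap_growths: AtomicU64,
}

/// Point-in-time copy handed to the embedding application.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct StatsSnapshot {
    pub instantiations: u64,
    pub heap_growths: u64,
}

impl RuntimeStats {
    /// Called from inside the runtime when a module is instantiated.
    pub fn note_instantiation(&self) {
        self.instantiations.fetch_add(1, Ordering::Relaxed);
    }

    /// Called from inside the runtime when a Wasm heap is expanded.
    pub fn note_heap_growth(&self) {
        self.heap_growths.fetch_add(1, Ordering::Relaxed);
    }

    /// Queried by the embedder whenever its monitoring system wants data.
    pub fn snapshot(&self) -> StatsSnapshot {
        StatsSnapshot {
            instantiations: self.instantiations.load(Ordering::Relaxed),
            heap_growths: self.heap_growths.load(Ordering::Relaxed),
        }
    }
}
```

The fixed set of fields in the snapshot illustrates the trade-off: the runtime decides what is measured and how it is aggregated, and the embedder can only read what it is given.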

peterhuene commented 3 years ago

Have we looked at the metrics crate as a possible way to generically instrument wasmtime and wasmtime-runtime?

acfoltzer commented 3 years ago

I did not know about that crate! That looks very interesting indeed.

acfoltzer commented 3 years ago

It looks like metrics ticks many of the boxes for the requirements we have, but it doesn't appear to have an RAII interface for gauges. In practice we have found that to be very useful to avoid missing decrements due to surprise control flow. Maybe they'd be open to an upstream contribution, though?

bytecodealliance / wasmtime