Open acfoltzer opened 3 years ago
Have we looked at the metrics crate as a possible way to generically instrument wasmtime and wasmtime-runtime?
I did not know about that crate! That looks very interesting indeed.
It looks like metrics ticks many of the boxes for the requirements we have, but it doesn't appear to have an RAII interface for gauges. In practice we have found that to be very useful to avoid missing decrements due to surprise control flow. Maybe they'd be open to an upstream contribution, though?
Feature
To use Wasmtime in production, we will need to gather statistics from the runtime to feed into monitoring systems.
We have inserted some metrics in a private fork of Lucet in order to support our current production use, and while this has proven useful, we would like to have first-class, open source support for such things in Wasmtime.
Benefit
The ability to monitor the runtime performance and load of Wasmtime is a requirement for it to be used in many production environments.
Implementation
I'll describe the stats we gather from Lucet. I'm less sure how best to fit them into the Wasmtime API, but I have described the kind of callback interface I would like to provide as a client of Wasmtime.
The ones in bold are the ones that we've found very important for monitoring platform health and performance. The others would be nice to have, but are less critical. This list also shouldn't rule out other opportunities for stat gathering; it is solely what we've found useful in Lucet.
Counters
Sometimes we just need to count how many times something has happened. For Lucet, this is a handful of internal error conditions that we are able to handle without presenting an error to the end user, but want to keep track of internally nonetheless. There is certainly room for more of these:
- userfaultfd operations due to ENOENT errors
- EEXIST errors that the userfaultfd fault handler saw and tracked
- userfaultfd fault handler got a read event on its file descriptor, but was not able to read an event.

This would be a simple callback per event:
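The callback snippet that originally followed is not present in this copy, so here is a minimal sketch of what a per-event counter callback could look like. Every name in it (CounterEvent, StatsCallbacks, AtomicCounters) is hypothetical and chosen for illustration; none of them is an existing Wasmtime or Lucet API.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical set of counted events, mirroring the userfaultfd cases above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum CounterEvent {
    /// A userfaultfd operation hit ENOENT.
    UffdEnoent,
    /// The fault handler saw and tracked an EEXIST error.
    UffdEexist,
    /// The fault handler was woken for a read but could not read an event.
    UffdEmptyRead,
}

/// Hypothetical callback interface the embedder would implement.
/// The runtime would call `count` once per occurrence of an event.
pub trait StatsCallbacks: Send + Sync {
    fn count(&self, event: CounterEvent);
}

/// Example embedder-side implementation backed by atomic counters.
#[derive(Default)]
pub struct AtomicCounters {
    enoent: AtomicU64,
    eexist: AtomicU64,
    empty_read: AtomicU64,
}

impl AtomicCounters {
    /// Maps an event to the counter that tracks it.
    fn slot(&self, event: CounterEvent) -> &AtomicU64 {
        match event {
            CounterEvent::UffdEnoent => &self.enoent,
            CounterEvent::UffdEexist => &self.eexist,
            CounterEvent::UffdEmptyRead => &self.empty_read,
        }
    }

    /// Reads the current value of one counter.
    pub fn get(&self, event: CounterEvent) -> u64 {
        self.slot(event).load(Ordering::Relaxed)
    }
}

impl StatsCallbacks for AtomicCounters {
    fn count(&self, event: CounterEvent) {
        self.slot(event).fetch_add(1, Ordering::Relaxed);
    }
}
```

An embedder would more likely forward each `count` call to its own monitoring system than keep local atomics; the atomics here only keep the sketch self-contained.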
Gauges and timers
For measuring operations with distinct start and end points, we use gauges and timing histograms. A gauge is a number that, if incremented, usually has a corresponding decrement at some point in the future. A timer adds a timing component to this, so that the time between the beginning and end of an operation can be measured. These could be implemented with a callback that returns an RAII-style guard:
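The guard snippet from the original issue is missing from this copy. As a rough sketch under stated assumptions, a guard could combine both roles: increment a gauge when created, decrement it on drop, and record a duration only when the caller explicitly finishes it. TimingSink, OpGuard, and InMemorySink are hypothetical names for illustration, not part of any existing Wasmtime or Lucet interface.

```rust
use std::sync::atomic::{AtomicI64, AtomicU64, Ordering};
use std::sync::Arc;
use std::time::{Duration, Instant};

/// Hypothetical sink implemented by the embedder.
pub trait TimingSink: Send + Sync {
    /// Adjusts the gauge by `delta` (+1 on start, -1 on end).
    fn gauge_add(&self, delta: i64);
    /// Records the duration of a successfully completed operation.
    fn record_duration(&self, elapsed: Duration);
}

/// RAII guard: the gauge is incremented on creation and decremented on drop,
/// so early returns and `?` cannot leave it stuck at a stale value.
pub struct OpGuard {
    sink: Arc<dyn TimingSink>,
    start: Instant,
}

impl OpGuard {
    pub fn start(sink: Arc<dyn TimingSink>) -> Self {
        sink.gauge_add(1);
        OpGuard { sink, start: Instant::now() }
    }

    /// Marks the operation as successful and records its duration.
    /// The guard is consumed, so `Drop` runs afterwards and decrements the gauge.
    pub fn finish(self) {
        self.sink.record_duration(self.start.elapsed());
    }
}

impl Drop for OpGuard {
    fn drop(&mut self) {
        self.sink.gauge_add(-1);
    }
}

/// Example sink that keeps the gauge and a completion count in memory.
#[derive(Default)]
pub struct InMemorySink {
    pub in_flight: AtomicI64,
    pub completed: AtomicU64,
}

impl TimingSink for InMemorySink {
    fn gauge_add(&self, delta: i64) {
        self.in_flight.fetch_add(delta, Ordering::Relaxed);
    }
    fn record_duration(&self, _elapsed: Duration) {
        // A real sink would feed `_elapsed` into a histogram.
        self.completed.fetch_add(1, Ordering::Relaxed);
    }
}
```

If the guard is dropped without calling `finish`, for example on an error path, only the gauge decrement happens and no timing sample is recorded.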
Using the drop for the gauge provides some assurance that the gauge remains accurate even if an error occurs between the start and end of an operation. For the timing information, though, we do not necessarily want the timing of errors to be recorded, so having an explicit finish method lets us know that the operation was successful.
In Lucet we currently use gauges and timers to measure:
Alternatives
Most of the stats we gather for production are taken from outside the boundaries of the Lucet runtime API. To the extent that these operations can be exposed as discrete steps that the library client can measure itself, we do not need to add invasive stats interfaces. The stats described here are the ones that, in Lucet, would require a significant API refactoring to expose as discrete measurable operations, and where such a refactoring would potentially be undesirable for safety or ergonomics.
Instead of a callback-based approach, we could maintain stats internally within Wasmtime and let them be queried by the embedding application. This would put more of a maintenance and design burden on Wasmtime, however, and would limit the flexibility of the client's stat-gathering interfaces.
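For comparison, the query-based alternative might look roughly like the sketch below: the runtime owns the counters and the embedder polls a snapshot. RuntimeStats, StatsSnapshot, and their fields are hypothetical names used only to illustrate the shape of such an interface.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stats block owned and updated by the runtime itself.
#[derive(Default)]
pub struct RuntimeStats {
    uffd_enoent: AtomicU64,
    uffd_eexist: AtomicU64,
}

/// Plain-data copy of the counters, handed to the embedder on request.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct StatsSnapshot {
    pub uffd_enoent: u64,
    pub uffd_eexist: u64,
}

impl RuntimeStats {
    /// Called from inside the runtime when an ENOENT case is handled.
    pub fn record_enoent(&self) {
        self.uffd_enoent.fetch_add(1, Ordering::Relaxed);
    }

    /// Called from inside the runtime when an EEXIST case is handled.
    pub fn record_eexist(&self) {
        self.uffd_eexist.fetch_add(1, Ordering::Relaxed);
    }

    /// Called by the embedding application whenever it wants to export stats.
    pub fn snapshot(&self) -> StatsSnapshot {
        StatsSnapshot {
            uffd_enoent: self.uffd_enoent.load(Ordering::Relaxed),
            uffd_eexist: self.uffd_eexist.load(Ordering::Relaxed),
        }
    }
}
```

The trade-off is visible in the sketch: every new stat means a new field and a change to the snapshot type inside the runtime, whereas with callbacks the embedder decides how and where each event is recorded.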