firecracker-microvm / firecracker

Secure and fast microVMs for serverless computing.
http://firecracker-microvm.io
Apache License 2.0

Redesign metrics system with thread-safety in mind #2899

Open alindima opened 2 years ago

alindima commented 2 years ago

The metrics system is not fully thread-safe at the moment, due to some issues:

  1. IncMetrics' inner state is mutated on serialisation. This causes race conditions when the write() function is called from multiple threads. See: https://github.com/firecracker-microvm/firecracker/pull/2893
  2. While SharedIncMetrics use atomics, they always use Relaxed ordering. On x86, memory accesses have Acquire-Release semantics, but on Arm they do not, so the process of writing metrics to file may use outdated values.
  3. Metrics are written from the signal handler, which may cause a deadlock if a thread is preempted by a signal while it was holding the metrics file lock.

Problems 1 and 3 can be fully solved by removing metrics usage from the signal handler. One option here is to have a special file used for logging the exit reason and the latest metrics values, similar to a coredump. We should also enforce that METRICS.write() is called from a single thread (and therefore remove the lazy_static declaration).

Problem 2 could be solved in two ways: by using tighter ordering constraints (this needs a further deep dive and may incur some overhead due to CPU reordering constraints and the prevention of certain compiler optimisations), or by redesigning the metrics system to use per-thread values (this would also solve problem 1).
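
A minimal sketch of the ordering trade-off described above, using an illustrative counter type (the names and methods here are hypothetical, not the actual Firecracker SharedIncMetric implementation):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Illustrative counter; not the actual Firecracker `SharedIncMetric` type.
#[derive(Default)]
pub struct Counter(AtomicU64);

impl Counter {
    /// Hot-path increment with Relaxed ordering: cheap, but it creates no
    /// happens-before edge with the thread that later flushes the metric.
    pub fn inc_relaxed(&self) {
        self.0.fetch_add(1, Ordering::Relaxed);
    }

    /// "Tighter ordering" variant: a Release increment paired with an
    /// Acquire load on the flush path makes the incrementing thread's
    /// earlier writes visible to the flusher, at the cost of restricting
    /// CPU and compiler reordering on weakly ordered targets such as Arm.
    pub fn inc_release(&self) {
        self.0.fetch_add(1, Ordering::Release);
    }

    /// Flush-path read used when serialising metrics to the log file.
    pub fn flush_value(&self) -> u64 {
        self.0.load(Ordering::Acquire)
    }
}
```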

Another thing to keep in mind is the potential need for device-specific instances of a metric. For example, it may make sense to report METRICS.net.tx_bytes_count per device instance instead of as an aggregate.
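
For illustration, a per-device layout could look roughly like the following, where every type and field name is hypothetical rather than an existing Firecracker definition:

```rust
use std::sync::atomic::AtomicU64;

/// Hypothetical per-device metrics block; field names are illustrative.
#[derive(Default)]
pub struct NetDeviceMetrics {
    pub tx_bytes_count: AtomicU64,
    pub rx_bytes_count: AtomicU64,
}

/// Each net device instance owns its own metrics instead of bumping a
/// global aggregate, so tx_bytes_count can be reported per device.
pub struct NetDevice {
    pub id: String,
    pub metrics: NetDeviceMetrics,
}
```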

alindima commented 2 years ago

Related to https://github.com/firecracker-microvm/firecracker/issues/1759. In order to remove the lazy_static usage we'd need to use per-component metrics. One option for the design is the one linked in the issue.

acatangiu commented 2 years ago

+1 on each instance of each component having its own metrics (so one could see, for example, how many packets a particular net device, say net device 2, has sent).

Unfortunately, on the signal handler topic, this is not a fix, and will actually make things even worse (more constrained).

Currently, there is an issue of a potential deadlock on the global metrics mutex, which a signal handler can try to acquire even when it is already held by the same thread in normal (non-signal-handler) context. Moving the metrics from a single global object to multiple individual components ultimately means more individual mutexes, which can also deadlock if there is a path to them from a signal handler.

E.g.: something like vmm.flush_all_metrics() involves iteratively locking each device to flush its metrics. Calling that in a signal handler means there is a high chance the signal arrives while some emulation code is running under a device's held lock, resulting in a guaranteed deadlock.
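
A schematic sketch of that failure mode, with hypothetical types (std::sync::Mutex is not re-entrant, so re-acquiring it on the same thread deadlocks or panics):

```rust
use std::sync::Mutex;

struct Device {
    state: Mutex<u64>, // illustrative per-device lock
}

impl Device {
    fn emulate(&self) {
        let _guard = self.state.lock().unwrap();
        // If a signal arrives here and its handler calls flush_metrics()
        // on this same thread, the lock() below is attempted while the
        // lock is already held by this thread: std::sync::Mutex is not
        // re-entrant, so the thread deadlocks (or panics) against itself.
    }

    fn flush_metrics(&self) {
        let _guard = self.state.lock().unwrap();
        // ... write this device's metrics ...
    }
}
```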

The safest thing to do is to just not flush/print metrics in signal handlers :stuck_out_tongue:

acatangiu commented 2 years ago

Metrics themselves don't need locks since they're atomics, so a solution would also be to decouple the writer from the metric implementation.

Then you can use different writers for different contexts with no locking required between them. Or use a single global writer wrapped in a reentrant mutex. Or some other custom approach, the point being to move the challenge outside of the metrics implementation.
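
One way to read this suggestion, as a sketch only (the struct, field names, and JSON shape below are assumptions, not the actual Firecracker serialiser): the metrics are plain atomics, and the serialisation function takes any Write target chosen by the caller, so the normal flush path and an "emergency" path can use different destinations without sharing a lock.

```rust
use std::io::Write;
use std::sync::atomic::{AtomicU64, Ordering};

/// Metrics know nothing about where they get written.
#[derive(Default)]
pub struct Metrics {
    pub seccomp_failures: AtomicU64,
    pub tx_bytes_count: AtomicU64,
}

/// The writer is supplied by the caller: a buffered log file in the normal
/// path, or a dedicated "emergency" file from a signal handler.
pub fn write_metrics<W: Write>(metrics: &Metrics, out: &mut W) -> std::io::Result<()> {
    writeln!(
        out,
        "{{\"seccomp_failures\":{},\"tx_bytes_count\":{}}}",
        metrics.seccomp_failures.load(Ordering::Relaxed),
        metrics.tx_bytes_count.load(Ordering::Relaxed),
    )
}
```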

raduiliescu commented 2 years ago

You definitely need to flush metrics, or at least one metric, in a signal handler. It is important not to lose critical events such as seccomp failures. But one can have multiple write/dump functions: one to be called in the normal context where everything is fine, and one for signal handlers where the code is in an "emergency" situation.

I am also a bit worried about the performance impact of atomic metrics. As discussed in the issue mentioned above, atomics for metrics on the hot paths might degrade performance. Depending on architecture specifics, we need to consider alternatives such as per-thread metrics.
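
A rough sketch of the per-thread alternative (all names are illustrative): the hot path does a plain, non-atomic increment in a thread-local, and each thread occasionally folds its local delta into a shared total, keeping atomics off the per-event path.

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicU64, Ordering};

/// Shared total, touched only when a thread publishes its local delta.
pub static TX_BYTES_TOTAL: AtomicU64 = AtomicU64::new(0);

thread_local! {
    // Per-thread counter: plain, non-atomic increment on the hot path.
    static TX_BYTES_LOCAL: Cell<u64> = Cell::new(0);
}

/// Hot path: no atomic operation per event.
pub fn record_tx_bytes(n: u64) {
    TX_BYTES_LOCAL.with(|c| c.set(c.get() + n));
}

/// Called occasionally by each thread (e.g. once per event-loop iteration)
/// to fold its local delta into the shared total that gets flushed.
pub fn publish_tx_bytes() {
    TX_BYTES_LOCAL.with(|c| {
        let delta = c.replace(0);
        if delta != 0 {
            TX_BYTES_TOTAL.fetch_add(delta, Ordering::Relaxed);
        }
    });
}
```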

alindima commented 2 years ago

> Metrics themselves don't need locks since they're atomics, so a solution would also be to decouple the writer from the metric implementation.

This is similar to what I was proposing here (actually an idea from @alsrdn):

> Problems 1 and 3 can be fully solved by removing metrics usage from the signal handler. One option here is to have a special file used for logging the exit reason and the latest metrics values, similar to a coredump.

The metrics lock is only for writing to the file. If we have another file that is used from the signal handler, then this problem is solved.

It is then a challenge to make the metrics system available to the signal handler without using a global variable (like with lazy_static). If we're keen on doing any metrics flushing from the signal handler, I think we're stuck with using some globally accessible object, which may be fine if we solve the thread-safety issues.

In a nutshell, there are two big problems to be solved here:

  1. Making the metrics thread-safe or making them flushable only from one thread.
  2. Fixing the deadlock potential (which I believe we may only ever solve by writing a coredump-like file from the signal handler).

And indeed the per-component metrics add trouble, unless we have a sophisticated design where each component propagates the metric update to its parent whenever it becomes available (instead of triggering it on a specific flush event).

But I don't really see how we can use per-component metrics in the signal handler anyway.

alindima commented 2 years ago

> I am also a bit worried about the performance impact of atomic metrics. As discussed in the issue mentioned above, atomics for metrics on the hot paths might degrade performance. Depending on architecture specifics, we need to consider alternatives such as per-thread metrics.

This is indeed a nice approach šŸ‘šŸ» It would also remove the potential for race conditions in metrics serialisation, since at no point could two threads operate on the same metric value.

xmarcalx commented 1 month ago

We removed this task from the roadmap because we are currently not planning to work on it due to higher-priority tasks. However, we split #4709 out of this task, which will help define the first stepping stone toward the broader refactor proposed in this issue.