Open ricab opened 3 years ago
Deadlocks typically happen when all four conditions are satisfied:

* Mutual exclusion
* Hold and wait
* No preemption
* Circular wait
Without having looked at the code, and strictly from the stack trace, it appears that both

`multipass::LXDVirtualMachine::ensure_vm_is_running` at `/src/platform/backends/lxd/lxd_virtual_machine.cpp:317`
`multipass::LXDVirtualMachine::stop` at `/src/platform/backends/lxd/lxd_virtual_machine.cpp:227`

are waiting on the locks which the opposing thread holds; a circular wait.
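For concreteness, here is a minimal, generic sketch of that circular-wait shape. The names mirror the trace, but this is not Multipass code and I am only guessing at which locks are involved:

```cpp
#include <mutex>

// Illustrative names only; the actual members and lock order are assumptions.
std::mutex state_mutex;
std::mutex other_mutex;

void thread_a() // e.g. the path through ensure_vm_is_running
{
    std::lock_guard<std::mutex> first{state_mutex};
    // ... some work, then the second lock is needed ...
    std::lock_guard<std::mutex> second{other_mutex}; // blocks forever if thread_b ran first
}

void thread_b() // e.g. the path through stop
{
    std::lock_guard<std::mutex> first{other_mutex};
    std::lock_guard<std::mutex> second{state_mutex}; // blocks forever if thread_a ran first
}
```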
My suggested band-aid solution would be a modified spinlock: try to acquire the lock three times, with an n-second wait between attempts. On the third failed attempt, hard-fail the LWP after releasing all locks.
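Roughly, and assuming the relevant mutex could be made a `std::timed_mutex` (so this is a hypothetical sketch, not existing Multipass code):

```cpp
#include <chrono>
#include <mutex>

// Hypothetical helper: try a bounded number of times to take a timed mutex,
// then give up instead of blocking forever.
bool acquire_with_retries(std::timed_mutex& m, int attempts = 3,
                          std::chrono::seconds wait = std::chrono::seconds{1})
{
    for (int i = 0; i < attempts; ++i)
        if (m.try_lock_for(wait)) // blocks for at most `wait`, then reports failure
            return true;
    return false; // caller releases whatever it already holds and fails the operation
}
```

If this returns false, the caller would back out: release its own locks and fail the command rather than hang.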
A more robust solution would be to follow a distributed and data systems design concept called Strict 2-Phase Locking. You would couple this with assigning each resource lock a unique integer priority, acquiring all locks in increasing priority order, and releasing them in decreasing order.
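A minimal sketch of that ordering discipline, with hypothetical ranks (the actual mutexes, and a sensible ordering for them, would have to come from the code):

```cpp
#include <mutex>

// Assumed global ordering: every call site must acquire in increasing rank.
std::mutex metrics_mutex; // rank 1
std::mutex state_mutex;   // rank 2

void needs_both()
{
    std::unique_lock<std::mutex> low{metrics_mutex}; // lower rank first
    std::unique_lock<std::mutex> high{state_mutex};  // then higher rank
    // ... critical section ...
}   // destruction releases in reverse, i.e. decreasing-rank, order
```

When both locks are needed at a single call site, `std::scoped_lock` (or `std::lock`) can also take them together using a built-in deadlock-avoidance algorithm, which covers that particular case without a manual ordering.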
Another thing to consider is lock granularity: are the locks too coarse or too fine in the resources they encompass? Tweaking these may reduce concurrency throughput. This is just my two cents from experience and academia.
Thanks for the input @surahman.
> Deadlocks typically happen when all four conditions are satisfied:
>
> * Mutual exclusion
> * Hold and wait
> * No preemption
> * Circular wait
Good summary.
> Without having looked at the code, and strictly from the stack trace, it appears that both
>
> `multipass::LXDVirtualMachine::ensure_vm_is_running` at `/src/platform/backends/lxd/lxd_virtual_machine.cpp:317`
> `multipass::LXDVirtualMachine::stop` at `/src/platform/backends/lxd/lxd_virtual_machine.cpp:227`
>
> are waiting on the locks which the opposing thread holds; a circular wait.
>
> My suggested band-aid solution would be a modified spinlock: try to acquire the lock three times, with an n-second wait between attempts. On the third failed attempt, hard-fail the LWP after releasing all locks.
I'm not sure this is the sort of thing that would merit a band-aid approach. Probably better to fix it properly when we can.
> A more robust solution would be to follow a distributed and data systems design concept called Strict 2-Phase Locking. You would couple this with assigning each resource lock a unique integer priority, acquiring all locks in increasing priority order, and releasing them in decreasing order.
>
> Another thing to consider is lock granularity: are the locks too coarse or too fine in the resources they encompass? Tweaking these may reduce concurrency throughput. This is just my two cents from experience and academia.
All good ideas to jump-start dealing with this, when someone picks it up. I would add the preference of "making it as simple as possible, but no simpler", so probably look first at granularity/scope, circular dependencies, hold and wait.
> I'm not sure this is the sort of thing that would merit a band-aid approach. Probably better to fix it properly when we can.

I agree, it is like putting a finger over a crack in a dam wall.
This is a very interesting bug to follow along with. I wish I were more comfortable with the Multipass architecture and codebase to try to plug it. I do not feel like I fully grasp where all the resources are tied together in the code yet 😉.
I had a quick look around in the code and it seems as though both threads are waiting on the `state_mutex` for the virtual machine in question. This means that another thread is holding the `state_mutex` and is either running or itself waiting on a lock.

Thread 8 is running `mp::MetricsProvider::MetricsProvider`, holding the `metrics_mutex`, and waiting on a CV. The other threads seem to be running some system-level request operations and dealing with RPC calls. My guess, without a deeper knowledge of the architecture, is that the `MetricsProvider` in thread 8 might be holding, or have instigated the holding of, the `state_mutex`. That is just my guess based on the name of the routine and what the mutex is supposed to guard. That said, if it had instigated the holding of a lock in another thread, we should be able to discern it from the stack trace, and I do not see anything like that there.
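One detail worth keeping in mind while weighing that hypothesis: `std::condition_variable::wait` only releases the mutex it is handed, so any other lock the waiting thread happens to hold stays held for as long as it blocks. A generic sketch of that shape (an invented scenario, not Multipass code):

```cpp
#include <condition_variable>
#include <mutex>

// Names mirror the trace, but whether any thread actually does this is a guess.
std::mutex state_mutex;
std::mutex metrics_mutex;
std::condition_variable cv;
bool ready = false;

void waits_while_holding_another_lock()
{
    std::lock_guard<std::mutex> keep{state_mutex};   // stays held the whole time
    std::unique_lock<std::mutex> lk{metrics_mutex};
    cv.wait(lk, [] { return ready; }); // releases metrics_mutex while waiting,
                                       // but state_mutex remains held, so any
                                       // thread needing it blocks indefinitely
}
```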
Hey @surahman,
I appreciate you digging into this, but I've investigated this deeply in the past and have an understanding of what the problem is. It's just that I need time to fix it, or to just remove the whole ability to force an instance to delete before launch is complete.
This issue definitely requires a deep understanding of what is going on architecturally in Multipass, and is as such most certainly out of my depth. I am following it to hone my skill set and because it is interesting to dig into and try to decipher what is going on.
Describe the bug
While using the LXD backend, Multipass entered a deadlock when trying to delete a launching instance. Neither `launch` nor `delete` returned.

The stack trace of the deadlocked threads:

Other threads: https://paste.ubuntu.com/p/cFXwZNsZBB/

To Reproduce
This is not deterministic, but the way I got it was:
* `multipass set local.driver=lxd`
* `multipass launch -n foo` in one terminal
* `multipass delete -p foo` in another, after a while

Expected behavior
Multipass would not block, even if it might have to fail one of the commands.

Additional info
This happened while testing a PR: