Open ricab opened 3 years ago
Deadlocks typically happen when all four conditions are satisfied:

* Mutual exclusion
* Hold and wait
* No preemption
* Circular wait
Without having looked at the code, and strictly from the stack trace, it appears that both

`multipass::LXDVirtualMachine::ensure_vm_is_running` at `/src/platform/backends/lxd/lxd_virtual_machine.cpp:317`
`multipass::LXDVirtualMachine::stop` at `/src/platform/backends/lxd/lxd_virtual_machine.cpp:227`

are waiting on the locks which the opposing thread holds; a circular wait.
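For concreteness, here is a minimal, generic sketch of that circular-wait shape. The names mirror the trace, but this is not Multipass code and I am only guessing at which locks are involved:

```cpp
#include <mutex>

// Illustrative names only; the actual members and lock order are assumptions.
std::mutex state_mutex;
std::mutex other_mutex;

void thread_a() // e.g. the path through ensure_vm_is_running
{
    std::lock_guard<std::mutex> first{state_mutex};
    // ... some work, then the second lock is needed ...
    std::lock_guard<std::mutex> second{other_mutex}; // blocks forever if thread_b ran first
}

void thread_b() // e.g. the path through stop
{
    std::lock_guard<std::mutex> first{other_mutex};
    std::lock_guard<std::mutex> second{state_mutex}; // blocks forever if thread_a ran first
}
```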
My suggested band-aid solution would be a modified spinlock: try to acquire the lock three times, with an n-second wait between attempts. On the third failed attempt, hard-fail the LWP after releasing all locks.
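Roughly, and assuming the relevant mutex could be made a `std::timed_mutex` (so this is a hypothetical sketch, not existing Multipass code):

```cpp
#include <chrono>
#include <mutex>

// Hypothetical helper: try a bounded number of times to take a timed mutex,
// then give up instead of blocking forever.
bool acquire_with_retries(std::timed_mutex& m, int attempts = 3,
                          std::chrono::seconds wait = std::chrono::seconds{1})
{
    for (int i = 0; i < attempts; ++i)
        if (m.try_lock_for(wait)) // blocks for at most `wait`, then reports failure
            return true;
    return false; // caller releases whatever it already holds and fails the operation
}
```

If this returns false, the caller would back out: release its own locks and fail the command rather than hang.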
A more robust solution would be to follow a distributed and data systems design concept called Strict 2-Phase Locking. You would couple this with assigning each resource lock a unique integer priority, acquiring all locks in increasing priority order, and releasing them in decreasing order.
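A minimal sketch of that ordering discipline, with hypothetical ranks (the actual mutexes, and a sensible ordering for them, would have to come from the code):

```cpp
#include <mutex>

// Assumed global ordering: every call site must acquire in increasing rank.
std::mutex metrics_mutex; // rank 1
std::mutex state_mutex;   // rank 2

void needs_both()
{
    std::unique_lock<std::mutex> low{metrics_mutex}; // lower rank first
    std::unique_lock<std::mutex> high{state_mutex};  // then higher rank
    // ... critical section ...
}   // destruction releases in reverse, i.e. decreasing-rank, order
```

When both locks are needed at a single call site, `std::scoped_lock` (or `std::lock`) can also take them together using a built-in deadlock-avoidance algorithm, which covers that particular case without a manual ordering.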
Another thing to consider is lock granularity: are the locks too coarse or too fine in the resources they encompass? Tweaking these may reduce concurrency throughput. This is just my two cents from experience and academia.
Thanks for the input @surahman.
> Deadlocks typically happen when all four conditions are satisfied:
>
> * Mutual exclusion
> * Hold and wait
> * No preemption
> * Circular wait
Good summary.
> Without having looked at the code, and strictly from the stack trace, it appears that both
>
> `multipass::LXDVirtualMachine::ensure_vm_is_running` at `/src/platform/backends/lxd/lxd_virtual_machine.cpp:317`
> `multipass::LXDVirtualMachine::stop` at `/src/platform/backends/lxd/lxd_virtual_machine.cpp:227`
>
> are waiting on the locks which the opposing thread holds; a circular wait.
>
> My suggested band-aid solution would be a modified spinlock: try to acquire the lock three times, with an n-second wait between attempts. On the third failed attempt, hard-fail the LWP after releasing all locks.
I'm not sure this is the sort of thing that would merit a band-aid approach. Probably better to fix it properly when we can.
> A more robust solution would be to follow a distributed and data systems design concept called Strict 2-Phase Locking. You would couple this with assigning each resource lock a unique integer priority, acquiring all locks in increasing priority order, and releasing them in decreasing order.
>
> Another thing to consider is lock granularity: are the locks too coarse or too fine in the resources they encompass? Tweaking these may reduce concurrency throughput. This is just my two cents from experience and academia.
All good ideas to jump-start dealing with this, when someone picks it up. I would add the preference of "making it as simple as possible, but no simpler", so probably look first at granularity/scope, circular dependencies, hold and wait.
> I'm not sure this is the sort of thing that would merit a band-aid approach. Probably better to fix it properly when we can.

I agree, it is like putting a finger over a crack in a dam wall.
This is a very interesting bug to follow along with. I wish I were more comfortable with the Multipass architecture and codebase to try to plug it. I do not feel like I fully grasp where all the resources are tied together in the code yet 😉.
I had a quick look around in the code and it seems as though both threads are waiting on the `state_mutex` for the virtual machine in question. This means that another thread is holding the `state_mutex` and is either running or itself waiting on a lock.

Thread 8 is running `mp::MetricsProvider::MetricsProvider`, holding the `metrics_mutex`, and waiting on a CV. The other threads seem to be running some system-level request operations and dealing with RPC calls. My guess, without a deeper knowledge of the architecture, is that the `MetricsProvider` in thread 8 might be holding, or have instigated the holding of, the `state_mutex`. That is just my guess based on the name of the routine and what the mutex is supposed to guard. That said, if it had instigated the holding of a lock in another thread, we should be able to discern it from the stack trace, and I do not see anything like that there.
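One detail worth keeping in mind while weighing that hypothesis: `std::condition_variable::wait` only releases the mutex it is handed, so any other lock the waiting thread happens to hold stays held for as long as it blocks. A generic sketch of that shape (an invented scenario, not Multipass code):

```cpp
#include <condition_variable>
#include <mutex>

// Names mirror the trace, but whether any thread actually does this is a guess.
std::mutex state_mutex;
std::mutex metrics_mutex;
std::condition_variable cv;
bool ready = false;

void waits_while_holding_another_lock()
{
    std::lock_guard<std::mutex> keep{state_mutex};   // stays held the whole time
    std::unique_lock<std::mutex> lk{metrics_mutex};
    cv.wait(lk, [] { return ready; }); // releases metrics_mutex while waiting,
                                       // but state_mutex remains held, so any
                                       // thread needing it blocks indefinitely
}
```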
Hey @surahman,
I appreciate you digging into this, but I've investigated this deeply in the past and have an understanding of what the problem is. It's just that I need time to fix it, or to just remove the whole ability to force an instance to delete before launch is complete.
This issue definitely requires a deep understanding of what is going on architecturally in Multipass, and is as such most certainly out of my depth. I am following it to hone my skill set and because it is interesting to dig into and try to decipher what is going on.
Describe the bug
While using the LXD backend, Multipass entered a deadlock when trying to delete a launching instance. Neither `launch` nor `delete` returned.

The stack trace of the deadlocked threads:

Other threads: https://paste.ubuntu.com/p/cFXwZNsZBB/

To Reproduce
This is not deterministic, but the way I got it was:
* `multipass set local.driver=lxd`
* `multipass launch -n foo` in one terminal
* `multipass delete -p foo` in another, after a while

Expected behavior
Multipass would not block, even if it might have to fail one of the commands.

Additional info
This happened while testing a PR: