Beckhoff / ADS

Beckhoff protocol to communicate with TwinCAT devices.

DeadLock with the event Notification #43

Closed: Satheesh-satz closed this issue 7 years ago

Satheesh-satz commented 7 years ago

Hi,

We are running into a deadlock when we subscribe to a number of events by calling "AdsSyncAddDeviceNotificationReqEx" while holding "Mutex A". That call internally invokes NotificationDispatcher::Emplace, which takes another mutex ("std::lock_guard lock(mutex)") used to synchronize NotificationDispatcher::Run(), NotificationDispatcher::Erase and NotificationDispatcher::Emplace.

The deadlock occurs when an event is received during the subscribe call: NotificationDispatcher::Run() acquires that same mutex ("std::lock_guard lock(mutex)"), invokes the registered callback, and waits for the callback to return, while the callback in turn needs "Mutex A" to do its work before it can return.
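
To make the lock ordering explicit, here is a minimal, self-contained sketch of the two threads involved (the names mutexA, Subscribe, Callback and DispatcherRun are ours for illustration only; they are not AdsLib identifiers):

// Sketch of the lock ordering that can deadlock. With unlucky timing,
// both threads block forever: each one holds the lock the other needs.
#include <mutex>
#include <thread>

std::mutex mutexA;                     // the application's "Mutex A"
std::recursive_mutex dispatcherMutex;  // guards the notifications map

void Callback()                        // registered notification callback
{
    std::lock_guard<std::mutex> appLock(mutexA);   // blocks: Subscribe() holds Mutex A
}

void Subscribe()                       // application thread
{
    std::lock_guard<std::mutex> appLock(mutexA);   // step 1: take Mutex A
    // AdsSyncAddDeviceNotificationReqEx -> NotificationDispatcher::Emplace:
    std::lock_guard<std::recursive_mutex> dispLock(dispatcherMutex);  // blocks if Run() holds it
}

void DispatcherRun()                   // dispatcher thread
{
    std::lock_guard<std::recursive_mutex> dispLock(dispatcherMutex);  // step 2: take dispatcher mutex
    Callback();                        // callback invoked while the dispatcher mutex is held
}

int main()
{
    std::thread app(Subscribe);
    std::thread dispatcher(DispatcherRun);
    app.join();
    dispatcher.join();
}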

We worked around the issue by releasing the mutex ("std::lock_guard lock(mutex)") before calling the callback function (code attached), and that seems to work fine. Could you please look into this issue and release an update with this fix?

Code Snippet

std::map<uint32_t, std::shared_ptr<Notification> > notifications;
void NotificationDispatcher::Emplace(uint32_t hNotify, Notification& notification)
{
   LOG_ERROR("From Inside NotificationDispatcher::Emplace before trying to get the mutex lock");
    std::lock_guard<std::recursive_mutex> lock(mutex);
    LOG_ERROR("From Inside NotificationDispatcher::Emplace After lock has acquired");
    notifications.emplace(hNotify, std::make_shared<Notification>(notification));
}

void NotificationDispatcher::Run()
{
   while (sem.Wait())
   {
      const auto length = ring.ReadFromLittleEndian<uint32_t>();
      (void)length;
      const auto numStamps = ring.ReadFromLittleEndian<uint32_t>();
      for (uint32_t stamp = 0; stamp < numStamps; ++stamp) {
         const auto timestamp = ring.ReadFromLittleEndian<uint64_t>();
         const auto numSamples = ring.ReadFromLittleEndian<uint32_t>();
         for (uint32_t sample = 0; sample < numSamples; ++sample) {
             const auto hNotify = ring.ReadFromLittleEndian<uint32_t>();
             const auto size = ring.ReadFromLittleEndian<uint32_t>();
             std::shared_ptr<Notification> ptr;
             {
                LOG_ERROR("From Inside NotificationDispatcher::Run before trying to get the mutex lock");
                std::lock_guard<std::recursive_mutex> lock(mutex);
                LOG_ERROR("From Inside NotificationDispatcher::Run After lock has acquired");
                auto it = notifications.find(hNotify);
                if (it != notifications.end())
                {
                   ptr = it->second;
                }
             }
             LOG_ERROR("From Inside NotificationDispatcher::Run released the lock");
             if (ptr)
             {
                auto& notification = *ptr;
                if (size != notification.Size())
                {
                   LOG_WARN("Notification sample size: " << size << " doesn't match: " << notification.Size());
                   ring.Read(size);
                   return;
                }
                notification.Notify(timestamp, ring);
             }
             else
             {
                ring.Read(size);
             }
         }
      }
   }
}

Thanks, Satheesh

pbruenn commented 7 years ago

Hi Satheesh, your code snippet was very hard to read, so I added two more "`" characters to your comment to improve readability.

I am not sure if it is a good idea to change the map of notifications into a map of shared pointers. I have to think more about that and run some tests before I can patch it. Unfortunately, I am out of office until the 8th of May, but you could run your own patched version until then.

In my opinion, the callback function invoked by notification.Notify() should be lightweight, like an interrupt service routine, so I would try to avoid dependencies such as Mutex A inside it. Is that possible for your application? Could you provide some example code that triggers your deadlock?
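
For illustration, a lightweight callback could just copy the sample into a queue and defer the real work, including any Mutex A locking, to an application thread. A minimal sketch (OnNotification, Worker, sampleQueue and the callback signature are placeholders, not AdsLib API):

// Sketch: keep the notification callback lightweight by deferring the
// real work to an application thread. All names below are illustrative.
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <vector>

std::mutex queueMutex;                          // guards sampleQueue only
std::condition_variable queueCv;
std::queue<std::vector<uint8_t>> sampleQueue;   // copied notification samples

// Registered as the notification callback: copy the data and return quickly.
void OnNotification(const uint8_t* data, uint32_t size)
{
    {
        std::lock_guard<std::mutex> lock(queueMutex);
        sampleQueue.emplace(data, data + size);
    }
    queueCv.notify_one();                       // no application locks ("Mutex A") taken here
}

// Application worker thread: free to take "Mutex A" without deadlocking
// the dispatcher, because the dispatcher never waits on this thread.
void Worker()
{
    for (;;) {
        std::unique_lock<std::mutex> lock(queueMutex);
        queueCv.wait(lock, [] { return !sampleQueue.empty(); });
        auto sample = std::move(sampleQueue.front());
        sampleQueue.pop();
        lock.unlock();
        // ... process sample; locking Mutex A here is safe ...
    }
}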

Regards, Patrick

RoelBlomjousDevelopment commented 7 years ago

Hi Patrick,

Your proposal of handling these callbacks as if they were interrupt service routines is valid in principle. In a real ISR it is not allowed to introduce context switches, so taking a lock in the callback is forbidden. But ISR callbacks have one big difference: they execute in the interrupt context, where the CPU is effectively "paused" and no other thread can interfere with the running ISR.

The ADS library, by contrast, uses a worker thread to notify all its clients. So a user of ADS knows one thing for sure: the callback function is not executed on the thread where the subscription was made. Using the data on any thread other than the callback thread (as your ISR analogy suggests) therefore requires locking or synchronizing via lock-free queues, and both solutions have their advantages and disadvantages for the client.

I don't think holding the lock while calling the callback function adds any value, since the lock only guards the notifications list, and as Satheesh describes, it can even cause problems for clients. Changing the ownership of the Notification object to a shared_ptr makes it possible to lock only the list lookup. Because the list might change after the Notification object has been found, the shared_ptr ensures there are no race conditions when a notification is added or removed while the callback is executing.
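
Condensed, the pattern looks like this (a sketch with placeholder types; Dispatch stands in for the lookup inside NotificationDispatcher::Run()):

// Sketch of the shared_ptr ownership pattern: the mutex guards only the
// map lookup, and the callback runs after the lock has been released.
// Notification and Dispatch are simplified placeholders.
#include <cstdint>
#include <functional>
#include <map>
#include <memory>
#include <mutex>

struct Notification { std::function<void()> Notify; };  // placeholder

std::recursive_mutex mapMutex;
std::map<uint32_t, std::shared_ptr<Notification>> notifications;

void Dispatch(uint32_t hNotify)
{
    std::shared_ptr<Notification> ptr;
    {
        std::lock_guard<std::recursive_mutex> lock(mapMutex);  // lock held only for the lookup
        auto it = notifications.find(hNotify);
        if (it != notifications.end())
            ptr = it->second;  // the copy keeps the object alive after unlock
    }
    if (ptr)
        ptr->Notify();  // even if the map entry is erased concurrently,
                        // the shared_ptr keeps *ptr valid for this call
}

int main()
{
    notifications.emplace(1, std::make_shared<Notification>(
        Notification{ []{ /* user callback; may take Mutex A safely */ } }));
    Dispatch(1);
}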

Regards, Roel Blomjous

pbruenn commented 7 years ago

Hi Roel, thanks for your detailed comment. You are absolutely right, it makes no sense to keep the lock while calling the callback. I prepared a patch 0048238639cc7c4d52adf56aa3afbab8307b59f8 on branch dev-lockfree-callbacks. It is only compile-tested by me, as I have no access to any test environment until the 8th of May. However, Satheesh and others, please give it a try and let me know two things:

  1. does it fix your issue?
  2. does it bring any regressions or performance drawbacks?

I will test and merge this patch when I am back in the office in May. Regards, Patrick

Satheesh-satz commented 7 years ago

Hi Patrick,

I have verified the patch 0048238. I am not seeing the deadlock issue with this patch, but I am still working on the regression and other performance testing.

I will update you soon with the result.

Thanks, Satheesh

pbruenn commented 7 years ago

Thanks for testing, Satheesh. My tests were successful, too, so I merged to master and pushed it online. The commit hash changed, but the patch should match exactly: during my vacation we had some minor changes on our internal repository, which I didn't want to mix into this patch. Regards, Patrick