IOTechSystems / iotech-c-utils

Apache License 2.0
9 stars 7 forks source link

Flaky failure: Segfaults and Aborts when exiting, 2-3% of time #255

Open TomBrennan-Eaton opened 1 year ago

TomBrennan-Eaton commented 1 year ago

I observed this in test runs where a test program that used the device-sdk-c is checked for issues on exit. In runs of 100 instances (new KVM each time), I would observe two different failures about 2-3% of the time.

One failure reported: E level=DEBUG ts=2022-10-21T22:45:00Z msg="Unsubscribe was successful" E my-test: /path-to/device-sdk-c/src/c/iot/component.c:101: iot_component_unlock: Assertion `(pthread_mutex_unlock (&component->mutex)) == 0' failed. E Aborted whereas on success, we would see E level=DEBUG ts=2022-10-21T22:44:38Z msg="Unsubscribe was successful" E level=DEBUG ts=2022-10-21T22:44:39Z msg="Thread exiting"

Looking at the manpages for pthread_mutex_unlock(), there are three possibilities for the error return: EINVAL The value specified by mutex does not refer to an initialized mutex object. EAGAIN The mutex could not be acquired because the maximum number of recursive locks for mutex has been exceeded. EPERM The current thread does not own the mutex.

The other failure type reported: E level=DEBUG ts=2022-10-21T23:07:52Z msg="Unsubscribe was successful" E Segmentation fault So it was failing in about the same place, probably due to an invalid reference.

TomBrennan-Eaton commented 1 year ago

I worked first on solving the problem with iot_component_unlock, then noticed that the reference decrementing function didn't guard against underflow. The changes in the following patch fix both problems; I'm not sure if the fix for iot_component_unlock() is the most appropriate - my operating assumption was that the component pointer or the mutex were invalid. These changes fix the problems even in runs of 1000x duplicates.

diff -ruN device-sdk-c/src/c/orig/component.c device-sdk-c/src/c/iot/component.c 
--- device-sdk-c/src/c/orig/component.c 2022-10-25 11:14:52.266957623 -0400
+++ device-sdk-c/src/c/iot/component.c  2022-10-26 15:06:33.713209902 -0400
@@ -44,6 +44,11 @@

 bool iot_component_dec_ref (iot_component_t * component)
 {
+  if ((component == NULL) || (atomic_load(&component->refs) == 0))
+  {
+    return false;
+  }
+  // Else decrement and return the result
   return (atomic_fetch_add (&component->refs, -1) <= 1);
 }

@@ -97,8 +102,22 @@

 iot_component_state_t iot_component_unlock (iot_component_t * component)
 {
-  iot_component_state_t state = component->state;
-  IOT_RET_CHECK (pthread_mutex_unlock (&component->mutex));
+  iot_component_state_t state = IOT_COMPONENT_STOPPED;
+  if ( component != NULL )
+  {
+    state = component->state;
+    if ((state == IOT_COMPONENT_RUNNING) || (state == IOT_COMPONENT_STARTING))
+    {
+      IOT_RET_CHECK (pthread_mutex_unlock (&component->mutex));
+    }
+    else
+    {
+      /* In the case where this component is being deleted, or component data is just invalid,
+      * the mutex condition is unreliable and we do sometimes get an error return.
+      * Turns out that, in those cases, it is safe to simply ignore the return value. */
+      pthread_mutex_unlock (&component->mutex);
+    }
+  }
   return state;
 }
TomBrennan-Eaton commented 1 year ago

@iain-anderson Will this patch be incorporated?