Closed evanmiller closed 6 months ago
Solid change, but promise.h hasn't been used in any public code yet. I was actually considering removing it from the codebase. Less code is best code, its unit test occasionally fail, and since it only offers a blocking API, it's not useful to building our non-blocking libraries. It's only handy in unit test and samples like this.
Sorry for being a jerk. I'm going to let this PR hang in limbo until the fate of promise.h is decided (this month, likely). We'll remove it, fix it, or replace it with something similar.
Thank you for the swift attention to my patch.
Based on my reading of this SO thread and a similar data race here, it seems that aws_promise_complete
and aws_promise_fail
should perform their CV notifications while still holding the mutex rather than immediately after. If there is a spurious wakeup in the main thread while the worker thread is between these two lines
then the main thread may proceed to destroy the condition variable before the worker can execute its notification. According to this theory, swapping those two lines of code (as well as the last two lines of aws_promise_fail
) will fix the race condition.
As to the question of whether to keep promise.h
: The race condition discovered here would also have presented itself if the original elasticurl
code had attempted to clean up its condition variable. Mutexes and condition variable logic is hard to get right; as long as the library provides mutex and cvar primitives, my vote would be to keep (and fix) promises in order to prevent similar data race conditions in client code.
As to the question of whether to keep
promise.h
: The race condition discovered here would also have presented itself if the originalelasticurl
code had attempted to clean up its condition variable. Mutexes and condition variable logic is hard to get right; as long as the library provides mutex and cvar primitives, my vote would be to keep (and fix) promises in order to prevent similar data race conditions in client code.
Just a minor nitpick, but the current elasticurl implementation does not have a race condition. Predicate-based waits do not suffer from problems when a spurious wakeup occurs.
Just a minor nitpick, but the current elasticurl implementation does not have a race condition. Predicate-based waits do not suffer from problems when a spurious wakeup occurs.
I must be misunderstanding what a spurious wakeup entails. My mental model for the data race is:
Worker sets exchange_completed
to true
Worker unlocks mutex
Main thread spuriously wakes up (returns from pthread_cond_wait
)
Main thread checks predicate condition – condition is true
Main thread returns from aws_condition_variable_wait_pred
Main thread continues
Worker performs notify
In my current understanding, there no data race only because elasticurl
leaks the condition variable resource rather than freeing it. I would appreciate any clarification though.
Just a minor nitpick, but the current elasticurl implementation does not have a race condition. Predicate-based waits do not suffer from problems when a spurious wakeup occurs.
I must be misunderstanding what a spurious wakeup entails. My mental model for the data race is:
Worker sets
exchange_completed
to true Worker unlocks mutex Main thread spuriously wakes up (returns frompthread_cond_wait
) Main thread checks predicate condition – condition is true Main thread returns fromaws_condition_variable_wait_pred
Main thread continues Worker performs notifyIn my current understanding, there no data race only because
elasticurl
leaks the condition variable resource rather than freeing it. I would appreciate any clarification though.
Invoking the library cleanup will also invoke a join function (https://github.com/awslabs/aws-c-common/blob/main/include/aws/common/thread.h#L169-L178) that does not return until all threads spawned by any of the aws-c-* libraries are fully joined.
Now if you cleaned up the condition variable before that happened, then yes, you could theoretically destroy it while it was still in use.
With https://github.com/awslabs/aws-c-common/pull/1023 merged, it would be great if someone could re-run the failed sanitizer test.
Thank you for the PR, sorry; we have decided to remove the promise.h class, so I am closing this PR as well for now.
Issue #, if available:
Description of changes:
promise.h
neatly encapsulates what the mutex and condition variable are doing in theelasticurl
code. I found the code to be more succinct and comprehensible by using it.Tested locally with the success and failure code paths.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.