DOCGroup / ACE_TAO

ACE and TAO
https://www.dre.vanderbilt.edu/~schmidt/TAO.html
698 stars 374 forks source link

Change ACE_Barrier running_threads_ to atomic to ensure accurate values across cores/CPUs #2170

Closed shuston closed 8 months ago

shuston commented 9 months ago

Customer experiences ACE_Barrier that (intermittently) never counts down to 1 under heavy system load. Theorizing that the ACE_Sub_Barrier runningthreads member needs to be atomic to properly be viewed/manipulated across cores/CPUs.

jwillemsen commented 9 months ago

I haven’t look at the existing code to see if there is a race condition possible, but I would like to propose to use a std::atomic here

shuston commented 9 months ago

I haven’t look at the existing code to see if there is a race condition possible, but I would like to propose to use a std::atomic here

I like that... I guess I went too "old school" with this change ;-) I will do that - thanks, Johnny!

shuston commented 9 months ago

@jwillemsen all the Windows tasks fail... do you know if that's "normal"?

jwillemsen commented 9 months ago

I think that is a problem with the vcpkg version used in your branch, I would recommend to merge master into your branch, I think that should resolve that compile issue

jwillemsen commented 9 months ago

ACE_Barrier does have a guard that is locked always before using running_threads_

jwillemsen commented 9 months ago

Maybe worth some test extension? Did you hear from your customer whether this change fixes their issue?

shuston commented 9 months ago

ACE_Barrier does have a guard that is locked always before using running_threads_

Yes, but we added an ACE_DEBUG inside that lock and 5 threads called wait and the runningthreads count was 5 in every thread. The problem only occurs under tremendous load. It may take them some time to get the change into QA and run enough load to see if the problem still occurs.

jwillemsen commented 9 months ago

From a quick code inspection I don't see a real problem, but there is already a test extension in the old bugzilla which the author says doesn't exit, see http://bugzilla.dre.vanderbilt.edu/show_bug.cgi?id=4078, maybe that helps