ThreadSanitizer says ustring.cpp:179 has a data race

gchatelet commented 11 years ago

While using ThreadSanitizer to check Duke's implementation I stumbled upon this data race.

WARNING: ThreadSanitizer: data race (pid=17556)
  Atomic write of size 4 at 0x7ff153d4cdc0 by thread T2:
    #0 __tsan_atomic32_exchange ??:0 (exe+0x0000001376a7)
    #1 OpenImageIO::v1_3::spin_mutex::try_lock() /home/clitte/git/oiio/src/include/thread.h:570 (libOpenImageIO.so.1.3+0x0000003d482a)
    #2 OpenImageIO::v1_3::spin_mutex::lock() /home/clitte/git/oiio/src/include/thread.h:534 (libOpenImageIO.so.1.3+0x0000003d4731)
    #3 OpenImageIO::v1_3::spin_rw_mutex::read_lock() /home/clitte/git/oiio/src/include/thread.h:633 (libOpenImageIO.so.1.3+0x000000b62ff7)
    #4 read_lock_guard /home/clitte/git/oiio/src/include/thread.h:669 (libOpenImageIO.so.1.3+0x000000b62f96)
    #5 read_lock_guard /home/clitte/git/oiio/src/include/thread.h:669 (libOpenImageIO.so.1.3+0x000000b58430)
    #6 OpenImageIO::v1_3::ustring::make_unique(char const*) /home/clitte/git/oiio/src/libutil/ustring.cpp:179 (libOpenImageIO.so.1.3+0x000000b56457)
    #7 ustring /home/clitte/git/oiio/src/include/ustring.h:166 (libOpenImageIO.so.1.3+0x000000306ea6)
    #8 ustring /home/clitte/git/oiio/src/include/ustring.h:167 (libOpenImageIO.so.1.3+0x000000306e10)
    #9 ustring /home/clitte/git/oiio/src/include/ustring.h:186 (libOpenImageIO.so.1.3+0x000000306c67)
    #10 ustring /home/clitte/git/oiio/src/include/ustring.h:186 (libOpenImageIO.so.1.3+0x000000306bd0)
    #11 OpenImageIO::v1_3::ParamValue::init(std::string, OpenImageIO::v1_3::TypeDesc, int, void const*, bool) /home/clitte/git/oiio/src/include/paramlist.h:104 (libOpenImageIO.so.1.3+0x0000002e8399)
    #12 OpenImageIO::v1_3::ImageSpec::attribute(std::string const&, OpenImageIO::v1_3::TypeDesc, void const*) /home/clitte/git/oiio/src/libOpenImageIO/formatspec.cpp:355 (libOpenImageIO.so.1.3+0x0000002d962f)
    #13 OpenImageIO::v1_3::ImageSpec::attribute(std::string const&, char const*) /home/clitte/git/oiio/src/include/imageio.h:306 (libOpenImageIO.so.1.3+0x0000002c2643)
    #14 OpenImageIO::v1_3::JpgInput::open(std::string const&, OpenImageIO::v1_3::ImageSpec&) /home/clitte/git/oiio/src/jpeg.imageio/jpeginput.cpp:212 (libOpenImageIO.so.1.3+0x000000d76af6)
    #15 OpenImageIOReader /home/clitte/git/duke/src/duke/imageio_plugins/openimageio/OpenImageIO.cpp:135 (exe+0x0000001603b6)
    #16 OpenImageIOReader /home/clitte/git/duke/src/duke/imageio_plugins/openimageio/OpenImageIO.cpp:175 (exe+0x00000015fee0)
    #17 duke::OpenImageIODescriptor::getReaderFromFile(char const*) const /home/clitte/git/duke/src/duke/imageio_plugins/openimageio/OpenImageIO.cpp:223 (exe+0x00000015fdba)
    #18 duke::(anonymous namespace)::tryReader(char const*, duke::IIODescriptor const*, std::function<void (duke::RawPackedFrame&&, void const*)> const&) /home/clitte/git/duke/src/duke/engine/ImageLoadUtils.cpp:45 (exe+0x0000002e7f80)
    #19 duke::(anonymous namespace)::load(char const*, char const*, std::function<void (duke::RawPackedFrame&&, void const*)> const&) /home/clitte/git/duke/src/duke/engine/ImageLoadUtils.cpp:55 (exe+0x0000002e6d71)
    #20 duke::load(char const*, char const*, std::function<void (duke::RawPackedFrame&&, void const*)> const&, std::string&) /home/clitte/git/duke/src/duke/engine/ImageLoadUtils.cpp:66 (exe+0x0000002e69a3)
    #21 duke::LoadedImageCache::workerStep(std::pair<duke::IMediaStream const*, unsigned long>&, std::string&, std::string&) /home/clitte/git/duke/src/duke/engine/cache/LoadedImageCache.cpp:142 (exe+0x00000024d04c)
    #22 duke::LoadedImageCache::workerFunction() /home/clitte/git/duke/src/duke/engine/cache/LoadedImageCache.cpp:109 (exe+0x00000024c82f)
    #23 void std::_Mem_fn<void (duke::LoadedImageCache::*)()>::operator()<, void>(duke::LoadedImageCache*) const /usr/lib64/gcc/x86_64-unknown-linux-gnu/4.8.2/../../../../include/c++/4.8.2/functional:601 (exe+0x00000026ca43)
    #24 void std::_Bind_simple<std::_Mem_fn<void (duke::LoadedImageCache::*)()> (duke::LoadedImageCache*)>::_M_invoke<0ul>(std::_Index_tuple<0ul>) /usr/lib64/gcc/x86_64-unknown-linux-gnu/4.8.2/../../../../include/c++/4.8.2/functional:1732 (exe+0x00000026c8d0)
    #25 std::_Bind_simple<std::_Mem_fn<void (duke::LoadedImageCache::*)()> (duke::LoadedImageCache*)>::operator()() /usr/lib64/gcc/x86_64-unknown-linux-gnu/4.8.2/../../../../include/c++/4.8.2/functional:1720 (exe+0x00000026c840)
    #26 std::thread::_Impl<std::_Bind_simple<std::_Mem_fn<void (duke::LoadedImageCache::*)()> (duke::LoadedImageCache*)> >::_M_run() /usr/lib64/gcc/x86_64-unknown-linux-gnu/4.8.2/../../../../include/c++/4.8.2/thread:115 (exe+0x00000026c7e9)
    #27 <null> <null>:0 (libstdc++.so.6+0x0000000b1d2f)

  Previous write of size 4 at 0x7ff153d4cdc0 by thread T1:
    #0 OpenImageIO::v1_3::spin_mutex::unlock() /home/clitte/git/oiio/src/include/thread.h:553 (libOpenImageIO.so.1.3+0x0000003d4618)
    #1 OpenImageIO::v1_3::spin_rw_mutex::read_lock() /home/clitte/git/oiio/src/include/thread.h:637 (libOpenImageIO.so.1.3+0x000000b63015)
    #2 read_lock_guard /home/clitte/git/oiio/src/include/thread.h:669 (libOpenImageIO.so.1.3+0x000000b62f96)
    #3 read_lock_guard /home/clitte/git/oiio/src/include/thread.h:669 (libOpenImageIO.so.1.3+0x000000b58430)
    #4 OpenImageIO::v1_3::ustring::make_unique(char const*) /home/clitte/git/oiio/src/libutil/ustring.cpp:179 (libOpenImageIO.so.1.3+0x000000b56457)
    #5 ustring /home/clitte/git/oiio/src/include/ustring.h:166 (libOpenImageIO.so.1.3+0x000000306ea6)
    #6 ustring /home/clitte/git/oiio/src/include/ustring.h:167 (libOpenImageIO.so.1.3+0x000000306e10)
    #7 ustring /home/clitte/git/oiio/src/include/ustring.h:186 (libOpenImageIO.so.1.3+0x000000306c67)
    #8 ustring /home/clitte/git/oiio/src/include/ustring.h:186 (libOpenImageIO.so.1.3+0x000000306bd0)
    #9 OpenImageIO::v1_3::ParamValue::init(std::string, OpenImageIO::v1_3::TypeDesc, int, void const*, bool) /home/clitte/git/oiio/src/include/paramlist.h:104 (libOpenImageIO.so.1.3+0x0000002e8399)
    #10 OpenImageIO::v1_3::ImageSpec::attribute(std::string const&, OpenImageIO::v1_3::TypeDesc, void const*) /home/clitte/git/oiio/src/libOpenImageIO/formatspec.cpp:355 (libOpenImageIO.so.1.3+0x0000002d962f)
    #11 OpenImageIO::v1_3::ImageSpec::attribute(std::string const&, char const*) /home/clitte/git/oiio/src/include/imageio.h:306 (libOpenImageIO.so.1.3+0x0000002c2643)
    #12 OpenImageIO::v1_3::JpgInput::open(std::string const&, OpenImageIO::v1_3::ImageSpec&) /home/clitte/git/oiio/src/jpeg.imageio/jpeginput.cpp:212 (libOpenImageIO.so.1.3+0x000000d76af6)
    #13 OpenImageIOReader /home/clitte/git/duke/src/duke/imageio_plugins/openimageio/OpenImageIO.cpp:135 (exe+0x0000001603b6)
    #14 OpenImageIOReader /home/clitte/git/duke/src/duke/imageio_plugins/openimageio/OpenImageIO.cpp:175 (exe+0x00000015fee0)
    #15 duke::OpenImageIODescriptor::getReaderFromFile(char const*) const /home/clitte/git/duke/src/duke/imageio_plugins/openimageio/OpenImageIO.cpp:223 (exe+0x00000015fdba)
    #16 duke::(anonymous namespace)::tryReader(char const*, duke::IIODescriptor const*, std::function<void (duke::RawPackedFrame&&, void const*)> const&) /home/clitte/git/duke/src/duke/engine/ImageLoadUtils.cpp:45 (exe+0x0000002e7f80)
    #17 duke::(anonymous namespace)::load(char const*, char const*, std::function<void (duke::RawPackedFrame&&, void const*)> const&) /home/clitte/git/duke/src/duke/engine/ImageLoadUtils.cpp:55 (exe+0x0000002e6d71)
    #18 duke::load(char const*, char const*, std::function<void (duke::RawPackedFrame&&, void const*)> const&, std::string&) /home/clitte/git/duke/src/duke/engine/ImageLoadUtils.cpp:66 (exe+0x0000002e69a3)
    #19 duke::LoadedImageCache::workerStep(std::pair<duke::IMediaStream const*, unsigned long>&, std::string&, std::string&) /home/clitte/git/duke/src/duke/engine/cache/LoadedImageCache.cpp:142 (exe+0x00000024d04c)
    #20 duke::LoadedImageCache::workerFunction() /home/clitte/git/duke/src/duke/engine/cache/LoadedImageCache.cpp:109 (exe+0x00000024c82f)
    #21 void std::_Mem_fn<void (duke::LoadedImageCache::*)()>::operator()<, void>(duke::LoadedImageCache*) const /usr/lib64/gcc/x86_64-unknown-linux-gnu/4.8.2/../../../../include/c++/4.8.2/functional:601 (exe+0x00000026ca43)
    #22 void std::_Bind_simple<std::_Mem_fn<void (duke::LoadedImageCache::*)()> (duke::LoadedImageCache*)>::_M_invoke<0ul>(std::_Index_tuple<0ul>) /usr/lib64/gcc/x86_64-unknown-linux-gnu/4.8.2/../../../../include/c++/4.8.2/functional:1732 (exe+0x00000026c8d0)
    #23 std::_Bind_simple<std::_Mem_fn<void (duke::LoadedImageCache::*)()> (duke::LoadedImageCache*)>::operator()() /usr/lib64/gcc/x86_64-unknown-linux-gnu/4.8.2/../../../../include/c++/4.8.2/functional:1720 (exe+0x00000026c840)
    #24 std::thread::_Impl<std::_Bind_simple<std::_Mem_fn<void (duke::LoadedImageCache::*)()> (duke::LoadedImageCache*)> >::_M_run() /usr/lib64/gcc/x86_64-unknown-linux-gnu/4.8.2/../../../../include/c++/4.8.2/thread:115 (exe+0x00000026c7e9)
    #25 <null> <null>:0 (libstdc++.so.6+0x0000000b1d2f)

  Thread T2 (tid=17560, running) created by main thread at:
    #0 pthread_create ??:0 (exe+0x000000123772)
    #1 <null> <null>:0 (libstdc++.so.6+0x0000000b1f7e)
    #2 std::thread::thread<void (duke::LoadedImageCache::*)(), duke::LoadedImageCache*>(void (duke::LoadedImageCache::*&&)(), duke::LoadedImageCache*&&) /usr/lib64/gcc/x86_64-unknown-linux-gnu/4.8.2/../../../../include/c++/4.8.2/thread:138 (exe+0x000000268ab0)
    #3 void __gnu_cxx::new_allocator<std::thread>::construct<std::thread, void (duke::LoadedImageCache::*)(), duke::LoadedImageCache*>(std::thread*, void (duke::LoadedImageCache::*&&)(), duke::LoadedImageCache*&&) /usr/lib64/gcc/x86_64-unknown-linux-gnu/4.8.2/../../../../include/c++/4.8.2/ext/new_allocator.h:120 (exe+0x0000002689f5)
    #4 _ZNSt16allocator_traitsISaISt6threadEE12_S_constructIS0_JMN4duke16LoadedImageCacheEFvvEPS5_EEENSt9enable_ifIXsr18__construct_helperIT_DpT0_EE5valueEvE4typeERS1_PSA_DpOSB_ /usr/lib64/gcc/x86_64-unknown-linux-gnu/4.8.2/../../../../include/c++/4.8.2/bits/alloc_traits.h:254 (exe+0x0000002688fd)
    #5 _ZNSt16allocator_traitsISaISt6threadEE9constructIS0_JMN4duke16LoadedImageCacheEFvvEPS5_EEEDTcl12_S_constructfp_fp0_spclsr3stdE7forwardIT0_Efp1_EEERS1_PT_DpOS9_ /usr/lib64/gcc/x86_64-unknown-linux-gnu/4.8.2/../../../../include/c++/4.8.2/bits/alloc_traits.h:393 (exe+0x000000266acd)
    #6 void std::vector<std::thread, std::allocator<std::thread> >::_M_emplace_back_aux<void (duke::LoadedImageCache::*)(), duke::LoadedImageCache*>(void (duke::LoadedImageCache::*&&)(), duke::LoadedImageCache*&&) /usr/lib64/gcc/x86_64-unknown-linux-gnu/4.8.2/../../../../include/c++/4.8.2/bits/vector.tcc:409 (exe+0x000000266d0e)
    #7 void std::vector<std::thread, std::allocator<std::thread> >::emplace_back<void (duke::LoadedImageCache::*)(), duke::LoadedImageCache*>(void (duke::LoadedImageCache::*&&)(), duke::LoadedImageCache*&&) /usr/lib64/gcc/x86_64-unknown-linux-gnu/4.8.2/../../../../include/c++/4.8.2/bits/vector.tcc:101 (exe+0x00000024f497)
    #8 duke::LoadedImageCache::startWorkers() /home/clitte/git/duke/src/duke/engine/cache/LoadedImageCache.cpp:93 (exe+0x00000024ba46)
    #9 duke::LoadedImageCache::load(duke::Timeline const&) /home/clitte/git/duke/src/duke/engine/cache/LoadedImageCache.cpp:32 (exe+0x00000024bbbb)
    #10 duke::LoadedTextureCache::load(duke::Timeline const&) /home/clitte/git/duke/src/duke/engine/cache/LoadedTextureCache.cpp:20 (exe+0x00000020d995)
    #11 duke::Player::load(duke::Timeline const&, FrameDuration const&) /home/clitte/git/duke/src/duke/engine/Player.cpp:11 (exe+0x000000277019)
    #12 duke::DukeMainWindow::load(duke::Timeline const&, FrameDuration const&, duke::FitMode, int) /home/clitte/git/duke/src/duke/engine/DukeMainWindow.cpp:64 (exe+0x0000001e9269)
    #13 DukeApplication /home/clitte/git/duke/src/duke/engine/DukeApplication.cpp:136 (exe+0x0000001be2ef)
    #14 main /home/clitte/git/duke/src/duke/main.cpp:37 (exe+0x000000168d34)

  Thread T1 (tid=17559, running) created by main thread at:
    #0 pthread_create ??:0 (exe+0x000000123772)
    #1 <null> <null>:0 (libstdc++.so.6+0x0000000b1f7e)
    #2 std::thread::thread<void (duke::LoadedImageCache::*)(), duke::LoadedImageCache*>(void (duke::LoadedImageCache::*&&)(), duke::LoadedImageCache*&&) /usr/lib64/gcc/x86_64-unknown-linux-gnu/4.8.2/../../../../include/c++/4.8.2/thread:138 (exe+0x000000268ab0)
    #3 void __gnu_cxx::new_allocator<std::thread>::construct<std::thread, void (duke::LoadedImageCache::*)(), duke::LoadedImageCache*>(std::thread*, void (duke::LoadedImageCache::*&&)(), duke::LoadedImageCache*&&) /usr/lib64/gcc/x86_64-unknown-linux-gnu/4.8.2/../../../../include/c++/4.8.2/ext/new_allocator.h:120 (exe+0x0000002689f5)
    #4 _ZNSt16allocator_traitsISaISt6threadEE12_S_constructIS0_JMN4duke16LoadedImageCacheEFvvEPS5_EEENSt9enable_ifIXsr18__construct_helperIT_DpT0_EE5valueEvE4typeERS1_PSA_DpOSB_ /usr/lib64/gcc/x86_64-unknown-linux-gnu/4.8.2/../../../../include/c++/4.8.2/bits/alloc_traits.h:254 (exe+0x0000002688fd)
    #5 _ZNSt16allocator_traitsISaISt6threadEE9constructIS0_JMN4duke16LoadedImageCacheEFvvEPS5_EEEDTcl12_S_constructfp_fp0_spclsr3stdE7forwardIT0_Efp1_EEERS1_PT_DpOS9_ /usr/lib64/gcc/x86_64-unknown-linux-gnu/4.8.2/../../../../include/c++/4.8.2/bits/alloc_traits.h:393 (exe+0x000000266acd)
    #6 void std::vector<std::thread, std::allocator<std::thread> >::_M_emplace_back_aux<void (duke::LoadedImageCache::*)(), duke::LoadedImageCache*>(void (duke::LoadedImageCache::*&&)(), duke::LoadedImageCache*&&) /usr/lib64/gcc/x86_64-unknown-linux-gnu/4.8.2/../../../../include/c++/4.8.2/bits/vector.tcc:409 (exe+0x000000266d0e)
    #7 void std::vector<std::thread, std::allocator<std::thread> >::emplace_back<void (duke::LoadedImageCache::*)(), duke::LoadedImageCache*>(void (duke::LoadedImageCache::*&&)(), duke::LoadedImageCache*&&) /usr/lib64/gcc/x86_64-unknown-linux-gnu/4.8.2/../../../../include/c++/4.8.2/bits/vector.tcc:101 (exe+0x00000024f497)
    #8 duke::LoadedImageCache::startWorkers() /home/clitte/git/duke/src/duke/engine/cache/LoadedImageCache.cpp:93 (exe+0x00000024ba46)
    #9 duke::LoadedImageCache::load(duke::Timeline const&) /home/clitte/git/duke/src/duke/engine/cache/LoadedImageCache.cpp:32 (exe+0x00000024bbbb)
    #10 duke::LoadedTextureCache::load(duke::Timeline const&) /home/clitte/git/duke/src/duke/engine/cache/LoadedTextureCache.cpp:20 (exe+0x00000020d995)
    #11 duke::Player::load(duke::Timeline const&, FrameDuration const&) /home/clitte/git/duke/src/duke/engine/Player.cpp:11 (exe+0x000000277019)
    #12 duke::DukeMainWindow::load(duke::Timeline const&, FrameDuration const&, duke::FitMode, int) /home/clitte/git/duke/src/duke/engine/DukeMainWindow.cpp:64 (exe+0x0000001e9269)
    #13 DukeApplication /home/clitte/git/duke/src/duke/engine/DukeApplication.cpp:136 (exe+0x0000001be2ef)
    #14 main /home/clitte/git/duke/src/duke/main.cpp:37 (exe+0x000000168d34)

SUMMARY: ThreadSanitizer: data race ??:0 __tsan_atomic32_exchange

gchatelet commented 11 years ago

I tried several configurations within ustring.cpp

Use rw spin locks : fails (default)
Use spin locks : fails
Use regular mutex : succeeds

So it looks like an issue with the spin lock implementation/usage.

gchatelet commented 11 years ago

Regarding the spin_lock implementation in thread.h, as pointed out by this quite old Intel's article the volatile keyword is basically useless for multithreading purpose and "will simply cause your code to run a lot slower".

Then I think the read in the lock function should be atomic too :

while (! OIIO_UNLIKELY(try_lock())) {
    do {
        backoff();
    } while (m_locked); // <- here, should be __sync_fetch_and_add(&m_locked, 0)

Removing the volatile keyword and using an atomic read in lock seems to fix the issue.

lgritz commented 11 years ago

I'm confused by this, maybe you can clarify.

Sure, I know that volatile does not mean atomic. But it does mean "could change any time", and so is a hint to the optimizer that is should be read every time it's referenced, not assumed to never change. Am I wrong about that? Is that not what you'd want in a loop where the value is a flag? If the volatile truly doesn't help, we can remove it, but I don't see how it actually hurts in the case of something is intended to be used as storage for an atomic.

And I don't see why the 'while (m_locked)' needs to be a __sync_fetch_and_add (essentially, an atomic read). Why should that be an atomic read? The write is atomic. This read is just a gatekeeper; even when it succeeds, it will proceed to the try_lock() which is a true atomic grab. My recollection is that an ordinary access here was substantially faster than an atomic read.

Am I totally off-base here?

lgritz commented 11 years ago

In fact, here's the complete code snippet in question, with the comments that explain the rationale:

    // Try to get ownership of the lock. Though experimentation, we
    // found that OIIO_UNLIKELY makes this just a bit faster on 
    // gcc x86/x86_64 systems.
    while (! OIIO_UNLIKELY(try_lock())) {
        do {
            backoff();
        } while (m_locked);

        // The full try_lock() involves a compare_and_swap, which
        // writes memory, and that will lock the bus.  But a normal
        // read of m_locked will let us spin until the value
        // changes, without locking the bus. So it's faster to
        // check in this manner until the mutex appears to be free.
    }

So the idea here is that the fetch_and_add that you recommend is inherently a write operation, which will lock the bus, slowing down the loop if there are multiple threads spinning on the same lock. But we're not trying to write here at all! We just want to know when it's 0 (after which we'll do a true atomic try_lock), so for those reads in the tight loop, we want specifically to avoid the write and bus lock.

Help me understand how this is wrong.

gchatelet commented 11 years ago

Hi Larry,

Sorry for the lag, was very busy, and out of the country for a few days.

So I'm not an expert but my understanding of it is that volatile basically says : read from main memory anytime (or whatever is memory mapped to this address) but it doesn't perform a cache sync nor guarantees that operations are done in order.

On modern processors each core has its own L1 cache associated to it and they usually share the L3 cache. Memory barriers are here to say how the core caches should be synchronized. volatile here doesn't help because reading from main memory does not enforce local caches to be synchronized and decrease performance a lot because it forces a read over the bus.

Here are a bunch of links I included in a talk I did last year (relevant slide is starting at 26) :

Understanding Violations of Sequential Consistency
Concurrency in the D Programming Language (especially page 2 : A Brief History of Data Sharing )

You should also read C++ concurrency in Action from Anthony Williams (the author of the thread library included in C++11) that digs into the details.

Also this talk from Bill Dally (NVidia) is pretty enlightening - you'll need to install silverlight though.

I agree that a write is not necessary here but you have to use a memory barrier to do the read (I don't know the gcc intrinsics enough to pick the correct one so picked the first one that could do)

Hope that helps !

lgritz commented 11 years ago

Thanks for the links to your talk, I look forward into diving into that in more detail.

I understand that volatile gives no protection and does not imply atomicity. No worries, I always knew that and am not using volatile expecting any protection. I'm strictly using it to indicate to the compiler that I do indeed expect the value to potentially change out from under it at any time (in this case, because another process may write to it). Maybe this use of volatile is superfluous and adds nothing, but given the way I use the atomics, I don't see how it can be wrong or hurt performance.

I'm still confused about the specifics of how this loop (the one we both quoted above) can go wrong. All the non-atomic check in the inner loop is doing is throttling the rate at which we do the atomic (writing / bus locking) check in the outer loop. The outer loop won't be exited until it has a proper atomic lock, no matter how botched the inner check may be. It can return the wrong result, and it still won't matter -- the worst that can happen is that it will either spin a couple more times (non-atomically/non-bus-lockingly) before escaping to the outer loop, or that it will exit the inner loop too soon, in which case it will still get caught properly by the outer (atomic) check.

Empirically, I've found the many-core contention to be much higher performance with the loop this way (some of the processors doing read-only non-atomic checks if they don't succeed the first time with the atomic check). So what I'm looking for, if you are arguing to returning to the lower-performance fully-atomic alternative, is a clear explanation for how my code can get the wrong answer. I still don't quite see how the lock is wrong.

lgritz commented 11 years ago

In other words, I claim that thread sanitizer is giving a false positive here, and there is no bug. Please convince me otherwise!

gchatelet commented 11 years ago

Just re-read everything thoroughly. A few points :

I agree with your reasoning and understand the rationale now. This volatile read simply makes the lock spin slower (4 times slower according to the timings here) so the behavior is nicer with a lot of contention but latency is also increased even if only two threads compete for the lock. Specifying volatile here should prevent compiler from caching the value.
Then ThreadSanitizer does not actually complains about the particular line with the volatile read. It complains about the write here

// Fastest way to do it is with a store with "release" semantics
__asm__ __volatile__("": : :"memory");
m_locked = 0;

which conflict with the atomic_compare_and_exchange in lock(). So either ThreadSanitizer does not understand the asm statement, or the asm statement semantic is not compatible with the compare and exchange semantic. TSan works by instrumenting the code during the compilation through Clang so the full semantic should be available. I will open an issue and link here so TSan guys can have a look and confirm it's a false positive or not.

lgritz commented 11 years ago

I have unit tests that are capable of benchmarking these locks under a variety of loads. I'll remove the volatile keyword and see what happens. I'll report back with the results when I get a chance to do this.

gchatelet commented 11 years ago

I've created this issue

gchatelet commented 11 years ago

According to comment on the issue I think we should get rid of volatile and use correct intrinsic instead of hacky x86 specific asm. This would also make the code portable to other platforms.

lgritz commented 11 years ago

I ran some benchmarks (Linux, on a system with 12-core + hyperthread):

        (old)      no           lock with        lock with no   unlock with
threads volatile   volatile  sync_fetch_and_add  inner loop     sync_lock_release
------- ----------
 1        0.4       0.4           0.4             0.4             0.4  
 2        0.5       0.6           0.6             0.7             0.6  
 4        1.7       1.5           1.6             1.6             1.7  
 8        3.1       3.6           6.3             6.1             3.4  
12        4.7       4.7          10.0             9.2             4.7  
16        4.4       4.5           8.8             8.2             4.6  
20        4.4       4.4           5.6             8.0             4.1  
24        4.1       4.1           6.1             6.6             4.1

The timings have an inherent error of ~0.1s that is hard to eliminate. So don't worry too much about very small differences unless they are consistent across thread counts. These are the "best of 5 runs".

The "(old) volatile" is the original code. Removing the volatile marker for the spinlock data member seems to have little or no negative performance impact (within measurement ability). So if you guys really think volatile is bad, I am ok with removing that keyword. But it certainly does not improve performance to do so (confirming my expectations).

The next column additionally alters spinlock::try_lock by replacing the inner loop's non-atomic check with a __sync_fetch_and_add, as you suggested in one of your earlier comments. As I found before, this has a very large performance penalty when >=8 threads are involved (it doesn't seem to matter for <= 4 threads). So we definitely don't want to do that unless somebody can really convince me that the code is wrong the current way, and I'm currently fairly sure it's correct.

The next column eliminates the "inner loop" entirely. We can see that it is still much slower then the old code -- having that second loop that spins without locking the bus really does serve a purpose.

The final column tries changing spinlock::unlock to use __sync_lock_release instead of my previous attempt of having a memory barrier and then an ordinary assignment. (In this test, I still had the 'volatile' removed, but put the try_lock() back to its original form.) These should be equivalent, though there's a comment in the code that indicates that I once did a comparison and thought there was an improvement with the barrier-and-assign technique. But my current timings show them to be roughly equivalent.

Bottom line:

There's no evidence that the current code is wrong. I believe that tsan is incorrect in this case because it's misunderstanding the intent of the code. My unit test not only does the timing, it also checks the results, and they are correct; if there were a race condition, it would be failing the assertion.
Removing 'volatile' and changing unlock() to use __sync_lock_release do not seem to hurt, so I'm willing to do them if that prevents tsan from having false positives.
I will not change try_lock in the manner originally suggested, because there is a very large performance gain with the non-bus-locking inner loop trick I'm using.

gchatelet commented 11 years ago

Thank you for the benchmarking this is very interesting.

A few comments :

Sorry if I wasn't clear but that's for sure that __sync_fetch_and_add is not the way to go for performance, it was just a test to see if a memory barrier would prevent TSan from complaining. Of course there is no need for a write I was just too lazy trying to find the correct intrinsic function, I was not recommending this code by any means.
From the comment on the TSan issue, it is clear that TSan only understand standard atomic semantic and so I guess it missed the asm statement.
Having tests pass on your machine does not mean the code is correct, a different compiler or a different CPU could trigger the data race. For instance, nothing prevents the compiler from reordering instructions, let's say you wrote :

__asm__ __volatile__("": : :"memory");
valueA = 0;
++valueB;

It is perfectly legal for the compiler to compile it as

__asm__ __volatile__("": : :"memory");
++valueB;
valueA = 0;

As the outcome is the same in the sequential world - no interaction between valueA and valueB. In that case wouldn't the asm semantic be applied to the wrong statement ? I think that's why we should use what's been defined by the standard and not write asm : the semantic is not the same.

I will also change the code on my side to use __sync_lock_release instead of the asm statement and check whether TSan still complains ( it shouldn't ). I'll post the result on this thread.

gchatelet commented 11 years ago

Just re-read my previous post and yes I did mention // <- here, should be __sync_fetch_and_add(&m_locked, 0), this is not what I meant, sorry for the confusion.

lgritz commented 11 years ago

I'm not claiming that because my tests don't fail on one machine / compiler / test, it's correct.

However, I am claiming that because my analysis of the code indicates that it's right, and it passes the unit tests on several machines with different processor revisions, OS's, and compilers, and it has worked flawlessly in the middle of several large software systems at production scale for months or years with no reported errors, the circumstantial evidence is strong that it is not flawed.

You are quite right that the compiler is free to reorder instructions -- but I believe that the way this code is used, it's fine for that to happen, as long as it doesn't move the value=0 earlier, which it won't because that's exactly what the memory barrier prevents. Moving the value=0 (which serves to release the spin lock) later can't possibly cause incorrect results; the worst it can do is cause another thread to spin longer before it can acquire the lock. And we know from the benchmarks that it's not hurting performance in this manner.

So in this particular case, the code is correct.

If switching this asm barrier + assignment to a __sync_lock_release makes it so that tsan is no longer confused, I am willing to do so, since the benchmarks don't show that to be a significant performance penalty.

gchatelet commented 10 years ago

Did some more tests on my side

Here is the first patch I tried

diff --git a/src/include/thread.h b/src/include/thread.h
index e389ebb..12270b6 100644
--- a/src/include/thread.h
+++ b/src/include/thread.h
@@ -549,8 +549,7 @@ public:
     void unlock () {
 #if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
         // Fastest way to do it is with a store with "release" semantics
-        __asm__ __volatile__("": : :"memory");
-        m_locked = 0;
+        __sync_lock_release (&m_locked);
         // N.B. GCC gives us an intrinsic that is even better, an atomic
         // assignment of 0 with "release" barrier semantics:
         //  __sync_lock_release (&m_locked);
@@ -595,7 +594,7 @@ public:
     };

 private:
-    volatile int m_locked;  ///< Atomic counter is zero if nobody holds the lock
+    int m_locked;  ///< Atomic counter is zero if nobody holds the lock
 };

This lead to 5 warnings from TSan, mostly related to some other volatile in the same file.

Patching again :

diff --git a/src/include/thread.h b/src/include/thread.h
index e389ebb..12270b6 100644
--- a/src/include/thread.h
+++ b/src/include/thread.h
@@ -651,7 +650,7 @@ public:
         // Spin until the last reader is done, at which point we will be
         // the sole owners and nobody else (reader or writer) can acquire
         // the resource until we release it.
-        while (*(volatile int *)&m_readers > 0)
+        while (m_readers > 0)
                 ;
     }

The volatile int cast actually removes the guarantees of atomic_int and reverts it to basic int.

Then TSan still reports one warning :

  Atomic write of size 4 at 0x7fbf96297dc0 by thread T2:
    #0 __tsan_atomic32_store ??:0 (exe+0x000000136918)
    #1 OpenImageIO::v1_3::spin_mutex::unlock() /home/clitte/git/oiio/src/include/thread.h:549 
    #2 OpenImageIO::v1_3::spin_rw_mutex::write_unlock() /home/clitte/git/oiio/src/include/thread.h:661
  Previous read of size 4 at 0x7fbf96297dc0 by thread T1:
    #0 OpenImageIO::v1_3::spin_mutex::lock() /home/clitte/git/oiio/src/include/thread.h:537 
    #1 OpenImageIO::v1_3::spin_rw_mutex::write_lock() /home/clitte/git/oiio/src/include/thread.h:649

This is again because m_locked is written atomically and then read without atomic guarantees. As you said this is probably OK.

I'm just curious, could you benchmark the following implementation ?

class spin_mutex {
    bool flag;
public:
    spin_mutex() : flag(false) {}
    void lock() { while( __atomic_test_and_set(&flag, __ATOMIC_ACQUIRE)); }
    bool try_lock () { return !__atomic_test_and_set(&flag, __ATOMIC_ACQUIRE); }
    void unlock() { __atomic_clear(&flag, __ATOMIC_RELEASE); }
};

This is an adaptation of the spinlock code used in C++ concurrency in Action (basically replacing C++11 with gcc intrinsics and adding try_lock). With this implementation and the previous patches TSan is happy.

I guess the benchmark spawns X threads which are all fighting for the same lock. I doubt this situation actually happens in real life and if it does this should be pretty rare. Did you measure a significant difference in real application execution time between the naive mutex lock and the spin lock ? My point here is that, by optimizing for the benchmark, you could be building a somewhat complex slow spinning lock that would consume as much CPU than a simple mutex lock without real benefits. I thought about that while reading boost spinlock implementation's discussion and noticed you ended up implementing this backoff system to minimize contention

lgritz commented 10 years ago

Your patch 1 is exactly what I did for the "lock with sync_fetch_and_add" column in my chart.
But the intent of that code is indeed to cast away the atomic, in an effort to spin with true reads only. This is higher performance than doing an atomic read here, because the atomic read will lock the bus. It's safe for the same reason as the other spot we mentioned: the worst that can happen is that it will take slightly longer to notice that the count has gone to zero, but once it thinks it's zero it can never be wrong because nobody else can increment it again until m_locked is released (see the implementation of read_lock()).
Yes, I'll test it when I get the chance. (May not be until Monday.) But my guess is that the lock() implementation is going to be low performance, because it spins too tightly with an operation that locks the memory bus. (Incidentally, I think those _atomic* functions are C11, and require newer compiler than many of our users have.)
It's hard to come up with the "right" benchmark, so I use a combination of a "highest possible contention" artificial benchmark, and timings of real jobs running (with less contention than everybody spinning for access to a single word). Even in the real world tests, some of these resulted in measurable differences in performance. I'm not purposely picking an abstruse method. And there's a big performance difference between any of these spin mutex implementations and a true blocking mutex. You only want a blocking mutex if the number of ops inside the critical section is much more expensive than acquiring the lock itself.

lgritz commented 10 years ago

testtex also has a mode that tests the ImageCache under a variety of loads, some of which are good matches for real world. I'll test with that as well.

gchatelet commented 10 years ago

That's what I wanted to check, thanks.
Oh yes of course. My take on it is that it is undefined behaviour from a standard perspective, but yes it happens to work on x86. I'm not sure it would on ARM for instance and then the compiler can't warn you because it doesn't understand what mean.
If it is lower performance (which is expected I agree), you can still change __ATOMIC_ACQUIRE to __ATOMIC_RELAXED which should be the same as the bare read except you'd express the semantic more clearly to both the user and the compiler. The atomic_XXX functions are available from GCC 4.7 only, before one should use the __sync_XXX functions. From GCC's documentation the closest functions seems to be `sync_lock_test_and_setand__sync_lock_release` which provide acquire / release (ie. no full barrier). To my knowledge you can't specify the relaxed read semantic with GCC's intrinsic prior 4.7. So I concede the current code is the best you can do with pre GCC 4.7, and indeed the C++ memory model is not defined prior C++11.
I agree on the difficulty of coming up with the right benchmark, I just wanted to know if it was noticeable. TSan revealed the "bug" from within ustring which doesn't look like subject to heavy contention. It occurs only at startup time when creating some immutable constants. Here the use of spinlocks looks rather overkill. Spinlock might indeed enhance performance in other use cases. I'm looking forward to reading your results.

ChristopheLorenz commented 10 years ago

What you are trying to achieve is called double checked locking. And it is bad design.

The very first reason it is bad design is because your code has an unexpected behaviour. Although it works (today), compiler or hardware developers expect you to code following standard patterns that are inherently safe. Based on intrinsically correct code, they can design and implement optimizations. But if you don't play by the book, what works today might fail tomorrow. You also haven't proved that the application works. But merely that your tests didn't failed N times. Concurrency issues are extremely hard to impossible to diagnose by testing. That is why so many tools exist to validate a design by coding patterns, and not by testing.

I have a multi-threaded io library that worked for 10 years and rendered something like 200.000.000 frames in production without a glitch using between 4 and 8 cores. Then came a machine with a I7 8 cores&HT processor under Win7 and the lib started to fail by swapping 2 frames randomly around every 100 frames. Once I used more strict "by the book" synchronization, it worked as before.

The advent of out of order execution broke many software that relied on ordered execution for concurrent access. Good synchronisation practices would have avoided the issues. In today's CPUs, it is normal behaviour to have multiple values at the same memory location, because caches are not in sync by design. Concurrent access can only have guaranteed behaviour if you use the hardware designed for that purpose. Using other "tricks" will only weaken your design, and more problematic, make the issues even harder to reproduce and diagnose. No one knows what other trick compilers and hardware designers will use in their next version.

Now it doesn't necessary mean it is bad. It can be totally legitimate to implement it if the gain is significant and you can validate it works "good enough" on the intended platforms you target. By good enough, I mean it statistically won't fail enough to be an issue. Still, I wouldn't take the bet for a nuclear plan.

But you have to accept the fact that it will probably fail tomorrow and that no one will know why. My best advice is to implement the optimization as a compiling option, so people can at least : -know there could be some issue with that code -have a way to diagnose it, or make sure another problem is not related to that hack -fix it by changing a flag.

Later on you can play with the flag and see if it really has a significant impact on performance. If it can't even be measured, disable it by default.

But using it as a common coding practice is like attaching a piece of rock to your leg, claiming there's no water around.

Chris.

ChristopheLorenz commented 10 years ago

Regarding the volatile keyword, it is probably useless... or not... C and C++ have no notion of threads, so it is at the discretion of the compiler. It merely tells the compiler "re-read the data on each access". But it doesn't mean two threads will actually read the same data, just that the instructions set generated will not use a previously read value stored in a cpu register. This is important for example to access hardware registers that have nothing to do with multi-threading. The value read will be coherent to the core executing the instruction sequence, but it might actually come from a cache and not the actual memory. Behaviour between different threads is not defined for volatile. On the other end, using a lock has a defined and guaranteed multi-thread behaviour, regardless of the fact that you indicated the lock is volatile or not. But the implementing library might have to declare the lock volatile for correct behaviour or maybe it doesn't make sense because most locking implementations are done in assembly, where volatile has no meaning.

lgritz commented 10 years ago

I hope it's clear from the above discussion that the 'volatile' was never intended to enforce any thread safety, but for its intended meaning -- to ensure that every time it's referenced, it is read anew. Is that not what you'd want when the sole purpose of the variable is to be modified by another thread and you want to notice as soon as possible?

gchatelet commented 10 years ago

@ChristopheLorenz : starting from C++11 the memory model is completely defined : threads, mutex, happens-before relationship, strong/weak/relaxed synchronization. Before you need to rely on libraries or compiler intrinsic, language would guarantee nothing. @lgritz : doing some research on the subject, I recovered this article CPU Cache Flushing Fallacy which explains the details of CPU cache synchronization on x86. For what I read, the C volatile keyword semantic is do not cache within register but read it from the cache subsystem and not inevitably from main memory as I thought. It is also stated that cache system is guaranteed to be coherent. So if what's in this article is true, then the implementation is correct for x86 as it is designed today. Although from a language point of view this is not enforced.

lgritz commented 10 years ago

@ChristopheLorenz : thanks for the insightful comments about double-checked locking. I went and read a bunch of references about this, learned a lot, and in this case am still convinced that the code is correct because it is contained in an outer loop that uses true atomics. In other words, in this particular case, the actual work is done by atomics, and the non-atomic check is only used to throttle the rate at which the atomic tests are run and lock the bus in time of heavy contention. Incorrectly exiting the inner loop early, running the inner loop too long, or removing the inner loop entirely -- none of these can make incorrect behavior, because the inner loop doesn't modify anything, and when it exists it hits the outer, atomic, loop. The only thing that is affected is performance, and empirically the trick I'm doing gives much higher performance. I already cited the numbers for an artificial stress test, and I will rerun the benchmarks tomorrow and report the difference for a test representative of real workloads.

That said, I believe I have other, unrelated DCLP instances in my code that probably do it in ways that aren't guaranteed to be safe (at least for future platforms). Thanks for the tip, I will fix those.

I think it's a reasonable approach to make using the "faster but possibly unsafe in the future" code to be controlled by a build-time option. The default can be "guaranteed safe and tsan understands it", but I know that my site will choose the faster option, even at the risk that some hardware we don't currently use may have a rare failure case. Like you said, this isn't running a nuclear plant. Empirically, the failure rate is either nonexistent or extremely low (we haven't seen a threading bug for a long, long time), but the value of having a few percent higher texture performance is very high.

Like I said, I'll rerun the benchmarks and report back. If I can't demonstrate a speed gain at all for the real tests, then maybe the option is not necessary. (Though my recollection is that the code is in its current state because it had higher real-world performance.)

gchatelet commented 10 years ago

Double checked locking can work by making use of thread local storage as illustrated in this post. But classical implementations are indeed plain wrong.

lgritz commented 10 years ago

Double checked locking also works just fine if you put the right kind of memory fence in the right place. The criticism about DCLP is that prior to C++11, there was no standard portable C or C++ method for it. But C++11 or the memory fence intrinsics of gcc or that other compilers provide should be fine.

lgritz commented 10 years ago

http://msdn.microsoft.com/en-us/library/windows/desktop/ms686355(v=vs.85).aspx

"With Visual Studio 2005, the compiler uses acquire semantics for read operations on volatile variables and release semantics for write operations on volatile variables (when supported by the CPU)."

If I understand this correctly, it indicates that there may in fact be a purpose to the original 'volatile' keyword, at least on that platform. Though it cannot be counted on to do so for all platforms.

lgritz commented 10 years ago

I was looking at that code again, with the intent of adding a build-time switch to turn off the inner loop that throttles the bus locking... and... I'm having a hard time bringing myself to do it as the default. I understand that non-atomic operations for shared variables are in general wrong, and that DCLP can be dangerous if you don't have the right fences, but again I come back to this case being safe because there's still an atomic fence around the critical operation, the non-atomic check only serves to introduce delay between the atomic checks.

The only convincing argument I've heard so far is tsan is (incorrectly) confused about this trick, and tsan is sufficiently useful for other things that we should have an option where we use a (slower) construct that tsan will not misunderstand.

My current plan is to use the real-world benchmark to decide whether the default choice should be fast, or tsan-friendly.

gchatelet commented 10 years ago

But C++11 or the memory fence intrinsics of gcc or that other compilers provide should be fine.

Yes or by using std::call_once

"With Visual Studio 2005, the compiler uses acquire semantics for read operations on volatile variables and release semantics for write operations on volatile variables (when supported by the CPU)." If I understand this correctly, it indicates that there may in fact be a purpose to the original 'volatile' keyword, at least on that platform.

This is interesting. That's not my understanding though. If Microsoft decided to add acquire/release semantic to volatile that's fine but that is not what the standard states so it is not on purpose. Just implementation details. Java on the contrary specifies that volatile should be synchronized. Also this means that on Windows the SpinLock implementation with acquire/release semantic should be as fast as the volatile version. Would be nice to see your benchmark on windows too. In theory it should even be faster because no full barrier would be used.

My current plan is to use the real-world benchmark to decide whether the default choice should be fast, or tsan-friendly.

SGTM. I'm looking forward to looking at the results. Also this thread proves that correctness of this code is hard to check.

Regarding the Tsan issue, my original intend was to check my code with it, not oiio, and since it looks correct I won't complain but some other people might end up filling a bug here again for the very same reasons.

lgritz commented 10 years ago

OK, I ran full comparisons between the original code ("baseline"), the version where I replaced the asm with __sync_lock_release ("new"), and a version that additionally removes the DCLP as a build-time option.

There's a spinlock high-contention test, a spin_rw_lock high-contention test, and a real-world test involving the TextureSystem/ImageCache under an artificial stress that's designed to mimic real world conditions.

Timings are very hard to reproduce, so ignore small differences, especially when one configuration is slightly faster in some tests and slightly slower in others. But when you see big differences, or when one configuration is always slightly faster on all tests, then that's significant.

spinlock_test --threads 32 --trials 5 --wedge
threads baseline    new      new, no DCLP
 1     0.4       0.4       0.4
 2      0.5       0.6       0.6
 4      1.8       1.6       1.6
 8      3.7       3.2       5.8
12      4.1       4.9       9.4
16      4.6       4.5       8.1
20      4.6       3.4       7.6
24      3.4       4.1       5.7
28      3.7       4.0       6.7
32      4.0       4.0       6.8

spin_rw_test --threads 32 --trials 5 --wedge --rwratio 99
threads baseline    new      new, no DCLP
 1       0.4         0.4       0.4
 2       0.7         0.9       0.8
 4       2.6         2.8       2.9
 8       6.5         6.6       7.6
12       9.0         9.1      11.9
16       8.6         8.7      10.6
20       8.2         8.8       9.8
24       7.9         7.9       9.9
28       7.8         8.0       9.2
32       7.9         7.8       9.0

testtex --nodedup --threads 24 --wedge --threadtimes 7 tmp/grid??.tx
threads baseline    new       new, no DCLP
 1       4.79       4.51       4.60  
 2       4.63       4.57       4.93  
 4       4.61       4.79       4.82  
 6       4.57       4.68       4.84  
 8       4.83       4.79       4.88  
10       5.04       4.93       5.11  
12       5.61       5.07       5.24  
16       7.40       7.13       7.28  
20       8.80       8.64       9.36  
24      10.36      10.38      10.67

It seems clear to me that even in the "real world" test, there is a slight speed gain for the DCLP version. I'm still convinced that the code is correct, however I'm proposing a modification where a build time option OIIO_THREAD_ALLOW_DCLP=0 may be set that uses the more standard construct, suitable for when you are using Thread Sanitizer, or if you just don't trust our DCLP code. But the default is the fast version, which I still am convinced is correct.

Proposed modification in #744

gchatelet commented 10 years ago

Thanks a lot for the tests.

Is testtex still considered a stress test (on textures) or is it quite representative of the production ? Considering the noisy signals I would say the gain of DCLP for this particular test is <1%. For my particular application I think it would be indiscernible.

The DCLP has no benefits for 4 threads but exhibit significant gain starting at 8 threads. My guess here is that for 4 threads the scheduler will pack all the threads on the same core. When more threads are created they are scheduled on other cores/sockets and communication cost gets higher.

I would like to run some more tests on my side but I agree with the proposal.

lgritz commented 10 years ago

Real production varies a lot from scene to scene. This particular mode of testtex is designed to be representative of heavy production texture loads. It's certainly using the locks in their natural place in the TextureSystem, as opposed to spinlock_test which is just testing the bare lock in an unrealistically tight loop, so in that sense it's a more realistic test of how code changes might translate to performance differences in the midst of a real render.

When the threads of the process spill to multiple processors in the box (not just multiple cores on one proc), that's when the difference is more discernible (and generally, when our thread efficiency more prominently falls off a linear path). I think the machine I have is 2 procs x 6 cores each, so the numbers for >6 threads are where you can see a bigger difference.

For >= 8 threads, I'm seeing 2-3% performance increase. For a 2-line change in the source, that's a gain I'm happy to take. Scaled across our entire render farm (and/or our entire crew of people waiting for renders), that's equivalent to real money.

ChristopheLorenz commented 10 years ago

I don't know what is the exact purpose of that part of code ... but you might want to check the latest advances in concurrent access patterns. There's now an emerging "optimistic synchronisation" method that doesn't require actual heavy synchronisation. It seems to be quite serious topic, and not "just another guy thinking he can do better". It however can only be applied to simple stuff, a fifo is a good candidate. And as far as I can tell, only for single producer / single consumer type of access.

Search for "locking free fifo" for some ressoruces. All that move seems to have originated from a paper : http://www.research.ibm.com/people/m/michael/podc-1996.pdf A paper on an optimistic fifo : http://people.csail.mit.edu/edya/publications/OptimisticFIFOQueue-journal.pdf

anthonywilliams commented 10 years ago

I've come here after being mentioned in Guillaume's post on Google+: https://plus.google.com/u/0/+GuillaumeChatelet/posts/UUqXpPnsxcq?cfem=1

If you have C++11 atomics available, you really ought to use them: this tells the compiler exactly what you want, and gives it the best chance of optimizing whilst preserving the correct semantics. Under this scenario, the m_locked member of the spin_mutex class should be std::atomic<int> rather than volatile int. You can then put std::memory_order_release on the store in unlock, and std::memory_order_acquire on the compare-exchange in try_lock.

Any use of non-atomic variables for synchronization is strictly a data race and undefined behaviour, so you're relying on implementation-defined semantics, if any, and you should avoid this.

If you don't have C++11 atomics available, then you're stuck with implementation-defined semantics anyway.

With gcc you are probably best using the __sync_* functions, since these are compiler intrinsics, and again allow the compiler to optimize whilst preserving the semantics. If you are willing to write platform-specific assembler then you might be able to do better, but that's a lot of work to get right.

With MSVC you can use the Interlocked* functions along with _ReadWriteBarrier --- the Interlocked* functions give you the hardware memory barriers, and _ReadWriteBarrier tells the compiler to avoid reordering in the optimizer. Depending on your usage, you might also be able to get away with relying on the MSVC-specific semantics of volatile from MSVC 2005+. In the specific case of your spin_mutex, then declaring m_locked as volatile int, using InterlockedCompareExchange in try_lock and plain assignment in unlock along with appropriate uses of _ReadWriteBarrier should be OK.

lgritz commented 10 years ago

@anthonywilliams : We can't yet depend on C++11 being available. I wish we could, but our software is widely used on systems and at sites with older compilers, and are often incorporated into products (either directly, or as parts of customer plug-ins) that were built by older compilers and are hands are tied for compatibility reasons.

As we've said before, we are not depending on 'volatile' for any atomic semantics.

I believe we are already doing everything you suggest. We almost always use the gcc or Win32 intrinsics. There are only two places that we have assembly -- one to implement a "pause", and the other a memory barrier (and the latter is eliminated in the proposed patch #744).

The only real issue is that a particular construct we use -- which adds performance that we definitely want and appreciate -- uses a particular idiom that although we believe is correct, fools Thread Sanitizer into thinking that there is a data race. I'm fairly certain that there is not, but I put in a build-time switch that causes it to use a different idiom that is slightly slower but does not confuse tsan.

anthonywilliams commented 10 years ago

@lgritz I appreciate that C++11 isn't always available. For gcc, you could check the __cplusplus macro, which is defined to be 201103L if C++11 features are available, and use that as an #ifdef branch.

The race reported by TSAN is between the write to m_locked in unlock and the compare-exchange in try_lock. I agree that this should be safe on x86, due to the __asm statement you put there, but is definitely implementation-defined. I wonder if there is something you can use with TSAN to avoid the false positive without compromising your performance.

lgritz commented 10 years ago

Yes, in pull request #744, I have replaced the asm in unlock() with gcc intrinsic __sync_lock_release, with no apparent performance regression. Having another #if clause that uses C++11 functionality would merely accomplish the same thing with another syntax.

The actual remaining problem is that there is non-atomic, non-barriered read in lock() that is confusing tsan. There is a very specific purpose to this read, which is to throttle the rate at which the bus is locked by an atomic read or memory barrier, during the loop where it's waiting for another thread to release. The reason that it's correct is that this loop is just a delay until it's more likely to be free, and after it exits the no-bus-lock loop, it does a full atomic CAS (inside try_lock) to acquire the lock for real. This results in a very noticeable speed gain in times of heavy contention for the lock, up to 2x for an artificial stress test, and 2-3% in the middle of a real application. I'd still want this gain, even for users with C++11 available.

To compromise (I can't take a 3% speed hit in my app for the sake of tsan), I have provided a build-time switch (also in #744) that lets people who want to use tsan for other parts of their app compile OIIO in such a way that it they won't get this false-positive from tsan.

anthonywilliams commented 10 years ago

@lgritz Yes, using C++11 atomics just accomplishes the same thing with a different syntax, but it's also portable: the latest versions of MSVC support C++11 atomics, as does clang and other compilers will soon if they don't already. Also, if m_locked were a std::atomic<int> then you could use m_locked.load(std::memory_order_relaxed) in your backoff loop which would fix the TSAN issue with no performance cost.

Sorry I missed the read in the backoff loop. In a C++11 compiler this is technically a data race with the write in unlock, and thus undefined behaviour. On x86 it should be fine, since a read of a volatile variable has to be issued, and loads of aligned int-sized variables are atomic at the hardware level, but on other platforms you can't rely on this. Unfortunately there isn't a non-C++11 way of saying "relaxed load" with gcc __sync_* functions. On MSVC the special volatile semantics make it OK.

gchatelet commented 10 years ago

Thx for stepping in @anthonywilliams all those information are really appreciated. I'll try to take some time and provide a pull request to use c++11 when available. Would you be interested @lgritz ?

lgritz commented 10 years ago

Yes, I'd be very interested in a proposed patch that, if C++11 is available, uses std::atomic.

gchatelet commented 10 years ago

Linking here the proposed pull request

fpsunflower commented 7 years ago

Just pinging this thread again, because I'm still seeing warnings from thread sanitizer about our ustring code.

I suspect there is still something preventing it from understanding our atomic locks somehow?

AcademySoftwareFoundation / OpenImageIO

ThreadSanitizer says ustring.cpp:179 has a data race #720