Ramki-Ravindran / thread-sanitizer

Automatically exported from code.google.com/p/thread-sanitizer
0 stars 0 forks source link

SEGV using external symbolizer #55

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
I get this occasionally when running my project's functional tests under TSAN.

What is the expected output? What do you see instead?

After several successfully printed backtraces, I will sometimes get the 
following segfault:

ESC[mThreadSanitizer WARNING: mutex unlock by another thread
ThreadSanitizer WARNING: mutex unlock by another thread
ThreadSanitizer WARNING: mutex unlock by another thread
ThreadSanitizer WARNING: mutex unlock by another thread
ThreadSanitizer WARNING: mutex unlock by another thread
ThreadSanitizer WARNING: mutex unlock by another thread
ThreadSanitizer WARNING: mutex unlock by another thread
==20880==WARNING: Can't read from symbolizer at fd 4
==20880==WARNING: Can't read from symbolizer at fd 4
==20880==WARNING: Can't read from symbolizer at fd 4
==20880==WARNING: Can't read from symbolizer at fd 4
==20880==WARNING: Can't read from symbolizer at fd 4
==20880==WARNING: Can't read from symbolizer at fd 4
==20880==WARNING: Failed to use and restart external symbolizer!
==20880==WARNING: Can't read from symbolizer at fd 4
*** Aborted at 1395709460 (unix time) try "date -d @1395709460" if you are 
using GNU date ***
PC: @     0x7f8e98e2c59e __sanitizer::Symbolizer::SendCommand()
*** SIGSEGV (@0x4010) received by PID 20880 (TID 0x7f8e8fed8880) from PID 
16400; stack trace: ***
    @     0x7f8e98e3b464 rtl_sigaction()
    @     0x7f8e962a0bb0 (unknown)
    @     0x7f8e98e2c59e __sanitizer::Symbolizer::SendCommand()
    @     0x7f8e98e2be83 __sanitizer::Symbolizer::SymbolizeCode()
    @     0x7f8e98e69ab8 __tsan::SymbolizeCode()
    @     0x7f8e98e6580d __tsan::SymbolizeStack()
    @     0x7f8e98e677b2 __tsan::PrintCurrentStack()
    @     0x7f8e98e64ad4 __tsan::MutexUnlock()
    @     0x7f8e98e34038 pthread_mutex_unlock
    @     0x7f8e99037aed boost::timed_mutex::unlock()

What version of the product are you using? On what operating system?

Using the TSAN built into llvm/clang 3.4:

$ clang -v
Ubuntu clang version 3.4-1ubuntu1 (trunk) (based on LLVM 3.4)
Target: x86_64-pc-linux-gnu
Thread model: posix

Please provide any additional information below.

The test is fairly concurrent, and currently triggers a lot of "mutex unlock by 
another thread" errors. My guess is that the error reporting mechanism has a 
thread safety issue.

Original issue reported on code.google.com by tlip...@gmail.com on 25 Mar 2014 at 1:10

GoogleCodeExporter commented 9 years ago
FWIW I never seem to get this using the addr2line symbolizer (though the 
addr2line one often hangs, per my bug filed at 
http://llvm.org/bugs/show_bug.cgi?id=18379)

Original comment by tlip...@gmail.com on 25 Mar 2014 at 1:11

GoogleCodeExporter commented 9 years ago
Attaching a repro test case. With the external symbolizer this segfaults 
reliably on my machine. (requires gtest and boost, but could just as easily do 
without)

Original comment by tlip...@gmail.com on 25 Mar 2014 at 1:29

Attachments:

GoogleCodeExporter commented 9 years ago
Seems like this is actually the same root cause as issue #45 -- the mutex bugs 
aren't reported using the normal ReportRace() call and those don't take 
CommonSanitizerReportMutex.Lock. This means you can't suppress them and also 
that multiple bugs may be reported concurrently and screw up the state of the 
symbolizer.

Original comment by tlip...@gmail.com on 25 Mar 2014 at 1:42

GoogleCodeExporter commented 9 years ago

Original comment by konstant...@gmail.com on 25 Mar 2014 at 6:54

GoogleCodeExporter commented 9 years ago
Thanks for the report and the suggestion in #3!
It' looks like a likely cause of the crash.
Meanwhile, if you fix the "mutex unlock by another thread", does it fix the 
crash?

Original comment by dvyu...@google.com on 25 Mar 2014 at 1:53

GoogleCodeExporter commented 9 years ago
Haven't tried that, since actually fixing the problem is somewhat difficult in 
our application. Our rwlock implementation is actually based around CAS 
spinning, so we know it is actually safe to be used in this unconventional 
manner. So, if what you're asking is whether the crash is caused by TSAN or by 
the code itself, I'm certain it's a TSAN issue. (we run the same tests 
constantly with ASAN and never had a problem).

I do have a partial fix for our application which reduces the number of errors 
reported by TSAN, and I can confirm that with that fix, the TSAN runs fail 
significantly less often.

Original comment by tlip...@gmail.com on 25 Mar 2014 at 7:56

GoogleCodeExporter commented 9 years ago
>Our rwlock implementation is actually based around CAS spinning, so we know it 
is actually safe to be used in this unconventional manner.

Why does tsan think that this is a mutex? Have you manually annotated it as 
mutex using tsan annotations? If so, it's incorrect. Since it's a semaphore. If 
you are using atomic operations than tsan must automatically understand the 
synchronization and you don't need annotations.

Original comment by dvyu...@google.com on 25 Mar 2014 at 7:58

GoogleCodeExporter commented 9 years ago
> Why does tsan think that this is a mutex? Have you manually annotated it as 
mutex using tsan annotations? If so, it's incorrect. Since it's a semaphore. If 
you are using atomic operations than tsan must automatically understand the 
synchronization and you don't need annotations

That's a fair point. In some parts of the code it's used as a mutex, and in 
other parts of the code it's used as a semaphore (since we know it's 
implementation is safe to be used that way). Given that, I had annotated it as 
a RWLOCK. Perhaps it's worth duplicating the code and only annotating one of 
the two implementations, per your point.

Nevertheless, this is still a TSAN bug, right? :)

Original comment by tlip...@gmail.com on 25 Mar 2014 at 8:01

GoogleCodeExporter commented 9 years ago
Yes, there are definitely something to fix in tsan.

I am just trying to find a workaround, so that you can continue using tsan.

Original comment by dvyu...@google.com on 25 Mar 2014 at 8:05

GoogleCodeExporter commented 9 years ago
I attempted to duplicate the implementation and only annotate one as an RWLOCK. 
This gets rid of the mutex related warnings, but now I get data races.

Any advice for debugging? Setting force_seq_cst_atomic=1 got rid of the 
warning, but that implies I've got something going wrong. After looking at the 
barriers for an hour or so and even going so far as adding superfluous 
MemoryBarrier() instructions, I've still come up empty.

I've attached my implementation (simplified a bit) and the test which generates 
a TSAN error:
WARNING: ThreadSanitizer: data race (pid=1971)
  Write of size 4 at 0x7fffe1dada60 by thread T11:
    #0 Writer(SharedState*) /home/todd/git/kudu/src/util/rw_semaphore-test.cc:24 (exe+0x0000000a00ca)
    #1 void boost::_bi::list1<boost::_bi::value<SharedState*> >::operator()<void (*)(SharedState*), boost::_bi::list0>(boost::_bi::type<void>, void (*&)(SharedState*), boost::_bi::list0&, int) /usr/include/boost/bind/bind.hpp:253 (exe+0x0000000a40d7)
    #2 boost::_bi::bind_t<void, void (*)(SharedState*), boost::_bi::list1<boost::_bi::value<SharedState*> > >::operator()() /usr/include/boost/bind/bind_template.hpp:20 (exe+0x0000000a4061)
    #3 boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(SharedState*), boost::_bi::list1<boost::_bi::value<SharedState*> > > >::run() /usr/include/boost/thread/detail/thread.hpp:117 (exe+0x0000000a3290)
    #4 boost::this_thread::interruption_requested() ??:0 (libboost_thread.so.1.54.0+0x00000000ba49)

  Previous read of size 4 at 0x7fffe1dada60 by thread T8:
    #0 Reader(SharedState*) /home/todd/git/kudu/src/util/rw_semaphore-test.cc:34 (exe+0x0000000a0144)
    #1 void boost::_bi::list1<boost::_bi::value<SharedState*> >::operator()<void (*)(SharedState*), boost::_bi::list0>(boost::_bi::type<void>, void (*&)(SharedState*), boost::_bi::list0&, int) /usr/include/boost/bind/bind.hpp:253 (exe+0x0000000a40d7)
    #2 boost::_bi::bind_t<void, void (*)(SharedState*), boost::_bi::list1<boost::_bi::value<SharedState*> > >::operator()() /usr/include/boost/bind/bind_template.hpp:20 (exe+0x0000000a4061)
    #3 boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(SharedState*), boost::_bi::list1<boost::_bi::value<SharedState*> > > >::run() /usr/include/boost/thread/detail/thread.hpp:117 (exe+0x0000000a3290)
    #4 boost::this_thread::interruption_requested() ??:0 (libboost_thread.so.1.54.0+0x00000000ba49)

I don't necessarily expect you to debug my code ;-) But this might make a good 
example for where manual annotations are needed beyond just using atomic ops, 
if indeed I'm not making a mistake here.

Original comment by tlip...@gmail.com on 26 Mar 2014 at 2:18

Attachments:

GoogleCodeExporter commented 9 years ago
This must be Acquire_Load:

  void WaitPendingReaders() {
    int loop_count = 0;
    while ((base::subtle::NoBarrier_Load(&state_) & kNumReadersMask) > 0) {
      boost::detail::yield(loop_count++);
    }
  }

because this function effectively denotes enter into the critical section, and 
it must synchronize with the previous readers.
With the current code, writers synchronize only with past readers, but not with 
readers that are currently in critical sections.

Consequently, lock() can use NoBarrier_CompareAndSwap.

Original comment by dvyu...@google.com on 26 Mar 2014 at 7:42

GoogleCodeExporter commented 9 years ago
Aha, that was it. Thanks, Dmitry! Appreciate the help.

Original comment by tlip...@gmail.com on 26 Mar 2014 at 5:08

GoogleCodeExporter commented 9 years ago
http://llvm.org/viewvc/llvm-project?view=revision&revision=207204
fixes "double lock of a mutex" report.

Original comment by dvyu...@google.com on 25 Apr 2014 at 7:50

GoogleCodeExporter commented 9 years ago
http://llvm.org/viewvc/llvm-project?view=revision&revision=207206
fixes "mutex unlock by another thread" report.

Original comment by dvyu...@google.com on 25 Apr 2014 at 8:02

GoogleCodeExporter commented 9 years ago
Other relevant commits: 207207, 207208, 207209, 207211.
I think I've fixed them all.

Original comment by dvyu...@google.com on 25 Apr 2014 at 9:23