martinetd opened this issue 2 years ago (status: Open)
operator!() is a null pointer check, so it won't be that simple. It's probably a BEESNOTE macro that references an Fd object which is assigned while the BEESNOTE macro is holding it and keeping it visible to the status thread, e.g.
thread 1:

```cpp
Fd foo;
BEESNOTE("Doing a thing with " << foo); // saves a functor that holds a ref to 'foo'
foo = do_some_stuff(); // modifies foo
```
Thread 2 is the bees status thread, which calls BEESNOTE's functor. BEESNOTE's status object creation and its visibility to the status thread are protected by a mutex, but access to the shared_ptr in the assignment to foo is not. If the status thread invokes the functor at the wrong time, operator!() will race with the assignment to foo and cause a crash. This bug has existed in several places in bees so far, and it's possible I haven't found all of them yet.
The functor is in the stack trace:
```
#2  0x000055820b495b99 in std::function<void (std::ostream&)>::operator()(std::ostream&) const (
    __args#0=..., this=0x7fb8721a7350) at /usr/include/c++/11/bits/std_function.h:590
```
Unfortunately gdb is giving us the very unhelpful definition of the std::function template, but we need to know the point in the code where that template is instantiated. If you can convince gdb to tell us which BEESNOTE instance this belongs to, it should be an easy fix.
> operator!() is a null pointer check, so it won't be that simple.
It's a null pointer check for an attribute of m_handle, not for m_handle itself -- I've been running with this patch for a few days:
```diff
diff --git a/lib/fd.cc b/lib/fd.cc
index 2f62567afdbc..f546d9bbcdb1 100644
--- a/lib/fd.cc
+++ b/lib/fd.cc
@@ -153,7 +153,7 @@ namespace crucible {
 	bool
 	Fd::operator!() const
 	{
-		return m_handle->get_fd() < 0;
+		return !m_handle || m_handle->get_fd() < 0;
 	}
 
 	shared_ptr<IOHandle>
```
> Unfortunately gdb is giving us the very unhelpful definition of the std::function template, but we need to know the point in the code where that template is instantiated. If you can convince gdb to tell us which BEESNOTE instance this belongs to, it should be an easy fix.
I don't have time right now, I'll have a look early next week -- thanks for the pointer!
That patch probably does make it much harder to lose the race and hit the bug, but the real fix is to find where the invalid access is. "m_handle is not null" is an object invariant, so it should never be NULL. Good hunting!
Took a bit longer than planned, sorry for the delay.
So, from what I can see, there's another thread calling the destructor of the BeesNote that get_status() is iterating over when the crash happens:
```
(gdb) bt
#0  futex_wait (private=0, expected=2, futex_word=0x55820b504a20 <BeesNote::s_mutex>)
    at ../sysdeps/nptl/futex-internal.h:146
#1  __lll_lock_wait (futex=futex@entry=0x55820b504a20 <BeesNote::s_mutex>, private=0)
    at lowlevellock.c:52
#2  0x00007fb873717fe3 in __GI___pthread_mutex_lock (mutex=0x55820b504a20 <BeesNote::s_mutex>)
    at ../nptl/pthread_mutex_lock.c:80
#3  0x000055820b494bd6 in __gthread_mutex_lock (__mutex=0x55820b504a20 <BeesNote::s_mutex>)
    at /usr/include/x86_64-linux-gnu/c++/11/bits/gthr-default.h:749
#4  std::mutex::lock (this=0x55820b504a20 <BeesNote::s_mutex>)
    at /usr/include/c++/11/bits/std_mutex.h:100
#5  std::unique_lock<std::mutex>::lock (this=<synthetic pointer>)
    at /usr/include/c++/11/bits/unique_lock.h:139
#6  std::unique_lock<std::mutex>::unique_lock (__m=..., this=<synthetic pointer>)
    at /usr/include/c++/11/bits/unique_lock.h:69
#7  BeesNote::~BeesNote (this=this@entry=0x7fb8721a7350, __in_chrg=<optimized out>)
    at bees-trace.cc:92
#8  0x000055820b45aac2 in BeesContext::scan_forward (this=<optimized out>, bfr_in=...)
    at bees-context.cc:795
#9  0x000055820b47f45a in operator() (__closure=<optimized out>)
    at /usr/include/c++/11/bits/shared_ptr_base.h:1295
#10 std::__invoke_impl<void, BeesRoots::crawl_batch(std::shared_ptr<BeesCrawl>)::<lambda()>&> (
    __f=...) at /usr/include/c++/11/bits/invoke.h:61
#11 std::__invoke_r<void, BeesRoots::crawl_batch(std::shared_ptr<BeesCrawl>)::<lambda()>&> (__fn=...)
    at /usr/include/c++/11/bits/invoke.h:154
#12 std::_Function_handler<void(), BeesRoots::crawl_batch(std::shared_ptr<BeesCrawl>)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/11/bits/std_function.h:290
#13 0x000055820b4a8a33 in std::function<void ()>::operator()() const (this=this@entry=0x7fb8721a76f0)
    at /usr/include/c++/11/bits/std_function.h:590
#14 crucible::catch_all(std::function<void ()> const&, std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)> const&) (f=..., explainer=...) at error.cc:55
#15 0x000055820b4c64cf in crucible::TaskState::exec (this=0x7fb8500bbf80) at task.cc:305
#16 0x000055820b4a8a33 in std::function<void ()>::operator()() const (this=this@entry=0x7fb8721a7980)
    at /usr/include/c++/11/bits/std_function.h:590
#17 crucible::catch_all(std::function<void ()> const&, std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)> const&) (f=..., explainer=...) at error.cc:55
#18 0x000055820b4c6c91 in crucible::TaskConsumer::consumer_thread (this=0x55820d052bd0) at task.cc:739
#19 0x00007fb8735c58d3 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#20 0x00007fb873715d80 in start_thread (arg=0x7fb8721a9640) at pthread_create.c:481
#21 0x00007fb8732ae76f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
```
It's waiting on BeesNote::s_mutex to remove itself from BeesNote::s_status (it's the same dump as in the original post; you can see 0x7fb8721a7350 as the parameter of the std::function).
This happens at the end of BeesContext::scan_forward, so it's just the normal destructors running at the end of the function; the problem would be that it decided to free/uninit the m_handle before removing the note from the list here.
(we can also see scan_forward in the std::function scope; the function called is this, whatever that means:

```
(gdb) p _M_invoker
$5 = (std::function<void(std::basic_ostream<char, std::char_traits<char> >&)>::_Invoker_type) 0x55820b449090 <std::_Function_handler<void(std::basic_ostream<char, std::char_traits<char> >&), BeesContext::scan_forward(const BeesFileRange&)::<lambda(std::ostream&)> >::_M_invoke(const std::_Any_data &, std::basic_ostream<char, std::char_traits<char> > &)>
```
anyway, it's just belatedly processing the << statements; << does !!bfr.m_fd, which calls operator!(), and this explodes)
Unfortunately this still doesn't tell us which note it is: there are a couple of << bfr* notes in scan_forward which could match, and I couldn't print the text itself. Either way, the bfrs are not explicitly freed before the end of the function, so we're still looking at an ordering problem: for some reason the BeesFileRange's m_fd got cleaned up before the note?
Looking at the scan_forward disassembly output, I see it calls delete on what I think is bfr_in just before calling the note destructor, so that would be a nifty candidate, but I'll admit I was too lazy to follow this properly, so I might have made a mistake.
Anyway, I honestly have no idea how to enforce the ordering (I'd have assumed C++ would be smart enough to notice we use the bfr in the note and therefore deallocate in the opposite order...). If the bfr has actually been destroyed, the memory is up for grabs, and my "fix" is, as you say, just as problematic (use after free).
Not quite sure how to proceed here, any idea?
Hello,
I'm getting segfaults in crucible::Fd::operator! on current master (fbf6b395c830edd7ade1d0042058e94531769962), running with --workaround-btrfs-send as I have snapshots I back up with btrfs send. I didn't try without the workaround, and I hadn't run bees before on these machines.

Here's bt and bt full with symbols:

Here's the last few lines of the log for this precise occurrence (verbose 8); it doesn't look very useful to me:

I didn't investigate further, but I can if you'd like me to; bees seems to crash after at most a few hours, so it shouldn't be too much trouble to add extra logs or anything if you want more info before fixing (just adding a null pointer check is probably trivial enough).
Thanks!