SpartanJ / efsw

efsw is a C++ cross-platform file system watcher and notifier.
Other
645 stars 98 forks source link

[Bug] Crash while thread running after FileWatcherInotfiy destructed. #157

Open ashione opened 1 year ago

ashione commented 1 year ago

My main instruction likes

int main () {
FileWatcherInotify A = new FileWatcherInotify(...);
A.watch('...');
delete A;
return 0;
}

There are two threads cross running in process,

We found watcher thread crashed and this backtrace remained:

#4  0x00007f78d88248f3 in uw_update_context () from /usr/lib64/libgcc_s.so.1
#5  0x00007f78d88254ae in _Unwind_Backtrace () from /usr/lib64/libgcc_s.so.1
#6  0x00007f78d8759946 in backtrace () from /usr/lib64/libc.so.6
#7  0x00007f78da46f8cf in google::GetStackTrace(void**, int, int) () from /home/admin/.pyenv/versions/3.6.13/lib/python3.6/site-packages/ray/cpp/lib/libray_api.so
#8  0x00007f78da4707f9 in google::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) ()
   from /home/admin/.pyenv/versions/3.6.13/lib/python3.6/site-packages/ray/cpp/lib/libray_api.so
#9  <signal handler called>
#10 0x00007f7876d34f2d in efsw::Private::ThreadMemberFunc<efsw::FileWatcherInotify>::run (this=<error reading variable: Cannot access memory at address 0x27f7773ffe798>)
    at bazel-out/k8-dbg/bin/external/com_github_efsw/_virtual_includes/efsw/efsw/Thread.hpp:78

main thread stack :

#0  0x00007f78d862cac7 in __pthread_timedjoin_ex () from /usr/lib64/libpthread.so.0
#1  0x00007f7876d43541 in efsw::Platform::ThreadImpl::wait (this=0x7f777cfa36a0) at external/com_github_efsw/src/efsw/platform/posix/ThreadImpl.cpp:25
#2  0x00007f7876d3de3b in efsw::Thread::wait (this=0x7f777c14c2a0) at external/com_github_efsw/src/efsw/Thread.cpp:22
#3  0x00007f7876d3dd48 in efsw::Thread::~Thread (this=0x7f777c14c2a0, __in_chrg=<optimized out>) at external/com_github_efsw/src/efsw/Thread.cpp:9
#4  0x00007f7876d3dd98 in efsw::Thread::~Thread (this=0x7f777c14c2a0, __in_chrg=<optimized out>) at external/com_github_efsw/src/efsw/Thread.cpp:12
#5  0x00007f7876d2cf0b in efsw::FileWatcherInotify::~FileWatcherInotify (this=0x7f7768002fe0, __in_chrg=<optimized out>)
    at external/com_github_efsw/src/efsw/FileWatcherInotify.cpp:50
#6  0x00007f7876d2d0d6 in efsw::FileWatcherInotify::~FileWatcherInotify (this=0x7f7768002fe0, __in_chrg=<optimized out>)
    at external/com_github_efsw/src/efsw/FileWatcherInotify.cpp:68
#7  0x00007f7876d2b351 in efsw::FileWatcher::~FileWatcher (this=0x7f777c847160, __in_chrg=<optimized out>) at external/com_github_efsw/src/efsw/FileWatcher.cpp:67
#8  0x00007f7876d2b378 in efsw::FileWatcher::~FileWatcher (this=0x7f777c847160, __in_chrg=<optimized out>) at external/com_github_efsw/src/efsw/FileWatcher.cpp:68

According to above stackes, it's watcher deletion happened before watcher entry_point running, which means m_objest had been deleted and memory undetected in that time

// Specialization using a member function
template <typename C> struct ThreadMemberFunc : ThreadFunc {
    ThreadMemberFunc( void ( C::*function )(), C* object ) :
        m_function( function ), m_object( object ) {}
    virtual void run() { ( m_object->*m_function )(); }
    void ( C::*m_function )();
    C* m_object;
};
SpartanJ commented 1 year ago

Hi! Why are you using FileWatcherInotify directly? That class is private. But as far as I can remember it should work if called directly. I'm unable to reproduce the bug with a minimal example:

#include <efsw/efsw.hpp>

int main( int argc, char** argv ) {
    efsw::FileWatcher* fileWatcher = new efsw::FileWatcher();
    fileWatcher->watch();
    delete fileWatcher;
    return 0;
}

Neither with a watched dir:

#include <efsw/efsw.hpp>
#include <iostream>

class UpdateListener : public efsw::FileWatchListener {
  public:
    UpdateListener() {}

    void handleFileAction( efsw::WatchID, const std::string&, const std::string&, efsw::Action,
                           std::string ) {}
};

int main( int argc, char** argv ) {
    efsw::FileWatcher* fileWatcher = new efsw::FileWatcher();
    UpdateListener* ul = new UpdateListener();
    fileWatcher->addWatch( ".", ul );
    fileWatcher->watch();
    delete fileWatcher;
    delete ul;
    return 0;
}

I'm missing something?

ashione commented 1 year ago

Hi!

Why are you using FileWatcherInotify directly? That class is private. But as far as I can remember it should work if called directly.

I'm unable to reproduce the bug with a minimal example:


#include <efsw/efsw.hpp>

int main( int argc, char** argv ) {

  efsw::FileWatcher* fileWatcher = new efsw::FileWatcher();

  fileWatcher->watch();

  delete fileWatcher;

  return 0;

}

Neither with a watched dir:


#include <efsw/efsw.hpp>

#include <iostream>

class UpdateListener : public efsw::FileWatchListener {

  public:

  UpdateListener() {}

  void handleFileAction( efsw::WatchID, const std::string&, const std::string&, efsw::Action,

                         std::string ) {}

};

int main( int argc, char** argv ) {

  efsw::FileWatcher* fileWatcher = new efsw::FileWatcher();

  UpdateListener* ul = new UpdateListener();

  fileWatcher->addWatch( ".", ul );

  fileWatcher->watch();

  delete fileWatcher;

  delete ul;

  return 0;

}

I'm missing something?

I just want to make someone getting main code path and why it crashed,which does not means i use filewatcherinotfiy directly. You might catch this from stack filewatcher was used.

ashione commented 1 year ago

In my usage scenario, it will appear once about an hour running. Especially in the case of high load machine, pthread_create returns does not mean that the thread has been created and run entry point function successfully. Actually you can deduce it from the way the code runs, which is indeed possible. @SpartanJ

SpartanJ commented 1 year ago

Oh, I understand. I thought that your code snipped was a literal example. If I understand well mIsActive should be only true when entryPoint is called / running and it should be atomic. Do you agree?

SpartanJ commented 1 year ago

Let me know if the last commit helps with your issue.

ashione commented 1 year ago

Let me know if the last commit helps with your issue.

I pack this commit to my program but still crashed. So perhas it not works yet.

SpartanJ commented 1 year ago

Sorry, the fix I sent was for another kind of error and also was worst than before. I misinterpreted the crash completely. I think that now I have a clear picture of the issue but I have no idea how I can avoid it in 100% of the cases. Reading the Linux documentation says basically that the thread can start at any time, I'm assuming that in your case is taking enough time to be called so late that the owner (the Thread instance and the ThreadImpl instance) does not exist anymore. I assumed that the call to pthread_join should prevent that (the Thread destructor calls to pthread_join before destroying ThreadImpl and its own instance), but it seems that pthread_join does nothing if the thread did not start. I sent another change that will check if the join worked, and if that didn't work will call to a pthread_cancel. If that doesn't work, I actually have no idea how this could be fixed. In your use case, you're spawning many efsw::FileWatcher and destroying them very quickly or something alike? Otherwise, I don't see how this could happen at all.

SpartanJ commented 1 year ago

Also: I tried to replicate it by doing something stupid like:

int main( int argc, char** argv ) {
    int c = 0;
    UpdateListener* ul = new UpdateListener();
    while ( c++ < 10000 ) {
        efsw::FileWatcher* fileWatcher = new efsw::FileWatcher();
        fileWatcher->addWatch( ".", ul );
        fileWatcher->watch();
        delete fileWatcher;
    }

    delete ul;
    return 0;
}

And pthread_join is doing its job, it always waits for the entryPoint to run at completion. That confuses me more.

ashione commented 1 year ago

I'm going to figure out the root cause of this issue and upload lastest clues ASAP.

maddanio commented 9 months ago

Maybe using c++11 threads helps?

maddanio commented 9 months ago

maybe try my PR that uses c++11 primitives for threading and locking: https://github.com/SpartanJ/efsw/pull/170

SpartanJ commented 9 months ago

Thanks for the collaboration. It seems that still needs some work to compile properly. You had this same issue? Because I wasn't able to reproduce it yet.

maddanio commented 9 months ago

what needs work? I have no problem comliling?

maddanio commented 9 months ago

no i did not, but the report somewhat lowered my trust in the hand rolled thread management, thus I pulled in the c++11 stuff, which I think is the best approach

ashione commented 8 months ago

no i did not, but the report somewhat lowered my trust in the hand rolled thread management, thus I pulled in the c++11 stuff, which I think is the best approach

Perphas it's better way to use standard thread library. Actually it's weird and no more detail information or core dump can be digged out. I notice that filewatherinotify and some objects might not be threadsafe in high frequency creation or destruction.

SpartanJ commented 8 months ago

I agree, efsw was using custom thread implementation because it was created actually before C++11 and I kept compatibility for older compilers until just recently. But know it's possible to migrate everything to C++11 standard headers. The PR looks OK but I'll not merge it until I have enough time to test it on all platforms.