OpenZWave / open-zwave

a C++ library to control Z-Wave Networks via a USB Z-Wave Controller.
http://www.openzwave.net/
GNU Lesser General Public License v3.0
1.05k stars 917 forks source link

pthread not joined properly and thereby leaking memory #583

Open CurlyMoo opened 9 years ago

CurlyMoo commented 9 years ago

Version: 2015-05-28 23:26:31.852 Always, OpenZwave Version 1.2.919 Starting Up

==13737==
==13737== HEAP SUMMARY:
==13737==     in use at exit: 304 bytes in 1 blocks
==13737==   total heap usage: 6,769 allocs, 6,768 frees, 627,951 bytes allocated
==13737==
==13737== 304 bytes in 1 blocks are possibly lost in loss record 1 of 1
==13737==    at 0x4C2AD10: calloc (vg_replace_malloc.c:623)
==13737==    by 0x4010FD1: allocate_dtv (dl-tls.c:296)
==13737==    by 0x40116DD: _dl_allocate_tls (dl-tls.c:460)
==13737==    by 0x534CC27: allocate_stack (allocatestack.c:589)
==13737==    by 0x534CC27: pthread_create@@GLIBC_2.2.5 (pthread_create.c:495)
==13737==    by 0x608095: OpenZWave::ThreadImpl::Start(void (*)(OpenZWave::Event*, void*), OpenZWave::Event*, void*) (ThreadImpl.cpp:89)
==13737==    by 0x5E009F: OpenZWave::Thread::Start(void (*)(OpenZWave::Event*, void*), void*) (Thread.cpp:76)
==13737==    by 0x57F471: OpenZWave::Driver::Start() (Driver.cpp:346)
==13737==    by 0x4C5524: OpenZWave::Manager::AddDriver(std::string const&, OpenZWave::Driver::ControllerInterface const&) (Manager.cpp:278)
==13737==    by 0x4BD0E7: zwaveHwInit() (zwave.cpp:272)
==13737==    by 0x4907D7: start_pilight (daemon.c:2486)
==13737==    by 0x490A9B: main (daemon.c:2727)
==13737==
==13737== LEAK SUMMARY:
==13737==    definitely lost: 0 bytes in 0 blocks
==13737==    indirectly lost: 0 bytes in 0 blocks
==13737==      possibly lost: 304 bytes in 1 blocks
==13737==    still reachable: 0 bytes in 0 blocks
==13737==         suppressed: 0 bytes in 0 blocks
==13737==
==13737== For counts of detected and suppressed errors, rerun with: -v
==13737== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

Relevant code snippets:

Initialize:

    OpenZWave::Options::Get()->AddOptionBool("logging", true);
    OpenZWave::Options::Get()->AddOptionInt("SaveLogLevel", OpenZWave::LogLevel_Warning);
    OpenZWave::Options::Get()->AddOptionInt("PollInterval", 500);
    OpenZWave::Options::Get()->AddOptionBool("IntervalBetweenPolls", true);
    OpenZWave::Options::Get()->AddOptionBool("AppendLogFile", true);
    OpenZWave::Options::Get()->AddOptionBool("ConsoleOutput", true);
    OpenZWave::Options::Get()->Lock();
    OpenZWave::Manager::Create();
    OpenZWave::Manager::Get()->AddWatcher(OnNotification, NULL);
    OpenZWave::Manager::Get()->AddDriver(com);

Deinitialize:

    OpenZWave::Manager::Get()->RemoveWatcher(OnNotification, NULL);
    OpenZWave::Manager::Destroy();
    OpenZWave::Options::Destroy();

Adding a small check in bool ThreadImpl::Terminate tells me that this function is never called. There is another issue in this function, but i will get to that later.

To confirm the (possible) leak, i added a pthread_detach after the pthread_create in bool ThreadImpl::Start making the join obsolete. This resolves the leak altogether.

So the issue here is either:

There is however still another issue in the code. A pthread_cancel followed by a pthread_join can leak memory when the running thread is not gracefully stopped (and could not garbage collect its own processes). The Thread class should somehow watch if a thread is terminated before the pthread_join is called.

A small snippet how i do this in C:

static int threads = 0;
static int loop = 1;
void *thread(void *) {
   [lock mutex]
   threads++;
   [unlock mutex]

   [code logic with loop / conditional wait / etc.]

   [lock mutex]
   threads--;
   [unlock mutex]
}

void gc(void) {
   loop = 0;
   [unlock mutex]
   [signal mutex]
   while(threads > 0) {
      usleep(10);
   }
   [join thread]
}
Fishwaldo commented 9 years ago

You should be calling Manager::Get()->RemoveDriver() when shutting down before the Manager::Destroy() (but that leads to a race condition sometimes)

That will gracefully send any queued up notifications etc and should terminate the threads you are seeing hanging around. (Thread->Stop() will signal the thread to exit, hang around for 2 seconds if it doesn't gracefully exit and then call Thread->Terminate())

if doing it that way, I dont see any leaks in valgrind.

(A Manager::Destroy() should be called after that - It forcefully "destroys" everything without giving much of a change for anything to cleanup or flush queues etc).

CurlyMoo commented 9 years ago

You should be calling Manager::Get()->RemoveDriver() when shutting down before the Manager::Destroy() (but that leads to a race condition sometimes)

Why is there actually a possibility of a race condition? Shouldn't you be sure there can't be any? I don't feel invited to call RemoveDriver if there is this chance.

(Thread->Stop() will signal the thread to exit, hang around for 2 seconds if it doesn't gracefully exit and then call Thread->Terminate())

Again, you should just know when a thread is stopped not give it a certain time before it has to stop. Even if that takes 3 seconds. If you think it could be still hanging around after 2 seconds then there is some fault in the threaded function. IMHO programmers should never need a pthread_cancel except in absolute emergency situations.

My example clearly shows how this could be properly implemented.

Fishwaldo commented 9 years ago

The race condition exists on the boundry between OZW and the application. Try to follow me here: Notifications are actually Queued up and sent the Application by the main loop in the Driver Thread. When we shutdown the Driver Thread, we flush the Notification Queue, and all the Notifications get delivered to the application (#1), then the driver thread continues to deallocate as well. If any of the Application Notification Handlers call back to the Manager::Get()->whatever functions while the Driver Thread is being destroyed (eg, you do a Manager::Get->GetValueIDLabel or whatnot) then the Driver Class (and thread) could be in various stages of destruction and either throw a exception (HomeID doesn't exist) or return "incomplete" data.

So ideally (as you have done above) is to remove the Notifications before destroying the Driver Thread, but then you would manually have to remove nodes etc from your application rather than handling the NodeRemoved Notifications etc.

As for the timeout - Welcome to the wonderful world of cross platform, where we don't have the luxury of coding for just Linux. The generic code has to deal with Windows threading as well... (and there is some work on Android that could further complicate the threading classes in the future).

As Note 2 indicated above, the reason we give 2 second timeouts here before forcefully destroying the thread is because we are event driven to the application via Callbacks, writing out config files, flushing buffers etc. The timeout is to ensure that we never "block" because of a poor implementation. I havn't seen anything that actually takes longer than 2 seconds though, so its just a "it works for us" number.

CurlyMoo commented 9 years ago

So ideally (as you have done above) is to remove the Notifications before destroying the Driver Thread, but then you would manually have to remove nodes etc from your application rather than handling the NodeRemoved Notifications etc.

Ok, the removal of nodes is not a problem, but in my implementation the memory issue still occurs. You say that is fixed by calling Manager::Get()->RemoveDriver() before OpenZWave::Manager::Destroy();? As long as the Manager isn't used afterwards no race condition should occur.

As for the timeout - Welcome to the wonderful world of cross platform, where we don't have the luxury of coding for just Linux.

I use winpthread (on Windows) in my cross-platform applications and have good experience with it.