cmu-db / peloton

The Self-Driving Database Management System
http://pelotondb.io
Apache License 2.0
2.04k stars 623 forks source link

Tensorflow background thread problem #1232

Open saatviks opened 6 years ago

saatviks commented 6 years ago

Tensorflow doesnt expose any API to shut down background threads when closing a Tensorflow session. It has a currently open issue regarding this. As a result Valgrind(and possibly the Address Sanitizer) is bound to throw errors related to this.

There's nothing much we can do until Tensorflow releases a relevant API for the same.

saatviks commented 6 years ago

@pervazea @malin1993ml I'm curious about the correct way to handle suppressions to ensure that I supress the correct set of leaks(and future genuine memory leaks dont go unnoticed). I think the current way of supressingtf_session_entity.cpp is incorrect since future genuine user introduced memory leaks might go unnoticed.

It seems that a regex like tensorflow::thread would be more suitable? I'm not confident though that this would cover all possible cases.

PFA the Valgrind(from my machine) log and Address Sanitizer(from Travis) log. For more context, these logs are generated by running the tensorflow_test from #1248 . I can confirm that the leak should introduce 2 Valgrind errors in the No-op case due to the linked bug.

What do you think would be the best way to proceed with this?

address_sanitizer_log.txt valgrind_log.txt

pervazea commented 6 years ago

I don't know that I have a good answer, without studying the issue. Even after that, this is going to be imprecise.

In general, one wants to suppress only specific leaks, so that suppression should identify the leak location as precisely as possible. One could:

  1. Go as far down the stack as possible (which is is what we've done with tf_session_entity.cpp). Going even further down is possible, but transitions to pre-compiled library rather than source code. In some of the cases we've fixed, this (the lowest level) has not been the real leak location.

  2. Go up the stack, examining source code, until one finds the real place where something is allocated (but never freed). This requires reading and understanding the code.

  3. If the leaks end up being from binaries, then it will be difficult. Will likely have to suppress broadly in the binary.

  4. For a tensorflow::thread suppression, I'd be concerned that it is too general a suppression.

So, looking just at the reported stack traces, is going to be insufficient. Someone familiar with the code (or should study the code enough to be familiar) should examine the stack locations. If we can identify the real location of the leak, then we can define a reliable suppression regex.

saatviks commented 6 years ago

Thanks @pervazea and @malin1993ml - based on your suggestions have set the following as the new leak supression for this bug:

+leak:tensorflow::thread::EigenEnvironment::CreateThread
+leak:tensorflow::thread::ThreadPool::ThreadPool
+leak:__cxa_thread_atexit_impl