galeone / tfgo

Tensorflow + Go, the gopher way
https://pgaleone.eu/tensorflow/go/2017/05/29/understanding-tensorflow-using-go/
Apache License 2.0

Potential memory leak on reloading model #72

Open jriegner opened 1 year ago

jriegner commented 1 year ago

Hey guys,

We use tfgo and notice an increase in memory usage each time our model is reloaded. We have a running service that periodically checks whether the model has been updated and reloads it if so. I wouldn't expect memory usage to increase, since the model in memory should be replaced by the updated one.

The code to load the model is

    // load model into memory
    model := tg.LoadModel(
        "path/to/our/model",
        []string{"serve"},
        &tf.SessionOptions{},
    )

But our monitoring shows that memory usage goes up every time the model is reloaded (once per hour). I profiled the service with pprof and could not find any internal component in our code with significantly growing memory usage.

Furthermore, I built TensorFlow 2.9.1 with debug symbols and wrote a small Go app that just reloads the model, in order to check for memory leaks with memleak-bpfcc from https://github.com/iovisor/bcc. This gave me the following stack trace, which, I believe, shows that memory is being leaked:

    1770048 bytes in 9219 allocations from stack
        operator new(unsigned long)+0x19 [libstdc++.so.6.0.28]
        google::protobuf::internal::GenericTypeHandler<tensorflow::NodeDef>::New(google::protobuf::Arena*)+0x1c [libtensorflow_framework.so.2]
        google::protobuf::internal::GenericTypeHandler<tensorflow::NodeDef>::NewFromPrototype(tensorflow::NodeDef const*, google::protobuf::Arena*)+0x20 [libtensorflow_framework.so.2]
        google::protobuf::RepeatedPtrField<tensorflow::NodeDef>::TypeHandler::Type* google::protobuf::internal::RepeatedPtrFieldBase::Add<google::protobuf::RepeatedPtrField<tensorflow::NodeDef>::TypeHandler>(google::protobuf::RepeatedPtrField<tensorflow::NodeDef>::TypeHandler::Type*)+0xc2 [libtensorflow_framework.so.2]
        google::protobuf::RepeatedPtrField<tensorflow::NodeDef>::Add()+0x21 [libtensorflow_framework.so.2]
        tensorflow::FunctionDef::add_node_def()+0x20 [libtensorflow_framework.so.2]
        tensorflow::FunctionDef::MergePartialFromCodedStream(google::protobuf::io::CodedInputStream*)+0x334 [libtensorflow_framework.so.2]
        bool google::protobuf::internal::WireFormatLite::ReadMessage<tensorflow::FunctionDef>(google::protobuf::io::CodedInputStream*, tensorflow::FunctionDef*)+0x64 [libtensorflow_framework.so.2]
        tensorflow::FunctionDefLibrary::MergePartialFromCodedStream(google::protobuf::io::CodedInputStream*)+0x240 [libtensorflow_framework.so.2]
        bool google::protobuf::internal::WireFormatLite::ReadMessage<tensorflow::FunctionDefLibrary>(google::protobuf::io::CodedInputStream*, tensorflow::FunctionDefLibrary*)+0x64 [libtensorflow_framework.so.2]
        tensorflow::GraphDef::MergePartialFromCodedStream(google::protobuf::io::CodedInputStream*)+0x291 [libtensorflow_framework.so.2]
        bool google::protobuf::internal::WireFormatLite::ReadMessage<tensorflow::GraphDef>(google::protobuf::io::CodedInputStream*, tensorflow::GraphDef*)+0x64 [libtensorflow_framework.so.2]
        tensorflow::MetaGraphDef::MergePartialFromCodedStream(google::protobuf::io::CodedInputStream*)+0x325 [libtensorflow_framework.so.2]
        bool google::protobuf::internal::WireFormatLite::ReadMessage<tensorflow::MetaGraphDef>(google::protobuf::io::CodedInputStream*, tensorflow::MetaGraphDef*)+0x64 [libtensorflow_framework.so.2]
        tensorflow::SavedModel::MergePartialFromCodedStream(google::protobuf::io::CodedInputStream*)+0x25b [libtensorflow_framework.so.2]
        google::protobuf::MessageLite::MergeFromCodedStream(google::protobuf::io::CodedInputStream*)+0x32 [libtensorflow_framework.so.2]
        google::protobuf::MessageLite::ParseFromCodedStream(google::protobuf::io::CodedInputStream*)+0x3e [libtensorflow_framework.so.2]
        tensorflow::ReadBinaryProto(tensorflow::Env*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, google::protobuf::MessageLite*)+0x141 [libtensorflow_framework.so.2]
        tensorflow::(anonymous namespace)::ReadSavedModel(absl::lts_20211102::string_view, tensorflow::SavedModel*)+0x136 [libtensorflow_framework.so.2]
        tensorflow::ReadMetaGraphDefFromSavedModel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, tensorflow::MetaGraphDef*)+0x5d [libtensorflow_framework.so.2]
        tensorflow::LoadSavedModelInternal(tensorflow::SessionOptions const&, tensorflow::RunOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, tensorflow::SavedModelBundle*)+0x41 [libtensorflow_framework.so.2]
        tensorflow::LoadSavedModel(tensorflow::SessionOptions const&, tensorflow::RunOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, tensorflow::SavedModelBundle*)+0xc0 [libtensorflow_framework.so.2]
        TF_LoadSessionFromSavedModel+0x2a8 [libtensorflow.so]
        _cgo_6ae2e7a71f9a_Cfunc_TF_LoadSessionFromSavedModel+0x6e [testapp]
        runtime.asmcgocall.abi0+0x64 [testapp]
        github.com/galeone/tensorflow/tensorflow/go._Cfunc_TF_LoadSessionFromSavedModel.abi0+0x4d [testapp]
        github.com/galeone/tensorflow/tensorflow/go.LoadSavedModel.func2+0x14f [testapp]
        github.com/galeone/tensorflow/tensorflow/go.LoadSavedModel+0x2b6 [testapp]
        github.com/galeone/tfgo.LoadModel+0x6d [testapp]
        main.reloadModel+0x276 [testapp]
        main.main+0x72 [testapp]
        runtime.main+0x212 [testapp]
        runtime.goexit.abi0+0x1 [testapp]

As you can see, the stack trace contains calls both to tfgo and to the underlying TensorFlow library. I'm not sure I'm reading it correctly, but it looks like there is a leak in tfgo or in TensorFlow itself.

Is there a way to explicitly release the memory of a loaded model when we reload? Could it be a problem in tfgo? If you need more information, please let me know.

Thanks in advance :)

galeone commented 1 year ago

Hi @jriegner , thank you for this report!

You're right, there's a leak. And I guess I found where the leak is :smile:

LoadModel returns a *tfgo.Model that contains a TensorFlow *tf.SavedModel. This tf.SavedModel has a *tf.Session field.

When the model is deleted, the session is not automatically closed (it's the old Python tf.Session, if you are familiar with TensorFlow < 2), so I guess the leak comes from the missing invocation of session.Close().

I guess we have 2 options:

  1. Expose the session and let the user invoke the Close() method (I don't like it)
  2. Create a finalizer that automatically invokes the session.Close() when the tfgo.Model is garbage collected.

I like the second approach more.

Right now I'm travelling, so I don't know when I'll be able to work on it (I'm setting up the development environment right now while I'm on a train, lol) - if you want to implement the second option, I'll be more than happy to review and merge it.

galeone commented 1 year ago

Update: I guess I made it https://github.com/galeone/tfgo/commit/b53620202dc55ebc2f12a299e65837f51737f83b

Give it a try and let me know if it works

jriegner commented 1 year ago

Hey, thanks for your fast reply!

I ran my local test app with the updated code and memleak-bpfcc still complained about leaking memory. I verified that the finalizer ran and also explicitly closed the session on every reload. Same result.

I will check with our production setup too and will come back then.

Edit: I monitored our service with the latest version of tfgo for a while and can see that memory still increases on each reload.

galeone commented 1 year ago

If it's not the leak related to the missing session close, I have no other clue :thinking:

For sure there was a leak caused by the missing session.Close(), which should now be fixed (I forgot to add the same line in tg.ImportModel and fixed that yesterday).

But I have no idea what could be causing this issue on the tfgo side - maybe it's a leak in the TensorFlow C library.

jriegner commented 1 year ago

Alright, I will update the version on my side and give it a try. Thanks for your support, maybe the last update fixed it 👍🏻

LoveVsLike commented 9 months ago


I tried it, but it does not work.

galeone commented 9 months ago

Hi @LoveVsLike - I guess the problem is inside the TensorFlow bindings. As you can see, in tfgo we just open the session and, using a finalizer, close it when the model is collected. So the problem is in the TensorFlow code, I guess.

Perhaps you can try opening an issue on https://github.com/tensorflow/tensorflow and linking this thread there. Maybe someone from the TensorFlow team can help us.

Anyway, since tfgo is still using TensorFlow 2.9.1 I can try to update my fork to 2.14. Maybe the leak is already fixed and we don't know it (but I'm not confident, since it has been years and this leak is still present version after version...).

I'll update the fork and I'll let you know.

LoveVsLike commented 8 months ago


Hi, I updated TF and tfgo to the latest versions, but it still does not work. Can you open an issue with the TensorFlow team?