jriegner opened this issue 1 year ago
Hi @jriegner, thank you for this report!
You're right, there's a leak. And I guess I found where the leak is :smile:
`LoadModel` returns a `*tfgo.Model` that contains a TensorFlow `*tf.SavedModel`. This `tf.SavedModel` has a `*tf.Session` field.
When the model is deleted, the session is not automatically closed (it's the old `tf.Session` of Python, if you are familiar with TensorFlow < 2), and thus I guess the leak is in the missing invocation of `session.Close()`.
I guess we have 2 options:

1. Expose a public `Close()` method (I don't like it)
2. Close the session when the `tfgo.Model` is garbage collected.

I kinda like the second approach more.
Right now I'm travelling, so I don't know when I'll be able to work on it (I'm setting up the development environment right now while I'm on a train lol). If you want to implement the second option, I'll be more than happy to review and merge it.
Update: I guess I made it https://github.com/galeone/tfgo/commit/b53620202dc55ebc2f12a299e65837f51737f83b
Give it a try and let me know if it works
Hey, thanks for your fast reply!
I ran my local test app with the updated code and `memleak-bpfcc` still complained about leaking memory. I verified that the finalizer ran and also explicitly closed the session on every reload. Same result.
I will check with our production setup too and will come back then.
Edit:
I monitored our service with the latest version of `tfgo` for a while and can see that the memory increases on each reload.
If it's not the leak related to the missing session close, I have no other clue :thinking:
For sure there was a leak caused by the missing `session.Close()`, which now should be fixed (I forgot to add the same line in `tg.ImportModel` and I fixed it yesterday).
But I have no idea what could cause this issue on the tfgo side - maybe it's a leak in the TensorFlow C library.
Alright, I will update the version on my side and give it a try. Thanks for your support, maybe the last update fixed it 👍🏻
> Update: I guess I made it b536202
> Give it a try and let me know if it works
I tried it, but it doesn't work!
Hi @LoveVsLike - I guess the problem is inside the TensorFlow bindings. As you can see, in tfgo we just open the session, and using a finalizer we close it when the model is collected. So the problem is in the TensorFlow code, I guess.
Perhaps you can try to open an issue on https://github.com/tensorflow/tensorflow and link this thread there. Maybe, someone from the TensorFlow team can help us.
Anyway, since tfgo is still using TensorFlow 2.9.1 I can try to update my fork to 2.14. Maybe the leak is already fixed and we don't know it (but I'm not confident, since it has been years and this leak is still present version after version...).
I'll update the fork and I'll let you know.
Hi, I updated TF to the latest version along with tfgo, but it still doesn't work. Can you open an issue on the TensorFlow repo?
Hey guys,

We use `tfgo` and notice an increase of memory usage each time our model gets reloaded. We have a running service which periodically checks whether the model got updated and reloads it. Now I wouldn't expect the memory usage to increase, since the model in memory should be replaced by the updated one. The code to load the model is:

But our monitoring shows that the usage goes up every time the model gets reloaded (once per hour). I profiled the service with `pprof` and could not see that any of the internal components in our code has a significantly growing memory usage.

Furthermore, I built TensorFlow 2.9.1 with debug symbols and wrote a small Go app just reloading the model. I did this to check for memory leaks with `memleak-bpfcc` from https://github.com/iovisor/bcc. This gave me the following stack trace, which, I believe, shows that there is memory leaked:

As you can see, this stack trace shows calls to `tfgo` and to the underlying TensorFlow library. I am not sure if I read it right, but it seems like there is a leak in `tfgo` or TensorFlow itself.

Is there a way to explicitly release the memory of a loaded model when we reload? Could it be a problem in `tfgo`? If you need more information on this, please tell me.

Thanks in advance :)