eaplatanios / tensorflow_scala

TensorFlow API for the Scala Programming Language
http://platanios.org/tensorflow_scala/
Apache License 2.0

Memory Leak when create Tensor for a training session #87

Closed lucataglia closed 6 years ago

lucataglia commented 6 years ago

@eaplatanios Here is the gist containing the code I used to reproduce the memory leak: https://gist.github.com/lucaRadicalbit/e45d895b859797f010d856ac8aa22a13

Basically, I create two tensors inside an infinite loop and then run the session using those tensors as feeds. According to VisualVM, after 8000 iterations (about 6 minutes) the situation was:

[Screenshot: VisualVM sampler]

[Screenshot: VisualVM heap]

As you can see in the gist, I also tried calling the close() method on the tensors, but it didn't help. I don't know whether I have misunderstood how the heap is handled or whether there really is a memory leak, as there seems to be.
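For readers who want to check the heap trend without a profiler, here is a minimal sketch that samples JVM heap usage while a loop runs. It uses only the standard java.lang.Runtime API. The HeapProbe object, its method names, and the placeholder workload are hypothetical and are not taken from the gist; in a real check the body of the loop would be the tensor creation and session run described above. It only observes the JVM heap (the same thing the VisualVM heap view shows), not native memory.

```scala
object HeapProbe {
  private val runtime = Runtime.getRuntime

  /** Approximate JVM heap currently in use, in mebibytes. */
  def usedHeapMiB(): Long =
    (runtime.totalMemory() - runtime.freeMemory()) / (1024L * 1024L)

  /** Runs `step` for `iterations` iterations and records the used heap
    * after every `every` iterations as (iteration, MiB) pairs. */
  def sample(iterations: Int, every: Int)(step: Int => Unit): Vector[(Int, Long)] = {
    val samples = Vector.newBuilder[(Int, Long)]
    var i = 1
    while (i <= iterations) {
      step(i)
      if (i % every == 0) {
        System.gc() // Only a hint to the JVM; reduces noise from garbage not yet collected.
        samples += ((i, usedHeapMiB()))
      }
      i += 1
    }
    samples.result()
  }

  def main(args: Array[String]): Unit = {
    // Placeholder workload (an assumption): a short-lived allocation that
    // should NOT leak. Replace it with the code under suspicion.
    val samples = sample(iterations = 10000, every = 1000) { _ =>
      val scratch = new Array[Byte](1024 * 1024)
      scratch(0) = 1
    }
    samples.foreach { case (i, mib) => println(s"iteration $i: ~$mib MiB used") }
  }
}
```

If the reported value keeps climbing across samples even after the garbage collection hint, that is consistent with objects being retained rather than with normal allocation churn.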

eaplatanios commented 6 years ago

@lucaRadicalbit Thanks for finding out about this and profiling it. I'll look into it this weekend. It does seem like an important issue. :)

lucataglia commented 6 years ago

@eaplatanios I saw the changes in the commit. I retried with the example in the gist, but I still have this memory leak. Do I need to call some particular method of the Tensor or Session class to free the memory manually?

eaplatanios commented 6 years ago

@lucaRadicalbit Actually, you shouldn't need to call anything now. Did you try clearing your Ivy cache so that SBT pulls in the updated artifacts? It should be located somewhere like ~/.ivy2/cache/org.platanios.

eaplatanios commented 6 years ago

@lucaRadicalbit Could you please confirm that the memory leak was resolved? :)

lucataglia commented 6 years ago

@eaplatanios Sorry, I was away for 3 days. I'll run the test right away.

lucataglia commented 6 years ago

@eaplatanios https://gist.github.com/lucaRadicalbit/e45d895b859797f010d856ac8aa22a13 I just ran the example that I wrote in the comment on the gist page, without using the close() method, and after 15000 iterations of the while loop the situation was:

[Screenshot: 2018-03-19 at 10:33:10]

[Screenshot: 2018-03-19 at 10:33:32]

Do you get different profiler data if you run the same code on your machine? The profiler I use can be started from a terminal by typing jvisualvm. These are the sbt settings I set in the terminal for my test:
export SBT_OPTS="-Xmx5G -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=4G -Xss2M"

eaplatanios commented 6 years ago

@lucaRadicalbit Does this still happen after you clean the Ivy cache and reload the artifacts? I'm getting very different behavior.

lucataglia commented 6 years ago

@eaplatanios Yes, I cleaned out the whole .ivy/ folder. I'll try again this morning and let you know. Can you think of any reason other than the cache that could be causing the problem?

lucataglia commented 6 years ago

@eaplatanios I found the problem: I hadn't updated the version. I'll try now and let you know the results.
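For context, picking up a snapshot build in sbt usually means changing the version string and making sure a snapshot repository is among the resolvers. The fragment below is only a sketch of what such a build.sbt entry might look like: the "tensorflow" artifact name and the use of the Sonatype snapshots repository are assumptions, not details confirmed in this thread. Only the org.platanios organization and the 0.1.2-SNAPSHOT version appear in the discussion.

```scala
// build.sbt (sketch; artifact name and resolver are assumptions)

// Snapshot artifacts are not published to Maven Central, so a snapshot
// repository has to be listed explicitly.
resolvers += Resolver.sonatypeRepo("snapshots")

// Bump the version so sbt stops resolving the previously cached release.
libraryDependencies += "org.platanios" %% "tensorflow" % "0.1.2-SNAPSHOT"
```

After changing the version, running sbt's update task (or reloading the project) makes sbt resolve the new artifacts.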

lucataglia commented 6 years ago

Using version 0.1.2-SNAPSHOT, the memory leak in the example from the gist is gone, but when I run my application I get this error:

2018-03-21 14:09:51.119080: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.2 AVX AVX2 FMA
dyld: lazy symbol binding faileddyld: lazy symbol binding faileddyld: lazy symbol binding failed: Symbol not found: _TF_TryEvalu: Symbol not found: _TF_TryEvalu: Symbol not found: _TF_TryEvaluateConstant
  Referenced from: /ateConstant
  Referenced from: /ateConstant
  Referenced from: /private/var/folders/60/4vxzt5fs3private/var/folders/60/4vxzt5fs3private/var/folders/60/4vxzt5fs3md_nx8pq4gw6nph0000gn/T/tensorflmd_nx8pq4gw6nph0000gn/T/tensorflmd_nx8pq4gw6nph0000gn/T/tensorflow_scala_native_libraries1579779ow_scala_native_libraries1579779ow_scala_native_libraries1579779469814695543/libtensorflow_jni.s469814695543/libtensorflow_jni.s469814695543/libtensorflow_jni.so
  Expected in: /private/var/foo
  Expected in: /private/var/foo
  Expected in: /private/var/folders/60/4vxzt5fs3md_nx8pq4gw6nplders/60/4vxzt5fs3md_nx8pq4gw6nplders/60/4vxzt5fs3md_nx8pq4gw6nph0000gn/T/tensorflow_scala_nativh0000gn/T/tensorflow_scala_nativh0000gn/T/tensorflow_scala_native_libraries1579779469814695543/le_libraries157977946981469e_libraries1579779469814695543/libtensorflow.so

dyld: Symbol not found: _TF_TryEvaluateConstant
  Referenced from: /private/var/folders/60/4vxzt5fs3md_nx8pq4gw6nph0000gn/T/tensorflow_scala_native_libraries1575543/libtensorflow.so

9779469814695543/libtensorflow_jni.so
  Expected in: /private/vaibtensorflow.so

r/folders/60/4vxzt5fs3md_nx8pq4gw6nph0000gn/T/tensorflow_scala_ndyld: Symbol not found: _TF_TryEdyld: Symbol not found: _TF_TryEative_libraries15797794698146955valuateConstant
  Referenced frovaluateConstant
  Referenced fro43/libtensorflow.so

m: /private/var/folders/60/4vxztm: /private/var/folders/60/4vxzt5fs3md_nx8pq4gw6nph0000gn/T/tens5fs3md_nx8pq4gw6nph0000gn/T/tensorflow_scala_native_libraries157orflow_scala_native_libraries1579779469814695543/libtensorflow_j9779469814695543/libtensorflow_jni.so
  Expected in: /private/var/folders/60/4vxzt5fs3md_nx8pq4gw6nph0000gn/T/tensorflow_scala_native_libraries1579779469814695543/libtensorflow.so

ni.so
  Expected in: /private/var/folders/60/4vxzt5fs3md_nx8pq4gdyld: lazy symbol binding failed/usr/local/Cellar/sbt/1.1.0/libexec/bin/sbt-launch-lib.bash: line 58: 32039 Abort trap: 6           "$@"

The error seems to occur when I call the fromMetaGraphDef method of the Saver object.