I have a high-level overview of what happens captured via YourKit; at some point (>30 min of continuous predictions), there is a spike in GC collections and pause time:
Also, the Old Gen takes up most of the space, leaving nothing for new allocations.
Here is a memory snapshot after several forced GCs:
Same benchmark doing multi-threading inference using PyTorch has minimal memory footprint and no GC pauses:
./gradlew benchmark -Dai.djl.default_engine=PyTorch -Dai.djl.repository.zoo.location="https://alpha-djl-demos.s3.amazonaws.com/model/djl-blockrunner/pytorch_resnet18.zip?model_name=traced_resnet18" --args='-c 100000 -d 1440 -t 10 -s 1,3,224,224'
Please see https://github.com/tensorflow/java/pull/229 for a potential fix for this problem in the TF Java library.
@karllessard is it possible to release tf java 0.2.1 with the hotfix? it doesn't look like a breaking change so a patch would be nice.
DJL uses release versions of TF Java
@skirdey 0.3.0-SNAPSHOT is pretty much up now, so please give that a try, but like I tried to explain on the pull request, I don't believe this will fix any issues for DJL.
Good news @skirdey ! Important build issues that were preventing us from releasing 0.3.0 are now fixed; I will provide more details a bit later, but you can also follow this thread to see the progress.
The fix could be useful when I do not set nopointergc=true, as it was also leaking memory; it just took 70 hours to catch it.
That doesn't fix any leak. If there's a leak, it's somewhere else.
DJL team can publish 0.11.0-SNAPSHOT that depends on 0.3.0-SNAPSHOT. I am on it.
We just had our community meeting and I'm targeting a release of 0.3.0 by March 15th from the current snapshot. The only detail is that the Windows MKL+GPU platform won't be available for that release (at least not by that date); all other platforms will.
Hi, just to let you know that 0.3.0 is now available on Maven Central. Please let us know how it goes with your experiments on this issue, thank you
We tried it out and the memory still leaks: very fast when nopointergc=true, and it takes 5 hours to halt when nopointergc=false. @karllessard @saudet
As @skirdey mentioned, here is my setup: a c5.2xlarge instance with jvmArgs -Xmx256m (to make it easier to hit OOM). Command:
./gradlew benchmark -Dai.djl.default_engine=TensorFlow -Dai.djl.repository.zoo.location=https://storage.googleapis.com/tfhub-modules/tensorflow/resnet_50/classification/1.tar.gz?artifact_id=tf_resnet --args='-t 10 -n tf_resnet -c 100000 -s 1,224,224,3' -Dorg.bytedeco.javacpp.nopointergc=true
-Dorg.bytedeco.javacpp.nopointergc=true
took 2m 31s
-Dorg.bytedeco.javacpp.nopointergc=false
took 5h 11m 6s
As a next step, we are going to integrate TfNDManager with PointerScope as @saudet suggested.
@stu1130 Can you point us to the code in DJL for that benchmark?
I mean, what does it end up calling in TF Java? For example, @roywei says that
Yes, we close the input tensor after TfNDArray creation and only keep the constant operand for future eager operations (usually preprocessing before prediction). The preprocessing output will be converted back to a Tensor and fed into the classic Session (loaded from the SavedModel) for inference. We use try-catch blocks around NDManagers, similar to using eager sessions.
It sounds like there are dangling references to "constant operands" that are not closed with their sessions. Why is that? How do you make sure they get deallocated?
Another thing you need to be careful about is that, currently, output Tensor objects are not closed when a Session is closed, but you need to make sure that all output tensors are closed before the Session is closed. If you do not close all output Tensors before closing their Session, memory may still leak even if they are closed later on. It's a bit weird, and I hope the people from TF Java realize that it's not a great way to do things, but I haven't been able to convince them yet. For now, we need to manage all that manually.
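For illustration, a rough sketch of that close-ordering discipline in graph mode. It assumes the TF Java 0.3.x API (SavedModelBundle, Session, Tensor, and a Session.Runner whose run() returns a list of tensors); the "input"/"output" operation names are placeholders, not taken from the model above.

```java
import java.util.List;
import org.tensorflow.SavedModelBundle;
import org.tensorflow.Session;
import org.tensorflow.Tensor;

public final class CloseOrderSketch {

    // Run one prediction and close the output tensors while their Session is still open.
    public static void predict(String modelDir, Tensor input) {
        try (SavedModelBundle bundle = SavedModelBundle.load(modelDir, "serve")) {
            Session session = bundle.session();
            List<Tensor> outputs = session.runner()
                    .feed("input", input)   // placeholder operation names
                    .fetch("output")
                    .run();
            try {
                // ... read the results from outputs.get(0) here ...
            } finally {
                // Close every output tensor before the Session goes away,
                // otherwise native memory may leak as described above.
                for (Tensor t : outputs) {
                    t.close();
                }
            }
        } // bundle.close() closes the Session only after the outputs are gone
    }
}
```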
I mean, what does it end up calling in TF Java? For example, @roywei says that
Yes, we close the input tensor after TfNDArray creation and only keep the constant operand for future eager operations (usually preprocessing before prediction). The preprocessing output will be converted back to a Tensor and fed into the classic Session (loaded from the SavedModel) for inference. We use try-catch blocks around NDManagers, similar to using eager sessions.
It sounds like there are dangling references to "constant operands" that are not closed with their sessions. Why is that? How do you make sure they get deallocated?
When TfNDArray.close() is called, the operand is set to null.
Another thing you need to be careful about is that, currently, output Tensor objects are not closed when a Session is closed, but you need to make sure that all output tensors are closed before the Session is closed. If you do not close all output Tensors before closing their Session, memory may still leak even if they are closed later on. It's a bit weird, and I hope the people from TF Java realize that it's not a great way to do things, but I haven't been able to convince them yet. For now, we need to manage all that manually.
My understanding is that when we create a Tensor from an EagerSession or Session, the pointer to the Tensor is attached to the EagerSession/Session. Now with the new fix, the Tensor is held with a WeakReference and therefore can be released by GC. How does that result in memory leaks? When we close the EagerSession/Session, shouldn't it free all attached Tensors?
The behavior is not the same in eager and graph mode.
In eager mode, it is true that the lifetime of a Tensor is attached to an eager session, and closing this session will release all tensors it has implicitly allocated (e.g. when converting operation results to a tensor). Tensors allocated manually by a user still need to be freed manually as well, since the user has ownership.
In graph mode though, all tensors returned by a session.run() must be deallocated manually by the user, and neither closing the graph nor the session will close them, as explained here. We are already working on a solution so that users won't need to worry too much about it in the near future.
When TfNDArray.close() is called, the operand is set to null.
In eager mode, this should indeed do the job, as the GC will eventually free the constant. The leak is probably elsewhere then. But note that one drawback of relying on the GC to free up the resources is that it does not know the size of the native resources being allocated. Therefore, while the JVM might think that there is plenty of memory available and that a GC cycle can wait, you might be running low, especially if you are keeping hold of very large tensors/constants for a long time. Again, the safest route is to close the eager session as soon as you can, but I'm not sure how that's possible in your current architecture; I'd have to take a deeper look at it to understand clearly how it works.
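As a concrete illustration of "close the eager session as soon as you can", a minimal sketch assuming the TF Java 0.3.x eager API (the divide-by-255 preprocessing step is just a placeholder):

```java
import org.tensorflow.EagerSession;
import org.tensorflow.Operand;
import org.tensorflow.op.Ops;
import org.tensorflow.types.TFloat32;

public final class EagerScopeSketch {

    // Keep eager work (constants, preprocessing ops) inside a short-lived session,
    // so closing the session frees the native memory deterministically instead of
    // waiting for the GC to notice it.
    public static void preprocess(float[] pixels) {
        try (EagerSession eager = EagerSession.create()) {
            Ops tf = Ops.create(eager);
            Operand<TFloat32> input = tf.constant(pixels);
            Operand<TFloat32> scaled = tf.math.div(input, tf.constant(255.0f));
            // ... copy the result out (or convert it to a Tensor) before leaving ...
        } // everything the eager session allocated implicitly is released here
    }
}
```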
Thanks @karllessard for those pointers. We will try to close tensors for both EagerSession/Session ASAP.
My understanding is that when we create a Tensor from an EagerSession or Session, the pointer to the Tensor is attached to the EagerSession/Session. Now with the new fix, the Tensor is held with a WeakReference and therefore can be released by GC. How does that result in memory leaks? When we close the EagerSession/Session, shouldn't it free all attached Tensors?
The current implementation is incorrect, but they (the rest of the SIG JVM) are not concerned about this issue. Please make sure that SIG JVM knows that you would like this to be fixed in TF Java. I will keep trying to make this happen as part of SIG JVM, but I can't promise anything, so any help is welcome.
In eager mode, this should indeed do the job, as the GC will eventually free the constant. The leak is probably elsewhere then. But note that one drawback of relying on the GC to free up the resources is that it does not know the size of the native resources being allocated. Therefore, while the JVM might think that there is plenty of memory available and that a GC cycle can wait, you might be running low, especially if you are keeping hold of very large tensors/constants for a long time. Again, the safest route is to close the eager session as soon as you can, but I'm not sure how that's possible in your current architecture; I'd have to take a deeper look at it to understand clearly how it works.
That's not true. I've explained this many times already, so I won't try to reexplain this here, but with the way you've "fixed" this, leaks will still happen, unless of course everything is done manually.
Hi @stu1130 ,
I was experimenting with DJL's benchmark this evening, and while I cannot draw conclusions yet on the TF side, I've noticed that the benchmark collects multiple metrics on each inference, which pile up in a list for the whole duration of the test and number in the millions after a few hours. While instances of Metric seem quite small in size, it could be interesting to see if you still observe an OOM on a 256 MB JVM when disabling the metric collection, as it seems possible to do here, to make sure this constantly growing list does not interfere with the experiment. What do you think?
To clarify what I wrote above, the "issue" I'm referring to isn't the one in this thread about memory leaks in DJL, but another one concerning the incorrect use of WeakReference that @stu1130 inquired about. I think everyone is on the same page about fixing memory leaks in general, and right now we can avoid them by making sure that everything gets deallocated manually. Even if we started using WeakReference correctly somehow, it is my opinion that relying on GC is a bad idea anyway, and this is where my opinion diverges from the other members of SIG JVM. We all agree that calling things like Tensor.close() should work though, so if there's a bug there, we'll fix that.
When running the benchmark above I get a bunch of warnings out of JavaCPP. @saudet any idea what's causing that?
Warning: Could not load Loader: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
Warning: Could not load Pointer: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
Warning: Could not load BytePointer: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
[INFO ] - Load library 2.4.1 in 296.217 ms.
[INFO ] - Running MultithreadedBenchmark on: cpu().
Loading: 100% |████████████████████████████████████████|
Warning: Could not load PointerPointer: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
[INFO ] - Model tf_resnet loaded in: 2050.987 ms.
[INFO ] - Multithreaded inference with 10 threads.
Warning: Could not load IntPointer: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
I don't usually get this out of TF Java when we run it separately, and I know that DJL uses its own library loader which partially delegates to ours.
So here are some interesting results to share on my side. As I suggested in my previous post, I explicitly disabled the collection of DJL metrics inside the benchmark and was able to complete the task successfully in 18 hours on my Ubuntu laptop with only 128 MB of JVM memory, without an OOM (GC was enabled).
Now that does not necessarily mean the DJL metrics were the only thing leaking; maybe it just gave a bit more room for the others, so it would be interesting to run the test with another scenario. @skirdey, didn't you have another application you wanted to test version 0.3.0 with?
BTW, there is a new leak that has been introduced in 0.3.0 when upgrading to TF 2.4.1, but it is only related to String tensors (which were not used in the current benchmark afaik). Please just make sure you don't use String tensors in your test applications until we push a fix in the upcoming 0.3.1 version.
Hey @karllessard, thanks for sharing the findings. Did you have nopointergc set to true or false when running the benchmark?
@karllessard Thanks for the experiment. Yeah, we did add lots of metrics objects, so that could be one of the root causes of the OOM. But our goal is long-running inference with GC disabled to get the best performance. I am doing a large refactoring of our TF engine and will try to close the native resources ASAP.
I ran the benchmark a few times on an unloaded server (using 2x Broadwell Xeons, running OL 7), using 10 threads for 500 iterations (as any more made the nopointergc version go pop). I ran each one three times to see if the scores were consistent, which they were (around 1 point either side for the throughput number), and I've pasted the middle run for each version below. There doesn't seem to be much difference between the two.
No pointer GC:
$ ./gradlew benchmark -Dai.djl.default_engine=TensorFlow -Dai.djl.repository.zoo.location=https://storage.googleapis.com/tfhub-modules/tensorflow/resnet_50/classification/1.tar.gz?artifact_id=tf_resnet --args='-t 10 -n tf_resnet -c 500 -s 1,224,224,3 -o logs' -Dorg.bytedeco.javacpp.nopointergc=true
> Task :examples:benchmark
Warning: Could not load Loader: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
Warning: Could not load Pointer: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
Warning: Could not load BytePointer: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
[INFO ] - Load library 2.4.1 in 300.334 ms.
[INFO ] - Running MultithreadedBenchmark on: cpu().
Loading: 100% |████████████████████████████████████████|
Warning: Could not load PointerPointer: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
[INFO ] - Model tf_resnet loaded in: 2375.419 ms.
[INFO ] - Multithreaded inference with 10 threads.
Warning: Could not load IntPointer: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
[INFO ] - Completed 1000 requests
[INFO ] - Completed 2000 requests
[INFO ] - Completed 3000 requests
[INFO ] - Completed 4000 requests
[INFO ] - Completed 5000 requests
[INFO ] - Inference result: [-5.5703076E-37, -8.1853876E14, -1.0163006E-4 ...]
[INFO ] - Throughput: 46.58, completed 5000 iteration in 107338 ms.
[INFO ] - Model loading time: 2375.419 ms.
[INFO ] - total P50: 164.165 ms, P90: 226.445 ms, P99: 308.077 ms
[INFO ] - inference P50: 161.594 ms, P90: 222.065 ms, P99: 305.454 ms
[INFO ] - preprocess P50: 1.815 ms, P90: 4.437 ms, P99: 26.296 ms
[INFO ] - postprocess P50: 0.049 ms, P90: 0.092 ms, P99: 0.590 ms
Deprecated Gradle features were used in this build, making it incompatible with Gradle 7.0.
Use '--warning-mode all' to show the individual deprecation warnings.
See https://docs.gradle.org/6.7.1/userguide/command_line_interface.html#sec:command_line_warnings
BUILD SUCCESSFUL in 1m 52s
20 actionable tasks: 1 executed, 19 up-to-date
Pointer GC:
$ ./gradlew benchmark -Dai.djl.default_engine=TensorFlow -Dai.djl.repository.zoo.location=https://storage.googleapis.com/tfhub-modules/tensorflow/resnet_50/classification/1.tar.gz?artifact_id=tf_resnet --args='-t 10 -n tf_resnet -c 500 -s 1,224,224,3 -o logs'
> Task :examples:benchmark
Warning: Could not load Loader: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
Warning: Could not load Pointer: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
Warning: Could not load BytePointer: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
[INFO ] - Load library 2.4.1 in 333.022 ms.
[INFO ] - Running MultithreadedBenchmark on: cpu().
Loading: 100% |████████████████████████████████████████|
Warning: Could not load PointerPointer: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
[INFO ] - Model tf_resnet loaded in: 2376.436 ms.
[INFO ] - Multithreaded inference with 10 threads.
Warning: Could not load IntPointer: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
[INFO ] - Completed 1000 requests
[INFO ] - Completed 2000 requests
[INFO ] - Completed 3000 requests
[INFO ] - Completed 4000 requests
[INFO ] - Completed 5000 requests
[INFO ] - Inference result: [-5.5703076E-37, -8.1853876E14, -1.0163006E-4 ...]
[INFO ] - Throughput: 45.93, completed 5000 iteration in 108853 ms.
[INFO ] - Model loading time: 2376.436 ms.
[INFO ] - total P50: 161.868 ms, P90: 219.891 ms, P99: 295.199 ms
[INFO ] - inference P50: 159.389 ms, P90: 216.400 ms, P99: 292.355 ms
[INFO ] - preprocess P50: 1.756 ms, P90: 4.570 ms, P99: 29.588 ms
[INFO ] - postprocess P50: 0.047 ms, P90: 0.087 ms, P99: 0.545 ms
Deprecated Gradle features were used in this build, making it incompatible with Gradle 7.0.
Use '--warning-mode all' to show the individual deprecation warnings.
See https://docs.gradle.org/6.7.1/userguide/command_line_interface.html#sec:command_line_warnings
BUILD SUCCESSFUL in 1m 53s
20 actionable tasks: 1 executed, 19 up-to-date
The 1% difference doesn't seem to be significant given the amount of variance I saw (throughput for GC: 48.69, 44.48, 45.93, throughput for no GC: 47.38, 46.58, 45.30). Do you have a benchmark which does show a difference?
./gradlew benchmark -Dorg.bytedeco.javacpp.nopointergc=true -Dai.djl.default_engine=TensorFlow -Dai.djl.repository.zoo.location="https://storage.googleapis.com/tfhub-modules/tensorflow/resnet_50/classification/1.tar.gz?artifact_id=tf_resnet" --args='-n tf_resnet -t 10 -c 1000000 -s 1,224,224,3'
OOMs for me in a few minutes on my laptop
I think we'd recommend you leave the GC on unless it noticeably slows things down. I was wondering if there is a benchmark which shows a slowdown with the GC on.
@skirdey, nopointergc must not be set to true. Also, you might need to comment out this line in the benchmark as I did, since the metrics collected by DJL during the benchmark end up taking a lot of space; please read my previous comment about that.
Meanwhile I'll look to run a different test, thanks
For my use case, I can't really have a blocking thread listening for GC; I run Scala / Akka on my end and it somehow affects the performance of the entire ecosystem.
Do you have this problem with Java 9 cleaners too? They are increasingly used inside the JVM to deal with resource cleanup and it's basically the same idiom.
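For context, the Cleaner idiom being referred to looks roughly like this (a sketch; NativeBuffer and free() are hypothetical, not a real API):

```java
import java.lang.ref.Cleaner;

public final class CleanerSketch {
    private static final Cleaner CLEANER = Cleaner.create();

    // Hypothetical wrapper around a native allocation, cleaned up either explicitly
    // via close() or eventually by the GC through the registered cleaning action.
    static final class NativeBuffer implements AutoCloseable {
        private final Cleaner.Cleanable cleanable;

        NativeBuffer(long address) {
            // The action must not capture `this`, or the buffer would never become unreachable.
            cleanable = CLEANER.register(this, () -> free(address));
        }

        @Override
        public void close() {
            cleanable.clean(); // deterministic release; the GC path is only a safety net
        }

        private static void free(long address) {
            // hypothetical native free(address)
        }
    }
}
```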
Still on Java 8 :/
I'll try out the new fix with nopointergc=false and report back.
Thanks @skirdey , please don't forget to disable DJL's metrics since they can grow drastically during the benchmark and cause an OOM if you are running on a low-memory JVM.
@saudet I am testing against my new refactoring that depends purely on the JavaCPP layer. What we do now is disable GC, track every native resource, and close them as soon as we don't need them. But I found there is a small incremental leak of native memory. Is there a way to find it? I tried the command you showed in another issue but it didn't print anything. Does it only work on Pointers attached to a PointerScope, or does it work on every pointer we create?
./gradlew benchmark -Dorg.bytedeco.javacpp.nopointergc=true -Dai.djl.default_engine=TensorFlow -Dai.djl.repository.zoo.location="https://storage.googleapis.com/tfhub-modules/tensorflow/resnet_50/classification/1.tar.gz?artifact_id=tf_resnet" --args='-n tf_resnet -c 10000 -s 1,224,224,3' -Dorg.bytedeco.javacpp.logger.debug -DargLine=-Xmx200m 2>&1 | grep Collecting | grep -v 'ownerAddress=0x0'
I found that we don't release the TF_Output pointer in Session.run(). Could that be the reason?
When running the benchmark above I get a bunch of warnings out of JavaCPP. @saudet any idea what's causing that?
@Craigacp It just means that the jnijavacpp library itself is missing, but as in https://github.com/tensorflow/java/issues/189#issuecomment-760594392, we can ignore those warnings, unless we get some other loading error, that is...
@saudet I am testing against my new refactoring that depends purely on the JavaCPP layer. What we do now is disable GC, track every native resource, and close them as soon as we don't need them. But I found there is a small incremental leak of native memory. Is there a way to find it? I tried the command you showed in another issue but it didn't print anything. Does it only work on Pointers attached to a PointerScope, or does it work on every pointer we create?
./gradlew benchmark -Dorg.bytedeco.javacpp.nopointergc=true -Dai.djl.default_engine=TensorFlow -Dai.djl.repository.zoo.location="https://storage.googleapis.com/tfhub-modules/tensorflow/resnet_50/classification/1.tar.gz?artifact_id=tf_resnet" --args='-n tf_resnet -c 10000 -s 1,224,224,3' -Dorg.bytedeco.javacpp.logger.debug -DargLine=-Xmx200m 2>&1 | grep Collecting | grep -v 'ownerAddress=0x0'
I found that we don't release the TF_Output pointer in Session.run(). Could that be the reason?
When "org.bytedeco.javacpp.nopointergc" is true, JavaCPP won't be trying to use GC, so we're not going to get any "Collecting ..." lines in the debug log. You'll need to enable GC temporarily to see if it can pick up something that way. It won't be able to track anything that isn't registered with it though, so if you're still not seeing anything with GC enabled, I would say it's something it doesn't know about that is leaking. However, native arrays such as the one that you're linking to for TF_Output
get allocated with the new
operator from C++, and JavaCPP automatically registers deallocators that call their destructors using the delete
operator for those, so if they were "leaking", it would most likely show up in JavaCPP's debug log. Also, it's created inside a PointerScope
, so it's almost certainly getting deallocated there. To make sure, you can check that outputs.isNull()
is true after leaving the scope.
As for detecting memory leaks in native code in general, I find that Valgrind works the least badly with the JVM: https://github.com/deeplearning4j/libnd4j/wiki/Debugging-libnd4j Nevertheless, with a bit more hacking, it's also possible to use AddressSanitizer: http://btorpey.github.io/blog/2014/03/27/using-clangs-address-sanitizer/
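For reference, a minimal sketch of that isNull() check in plain JavaCPP (no TF involved): pointers created while a PointerScope is open are attached to it and deallocated when it closes.

```java
import org.bytedeco.javacpp.BytePointer;
import org.bytedeco.javacpp.PointerScope;

public final class ScopeCheckSketch {
    public static void main(String[] args) {
        BytePointer data;
        try (PointerScope scope = new PointerScope()) {
            data = new BytePointer(1024); // attached to the enclosing scope on creation
        }                                 // scope.close() deallocates everything it attached
        System.out.println(data.isNull()); // prints "true" once the scope has freed it
        // Running with -Dorg.bytedeco.javacpp.logger.debug prints the debug lines,
        // including the "Collecting ..." ones the grep above filters for.
    }
}
```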
So here are some interesting results to share on my side. As I suggested in my previous post, I explicitly disabled the collection of DJL metrics inside the benchmark and was able to complete the task successfully in 18 hours on my Ubuntu laptop with only 128 MB of JVM memory, without an OOM (GC was enabled).
Now that does not necessarily mean the DJL metrics were the only thing leaking; maybe it just gave a bit more room for the others, so it would be interesting to run the test with another scenario. @skirdey, didn't you have another application you wanted to test version 0.3.0 with?
BTW, there is a new leak that has been introduced in 0.3.0 when upgrading to TF 2.4.1, but it is only related to String tensors (which were not used in the current benchmark afaik). Please just make sure you don't use String tensors in your test applications until we push a fix in the upcoming 0.3.1 version.
I did the same test again (GC enabled && setMetrics commented out && jvmArgs = ["-Xmx128", "-Xmx128m"]) but unfortunately got an OOM. Command:
./gradlew benchmark -Dai.djl.default_engine=TensorFlow -Dai.djl.repository.zoo.location=https://storage.googleapis.com/tfhub-modules/tensorflow/resnet_50/classification/1.tar.gz?artifact_id=tf_resnet --args='-t 10 -n tf_resnet -c 100000 -s 1,224,224,3'
[ERROR] -
java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:?]
at java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:?]
at ai.djl.examples.inference.benchmark.MultithreadedBenchmark.predict(MultithreadedBenchmark.java:86) [main/:?]
at ai.djl.examples.inference.benchmark.util.AbstractBenchmark.runBenchmark(AbstractBenchmark.java:132) [main/:?]
at ai.djl.examples.inference.benchmark.Benchmark.main(Benchmark.java:31) [main/:?]
Caused by: java.lang.OutOfMemoryError: Java heap space
[ERROR] - Only 0/10 threads finished.
> Task :examples:benchmark FAILED
FAILURE: Build failed with an exception.
* What went wrong:
Execution failed for task ':examples:benchmark'.
> Process 'command '/usr/lib/jvm/java-12-openjdk-amd64/bin/java'' finished with non-zero exit value 255
* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full
insights.
* Get more help at https://help.gradle.org
Deprecated Gradle features were used in this build, making it incompatible with Gradle 7.0.
Use '--warning-mode all' to show the individual deprecation warnings.
See https://docs.gradle.org/6.7.1/userguide/command_line_interface.html#sec:command_line_warnings
BUILD FAILED in 2h 1m 59s
Maybe I am using a more powerful CPU machine (c5.2xlarge), so small objects like Pointer accumulate faster and reach OOM sooner.
I did the same test again (GC enabled && setMetrics commented out && jvmArgs = ["-Xmx128", "-Xmx128m"])
This is strange, indeed, and not easy to debug if I cannot reproduce it on my side. @Craigacp , do you still have access to these fast machines so we can run the same test on them?
BTW, TF Java 0.3.1 has been released this morning, which not only fixes a leak with the new string tensor but also potentially another leak (never observed so far) that could happen when closing multiple graph sessions. It might be better to work with this version from now on.
@stu1130 128 MB is not a lot of memory for the JVM itself + a couple of things from TF and DJL. Make sure to try with something a bit larger like at least 256 or 512 MB as well.
@saudet I tested my new refactoring (1 GB JVM size && nopointergc=true) and it went OOM after 2 hours. I used jconsole and found that native non-heap memory is pretty stable, but the old gen kept growing until the OOM exception, with several old gen GC attempts. I will share my experiment with nopointergc=false (GC enabled) as a baseline.
@saudet Experiment 1:
./gradlew benchmark -Dai.djl.default_engine=TensorFlow -Dai.djl.repository.zoo.location=https://storage.googleapis.com/tfhub-modules/tensorflow/resnet_50/classification/1.tar.gz?artifact_id=tf_resnet --args='-t 10 -n tf_resnet -c 100000 -s 1,224,224,3' -Dorg.bytedeco.javacpp.nopointergc=true
Here is the JVM dump from 3-4 minutes before the OOM. I tried hitting the Perform GC button several times.
It looks like DeallocatorReference, NativeDeallocator, and AtomicInteger are the root cause of the leak.
Experiment 2:
./gradlew benchmark -Dai.djl.default_engine=TensorFlow -Dai.djl.repository.zoo.location=https://storage.googleapis.com/tfhub-modules/tensorflow/resnet_50/classification/1.tar.gz?artifact_id=tf_resnet --args='-t 10 -n tf_resnet -c 1000000 -s 1,224,224,3' -Dorg.bytedeco.javacpp.nopointergc=false
Here is the jconsole snapshot
Here is the heap dump
It turns out the majority of the memory usage is from Metric objects. So with GC enabled, the system is able to release the Pointer objects. @karllessard it also proves that removing the metrics helps a lot.
@saudet Is there a way to allow GC to release those pointer objects while -Dorg.bytedeco.javacpp.nopointergc=true?
I think I found the problem. When nopointergc is true, the deallocators keep getting appended like a LinkedList, with pointer.deallocator(new DeleteDeallocator(o)) or pointer.deallocator(new Pointer.NativeDeallocator(this, ownerAddress, deallocatorAddress)) if the pointer is a PointerPointer or IntPointer, but remove never gets called. Each node in the LinkedList holds a strong reference to the next node, causing the memory leak.
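To make the failure mode concrete, here is a simplified, hypothetical sketch of that pattern; these are not JavaCPP's actual classes, just an illustration of an intrusive linked list whose nodes are never removed:

```java
// Hypothetical illustration only -- not JavaCPP code. Every registration appends a node,
// and because each node strongly references the next, the whole chain stays reachable
// from the static head until remove() is called for it.
final class RefNode {
    static RefNode head;                // strongly reachable for the lifetime of the JVM
    RefNode prev, next;
    final Runnable deallocator;

    RefNode(Runnable deallocator) {
        this.deallocator = deallocator;
    }

    static synchronized void add(RefNode node) {
        node.next = head;
        if (head != null) {
            head.prev = node;
        }
        head = node;                    // appended on every allocation
    }

    static synchronized void remove(RefNode node) {
        if (node.prev != null) node.prev.next = node.next;
        if (node.next != null) node.next.prev = node.prev;
        if (head == node) head = node.next;
        node.prev = node.next = null;   // if this is never called, nodes pile up -> leak
    }
}
```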
Yes, I've noticed as well. I fixed this in commit https://github.com/bytedeco/javacpp/commit/7f27899578dfa18e22738a3dd49701e1806b464a along with proper testing for that case. Sorry about that. In practice though, there is little to be gained by disabling GC entirely. As long as we make sure to deallocate everything explicitly, we're not going to lose any performance.
Incidentally, this gives us a benchmark about the kind of performance we can expect with and without relying on GC. On my machine (Fedora 32), PointerTest#testDeallocator produces these kinds of numbers:
(Windows gives me similar but slightly worse results.) The time for the latter explicit case doesn't change regardless of whether "org.bytedeco.javacpp.nopointergc" is true or not. Clearly, the problem isn't only with GC itself, but also with allocating native memory in one thread, and deallocating it in another thread...
@saudet awesome! I was testing my new code with an explicit deallocator call for each pointer once I am done with it. Looks like I don't need to create that PR! So can I now rely entirely on PointerScope without any explicit deallocate() call, or do you still recommend adding a deallocate call at the end of the PointerScope? The reason why we would like nopointergc=true is not only about GC, but also that we don't want the blocking deallocator thread that calls System.gc(). But thanks for that experiment; we are clearer now about which direction is right. Let me know when the 1.5.6-SNAPSHOT is out and when you are going to release 1.5.6 with that fix! Thanks!
PointerScope.close() ends up calling Pointer.deallocate() under the hood, so it's the same as calling it directly. In other words, it's explicit deallocation that doesn't rely on GC. About System.gc(), another way to prevent calls to that is by setting maxBytes to 0, but that's already being done in DJL by default: https://github.com/frankfliu/djl/blob/master/tensorflow/tensorflow-engine/src/main/java/ai/djl/tensorflow/engine/LibUtils.java#L52 So it should never get called anyway and you don't need to have a new version of JavaCPP for that. Are you concerned about anything else though?
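A small sketch of that setup, under the assumption that the relevant knob is the org.bytedeco.javacpp.maxbytes property (0 disables the limit tracking that can trigger System.gc()) combined with explicit deallocation through PointerScope:

```java
import org.bytedeco.javacpp.BytePointer;
import org.bytedeco.javacpp.PointerScope;

public final class NoGcSketch {
    public static void main(String[] args) {
        // Equivalent to -Dorg.bytedeco.javacpp.maxbytes=0; must be set before any
        // JavaCPP Pointer class is loaded (DJL already does this by default, see above).
        System.setProperty("org.bytedeco.javacpp.maxbytes", "0");

        try (PointerScope scope = new PointerScope()) {
            BytePointer buffer = new BytePointer(4096);
            // ... use buffer ...
        } // scope.close() calls Pointer.deallocate() on everything it attached,
          // i.e. the same explicit deallocation, no GC involved
    }
}
```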
That's all I need. Thanks. I checked javacpp 1.6.0-SNAPSHOT but found the fix https://github.com/bytedeco/javacpp/commit/7f27899578dfa18e22738a3dd49701e1806b464a is not included yet. Let me know when it is available. I would like to try it out!
It's in 1.5.6-SNAPSHOT: https://github.com/bytedeco/javacpp/blob/7f27899578dfa18e22738a3dd49701e1806b464a/pom.xml#L7
1.6.0-SNAPSHOT is something else that's probably not going to happen.
@saudet So far I don't see DeallocatorReference & NativeReference accumulating any more with the new fix! One more question: what is the life cycle of the "Pointer" returned by TF_TensorData(TF_Tensor)? Should we call close() or deallocate()? I found that it doesn't call setNull(), so isNull() is always false even if I close the TF_Tensor.
Accumulating DeallocatorReference & NativeReference is what I also saw at the root of the thread, so it is awesome that it is fixed! I don't think there were any other issues when running DJL / TF / nopointergc=true in prod, but I need to try running inference again at scale.
Description
Possible memory leak in multi-threaded inference using TensorFlow with org.bytedeco.javacpp.nopointergc=true
CPU inference.
Expected Behavior
Garbage collection removing objects from Old Generation
Error Message
Java OOM
How to Reproduce?
Run multi-threaded inference for 30 minutes with -Dorg.bytedeco.javacpp.nopointergc=true so you don't have the JavaCPP Deallocator blocking thread.
./gradlew benchmark -Dorg.bytedeco.javacpp.nopointergc=true -Dai.djl.default_engine=TensorFlow -Dai.djl.repository.zoo.location="https://storage.googleapis.com/tfhub-modules/tensorflow/resnet_50/classification/1.tar.gz?artifact_id=tf_resnet" --args='-n tf_resnet -t 10 -c 1000000 -s 1,224,224,3'