deepjavalibrary / djl

An Engine-Agnostic Deep Learning Framework in Java
https://djl.ai
Apache License 2.0

Possible memory leak in multi-threaded inference using TensorFlow with org.bytedeco.javacpp.nopointergc=true (CPU) #690

Closed skirdey closed 2 years ago

skirdey commented 3 years ago

Description

Possible memory leak in multi-threaded inference using TensorFlow with org.bytedeco.javacpp.nopointergc=true.

CPU inference.

Expected Behavior

Garbage collection removing objects from Old Generation

Error Message

Java OOM

How to Reproduce?

Run multi-threaded inference for 30 minutes with -Dorg.bytedeco.javacpp.nopointergc=true so you don't have the JavaCPP Deallocator thread blocking.

./gradlew benchmark -Dorg.bytedeco.javacpp.nopointergc=true -Dai.djl.default_engine=TensorFlow -Dai.djl.repository.zoo.location="https://storage.googleapis.com/tfhub-modules/tensorflow/resnet_50/classification/1.tar.gz?artifact_id=tf_resnet" --args='-n tf_resnet -t 10 -c 1000000 -s 1,224,224,3'

skirdey commented 3 years ago

I have a high-level overview of what happens, captured via YourKit: at some point (>30 min of continuous predictions), there is a spike in GC collections

[screenshot: GC collection spike]

and pause time:

[screenshot: GC pause times]

Also, the old generation takes up most of the heap, leaving nothing for new allocations.

[screenshot: heap usage by generation]

Here is a memory snapshot after several forced GCs:

[screenshot: memory snapshot]

skirdey commented 3 years ago

The same benchmark doing multi-threaded inference using PyTorch has a minimal memory footprint and no GC pauses: ./gradlew benchmark -Dai.djl.default_engine=PyTorch -Dai.djl.repository.zoo.location="https://alpha-djl-demos.s3.amazonaws.com/model/djl-blockrunner/pytorch_resnet18.zip?model_name=traced_resnet18" --args='-c 100000 -d 1440 -t 10 -s 1,3,224,224'

[screenshot: memory profile of the PyTorch run]

karllessard commented 3 years ago

Please see https://github.com/tensorflow/java/pull/229 for a potential fix for this problem in the TF Java library.

skirdey commented 3 years ago

@karllessard is it possible to release TF Java 0.2.1 with the hotfix? It doesn't look like a breaking change, so a patch would be nice.

DJL uses release versions of TF Java.

saudet commented 3 years ago

@skirdey 0.3.0-SNAPSHOT is pretty much up now, so please give that a try, but like I tried to explain on the pull request, I don't believe this will fix any issues for DJL.

karllessard commented 3 years ago

Good news @skirdey! Important build issues that were preventing us from releasing 0.3.0 are now fixed. I will provide more details a bit later, but you can also follow this thread to see the progress.

skirdey commented 3 years ago

The fix could be useful when I do not set nopointergc=true, as it was also leaking memory; it just took 70 hours to catch it.

saudet commented 3 years ago

That doesn't fix any leak. If there's a leak, it's somewhere else.

stu1130 commented 3 years ago

The DJL team can publish 0.11.0-SNAPSHOT depending on 0.3.0-SNAPSHOT. I am on it.

karllessard commented 3 years ago

We just had our community meeting, and I'm targeting a 0.3.0 release by March 15th with the current snapshot. The only caveat is that the Windows MKL+GPU platform won't be available for that release (at least not by that date); all other platforms will be.

karllessard commented 3 years ago

Hi, just to let you know that 0.3.0 is now available on Maven Central. Please let us know how it goes with your experiments on this issue, thank you

skirdey commented 3 years ago

We tried it out and the memory still leaks: very fast when nopointergc=true, and it takes 5 hours to halt when nopointergc=false. @karllessard @saudet

stu1130 commented 3 years ago

As @skirdey mentioned, here is my setup: a c5.2xlarge instance with jvmArgs -Xmx256m (to make it easier to hit OOM). Command:

./gradlew benchmark -Dai.djl.default_engine=TensorFlow -Dai.djl.repository.zoo.location=https://storage.googleapis.com/tfhub-modules/tensorflow/resnet_50/classification/1.tar.gz?artifact_id=tf_resnet --args='-t 10 -n tf_resnet -c 100000 -s 1,224,224,3' -Dorg.bytedeco.javacpp.nopointergc=true

With -Dorg.bytedeco.javacpp.nopointergc=true it took 2m 31s; with -Dorg.bytedeco.javacpp.nopointergc=false it took 5h 11m 6s.

As the next step, we are going to integrate TfNDManager with PointerScope, as @saudet suggested.
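
For reference, here is a minimal sketch of the PointerScope pattern being referred to (plain JavaCPP; the TfNDManager integration itself is not shown, and the buffer is just a stand-in for a native resource):

import org.bytedeco.javacpp.BytePointer;
import org.bytedeco.javacpp.PointerScope;

public class PointerScopeSketch {
    public static void main(String[] args) {
        // Pointers created while a PointerScope is open on the current thread are
        // attached to it and released deterministically when the scope closes,
        // without waiting for the garbage collector.
        try (PointerScope scope = new PointerScope()) {
            BytePointer buffer = new BytePointer(1024 * 1024); // 1 MB of native memory
            // ... fill the buffer, run inference, etc. ...
        } // every pointer attached to the scope is deallocated here
    }
}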

saudet commented 3 years ago

@stu1130 Can you point us to the code in DJL for that benchmark?

stu1130 commented 3 years ago

@stu1130 Can you point us to the code in DJL for that benchmark?

https://github.com/awslabs/djl/blob/master/examples/src/main/java/ai/djl/examples/inference/benchmark/MultithreadedBenchmark.java

saudet commented 3 years ago

I mean, what does it end up calling in TF Java? For example, @roywei says that

Yes, we closed the input tensor after TfNDArray creation, only keep the constant operand for future eager operations(usually preprocess before prediction). The preprocess output will be converted back to Tensor and feed into Classic Session(loaded from SavedModel) for inference. We use try catch blocks around NDManagers similar to using eager sessions.

It sounds like there are dangling references to "constant operands" that are not closed with their sessions. Why is that? How do you make sure they get deallocated?

saudet commented 3 years ago

Another thing you need to be careful about is that, currently, all output Tensor objects are not closed when a Session is closed, but you need to make sure that all output tensors are closed before the Session is closed. If you do not close all output Tensor before closing their Session, even if they are closed later on, that may still result in memory leaks. It's a bit weird, and I hope the people from TF Java realize that it's not a great way to do things, but I haven't been able to convince them, yet. For now, we need to manage all that manually.
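
As a rough illustration of that ordering (a sketch against the TF Java 0.3.x API discussed in this thread; the operation names are placeholders, and this is not DJL's actual code):

import java.util.List;
import org.tensorflow.Session;
import org.tensorflow.Tensor;

public final class CloseOrderSketch {
    // Runs one inference and closes every output tensor before the caller closes the Session.
    static void runOnce(Session session, Tensor<?> input) {
        List<Tensor<?>> outputs = session.runner()
                .feed("serving_default_input", input)   // placeholder operation name
                .fetch("StatefulPartitionedCall")       // placeholder operation name
                .run();
        try {
            // ... read the results here ...
        } finally {
            for (Tensor<?> t : outputs) {
                t.close(); // close all outputs BEFORE the Session is closed
            }
        }
    }
}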

stu1130 commented 3 years ago

I mean, what does it end up calling in TF Java? For example, @roywei says that

Yes, we closed the input tensor after TfNDArray creation, only keep the constant operand for future eager operations(usually preprocess before prediction). The preprocess output will be converted back to Tensor and feed into Classic Session(loaded from SavedModel) for inference. We use try catch blocks around NDManagers similar to using eager sessions.

It sounds like there are dangling references to "constant operands" that are not closed with their sessions. Why is that? How do you make sure they get deallocated?

When TfNDArray.close(), the operand is set to null.

Another thing you need to be careful about is that, currently, all output Tensor objects are not closed when a Session is closed, but you need to make sure that all output tensors are closed before the Session is closed. If you do not close all output Tensor before closing their Session, even if they are closed later on, that may still result in memory leaks. It's a bit weird, and I hope the people from TF Java realize that it's not a great way to do things, but I haven't been able to convince them, yet. For now, we need to manage all that manually.

My understanding is that when we create a Tensor from an EagerSession or Session, the pointer to the Tensor is attached to the EagerSession/Session. Now, with the new fix, the Tensor is held with a WeakReference and therefore can be released by GC. How does that result in memory leaks? When we close the EagerSession/Session, shouldn't it free all attached Tensors?

karllessard commented 3 years ago

The behavior is not the same in eager or graph mode.

In eager mode, it is true that the lifetime of a Tensor is attached to an eager session and closing this session will release all tensors it has implicitly allocated (e.g. when converting operations results to a tensor). Tensors allocated manually by a user still need to be freed manually as well since the user has the ownership.

In graph mode though, all tensors returned by a session.run() must be deallocated manually by the user; neither closing the graph nor the session will close them, as explained here. We are already working on a solution so that users won't need to worry too much about this in the near future.

When TfNDArray.close(), the operand is set to null.

In eager mode, this should indeed do the job, as the GC will eventually free the constant. The leak is probably elsewhere then. But note that one drawback of relying on the GC to free up resources is that it does not know the size of the native resources being allocated. Therefore, while the JVM might think that there is plenty of memory available and that a GC cycle can wait, you might be running low, especially if you are keeping hold of very large tensors/constants for a long time. Again, the safest route is to close the eager session as soon as you can, but I'm not sure how that's possible in your current architecture; I'd have to take a deeper look at it to understand clearly how it works.
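
As a rough sketch of the eager-mode lifecycle described above (TF Java 0.3.x API; illustrative only, not DJL's internals):

import org.tensorflow.EagerSession;
import org.tensorflow.op.Ops;
import org.tensorflow.op.core.Constant;
import org.tensorflow.types.TFloat32;

public final class EagerLifetimeSketch {
    public static void main(String[] args) {
        try (EagerSession session = EagerSession.create()) {
            Ops tf = Ops.create(session);
            Constant<TFloat32> c = tf.constant(42.0f); // implicitly owned by the eager session
            // ... use c for preprocessing ...
        } // closing the session releases tensors it allocated implicitly;
          // tensors allocated manually by the user must still be closed manually
    }
}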

stu1130 commented 3 years ago

The behavior is not the same in eager or graph mode.

In eager mode, it is true that the lifetime of a Tensor is attached to an eager session and closing this session will release all tensors it has implicitly allocated (e.g. when converting operations results to a tensor). Tensors allocated manually by a user still need to be freed manually as well since the user has the ownership.

In graph mode though, all tensors returned by a session.run() must be deallocated manually by the user; neither closing the graph nor the session will close them, as explained here. We are already working on a solution so that users won't need to worry too much about this in the near future.

When TfNDArray.close(), the operand is set to null.

In eager mode, this should indeed do the job, as the GC will eventually free the constant. The leak is probably elsewhere then. But note that one drawback of relying on the GC to free up resources is that it does not know the size of the native resources being allocated. Therefore, while the JVM might think that there is plenty of memory available and that a GC cycle can wait, you might be running low, especially if you are keeping hold of very large tensors/constants for a long time. Again, the safest route is to close the eager session as soon as you can, but I'm not sure how that's possible in your current architecture; I'd have to take a deeper look at it to understand clearly how it works.

Thanks @karllessard for those pointers. We will try to close tensors for both EagerSession/Session ASAP.

saudet commented 3 years ago

My understanding is that when we create a Tensor from an EagerSession or Session, the pointer to the Tensor is attached to the EagerSession/Session. Now, with the new fix, the Tensor is held with a WeakReference and therefore can be released by GC. How does that result in memory leaks? When we close the EagerSession/Session, shouldn't it free all attached Tensors?

The current implementation is incorrect, but they (the rest of the SIG JVM) are not concerned about this issue. Please make sure that SIG JVM knows that you would like this to be fixed in TF Java. I will keep trying to make this happen as part of SIG JVM, but I can't promise anything, so any help is welcome.

In eager mode, this should indeed do the job, as the GC will eventually free the constant. The leak is probably elsewhere then. But note that one drawback of relying on the GC to free up resources is that it does not know the size of the native resources being allocated. Therefore, while the JVM might think that there is plenty of memory available and that a GC cycle can wait, you might be running low, especially if you are keeping hold of very large tensors/constants for a long time. Again, the safest route is to close the eager session as soon as you can, but I'm not sure how that's possible in your current architecture; I'd have to take a deeper look at it to understand clearly how it works.

That's not true. I've explained this many times already, so I won't try to reexplain this here, but with the way you've "fixed" this, leaks will still happen, unless of course everything is done manually.

karllessard commented 3 years ago

Hi @stu1130 ,

I was experimenting with DJL's benchmark this evening, and while I cannot draw conclusions yet on the TF side, I've noticed that the benchmark collects multiple metrics on each inference, which pile up in a list for the whole duration of the test and number in the millions after a few hours. While instances of Metric seem quite small, it could be interesting to see if you still observe OOM on a 256M JVM when disabling the metric collection (as it seems possible to do here), to make sure this constantly-growing list does not interfere with the experiment. What do you think?
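
For illustration, a hedged sketch of a benchmark loop with metric collection disabled (the generic types and the model loading are assumed to be handled elsewhere; this is not the actual MultithreadedBenchmark code):

import ai.djl.inference.Predictor;
import ai.djl.metric.Metrics;
import ai.djl.repository.zoo.ZooModel;

public final class NoMetricsLoop {
    // Runs `iterations` predictions without retaining per-inference metrics.
    static <I, O> void run(ZooModel<I, O> model, I input, int iterations) throws Exception {
        try (Predictor<I, O> predictor = model.newPredictor()) {
            // predictor.setMetrics(new Metrics()); // enabling this makes every predict()
            //                                      // append Metric entries to an in-memory list
            for (int i = 0; i < iterations; i++) {
                predictor.predict(input);
            }
        }
    }
}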

saudet commented 3 years ago

My understanding is that when we create a Tensor from an EagerSession or Session, the pointer to the Tensor is attached to the EagerSession/Session. Now, with the new fix, the Tensor is held with a WeakReference and therefore can be released by GC. How does that result in memory leaks? When we close the EagerSession/Session, shouldn't it free all attached Tensors?

The current implementation is incorrect, but they (the rest of the SIG JVM) are not concerned about this issue. Please make sure that SIG JVM knows that you would like this to be fixed in TF Java. I will keep trying to make this happen as part of SIG JVM, but I can't promise anything, so any help is welcome.

To clarify what I wrote above, the "issue" I'm referring to isn't the one in this thread about memory leaks in DJL, but another one concerning the incorrect use of WeakReference that @stu1130 inquired about. I think everyone is on the same page about fixing memory leaks in general, and right now we can avoid them by making sure that everything gets deallocated manually. Even if we started using WeakReference correctly somehow, it is my opinion that relying on GC is a bad idea anyway, and this is where my opinion diverges with the other members of SIG JVM. We all agree that calling things like Tensor.close() should work though, so if there's a bug there, we'll fix that.

Craigacp commented 3 years ago

When running the benchmark above I get a bunch of warnings out of JavaCPP. @saudet any idea what's causing that?

Warning: Could not load Loader: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
Warning: Could not load Pointer: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
Warning: Could not load BytePointer: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
[INFO ] - Load library 2.4.1 in 296.217 ms.
[INFO ] - Running MultithreadedBenchmark on: cpu().
Loading:     100% |████████████████████████████████████████|
Warning: Could not load PointerPointer: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
[INFO ] - Model tf_resnet loaded in: 2050.987 ms.
[INFO ] - Multithreaded inference with 10 threads.
Warning: Could not load IntPointer: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib

I don't usually get this out of TF-Java when we run it separately, and I know that DJL uses its own library loader which partially delegates to ours.

karllessard commented 3 years ago

So here are some interesting results to share from my side. As I suggested in my previous post, I explicitly disabled the collection of DJL metrics inside the benchmark and was able to successfully complete the task in 18 hours on my Ubuntu laptop with only 128 MB of JVM memory, without OOM (GC was enabled).

Now that does not necessarily mean the DJL metrics were the only thing leaking; maybe it just gave a bit more room for others, so it would be interesting to do the test with another scenario. @skirdey, didn't you have another application you wanted to test version 0.3.0 with?

BTW, there is a new leak that was introduced in 0.3.0 when upgrading to TF 2.4.1, but it is only related to String tensors (which were not used in the current benchmark, AFAIK). Please just make sure you don't use String tensors in your test applications until we push a fix in the upcoming 0.3.1 version.

skirdey commented 3 years ago

Hey @karllessard, thanks for sharing the findings. Did you have nopointergc set to true or false when running the benchmark?

stu1130 commented 3 years ago

So here are some interesting results to share from my side. As I suggested in my previous post, I explicitly disabled the collection of DJL metrics inside the benchmark and was able to successfully complete the task in 18 hours on my Ubuntu laptop with only 128 MB of JVM memory, without OOM (GC was enabled).

Now that does not necessarily mean the DJL metrics were the only thing leaking; maybe it just gave a bit more room for others, so it would be interesting to do the test with another scenario. @skirdey, didn't you have another application you wanted to test version 0.3.0 with?

BTW, there is a new leak that was introduced in 0.3.0 when upgrading to TF 2.4.1, but it is only related to String tensors (which were not used in the current benchmark, AFAIK). Please just make sure you don't use String tensors in your test applications until we push a fix in the upcoming 0.3.1 version.

@karllessard Thanks for the experiment. Yeah, we did add lots of metrics objects, so that could be one of the root causes of the OOM. But our goal is long-running inference with GC disabled to get the best performance. I am doing a large refactoring of our TF engine and trying to close native resources ASAP.

Craigacp commented 3 years ago

I ran the benchmark a few times on an unloaded server (2x Broadwell Xeons, running OL 7), using 10 threads for 500 iterations (as any more made the nopointergc version go pop). I ran each configuration three times to see if the scores were consistent, which they were (around 1 point either side for the throughput number), and I've pasted the middle run for each version below. There doesn't seem to be much difference between the two.

No pointer GC:

$ ./gradlew benchmark -Dai.djl.default_engine=TensorFlow -Dai.djl.repository.zoo.location=https://storage.googleapis.com/tfhub-modules/tensorflow/resnet_50/classification/1.tar.gz?artifact_id=tf_resnet --args='-t 10 -n tf_resnet -c 500 -s 1,224,224,3 -o logs'  -Dorg.bytedeco.javacpp.nopointergc=true

> Task :examples:benchmark
Warning: Could not load Loader: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
Warning: Could not load Pointer: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
Warning: Could not load BytePointer: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
[INFO ] - Load library 2.4.1 in 300.334 ms.
[INFO ] - Running MultithreadedBenchmark on: cpu().
Loading:     100% |████████████████████████████████████████|
Warning: Could not load PointerPointer: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
[INFO ] - Model tf_resnet loaded in: 2375.419 ms.
[INFO ] - Multithreaded inference with 10 threads.
Warning: Could not load IntPointer: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
[INFO ] - Completed 1000 requests
[INFO ] - Completed 2000 requests
[INFO ] - Completed 3000 requests
[INFO ] - Completed 4000 requests
[INFO ] - Completed 5000 requests
[INFO ] - Inference result: [-5.5703076E-37, -8.1853876E14, -1.0163006E-4 ...]
[INFO ] - Throughput: 46.58, completed 5000 iteration in 107338 ms.
[INFO ] - Model loading time: 2375.419 ms.
[INFO ] - total P50: 164.165 ms, P90: 226.445 ms, P99: 308.077 ms
[INFO ] - inference P50: 161.594 ms, P90: 222.065 ms, P99: 305.454 ms
[INFO ] - preprocess P50: 1.815 ms, P90: 4.437 ms, P99: 26.296 ms
[INFO ] - postprocess P50: 0.049 ms, P90: 0.092 ms, P99: 0.590 ms

Deprecated Gradle features were used in this build, making it incompatible with Gradle 7.0.
Use '--warning-mode all' to show the individual deprecation warnings.
See https://docs.gradle.org/6.7.1/userguide/command_line_interface.html#sec:command_line_warnings

BUILD SUCCESSFUL in 1m 52s
20 actionable tasks: 1 executed, 19 up-to-date

Pointer GC:

$ ./gradlew benchmark -Dai.djl.default_engine=TensorFlow -Dai.djl.repository.zoo.location=https://storage.googleapis.com/tfhub-modules/tensorflow/resnet_50/classification/1.tar.gz?artifact_id=tf_resnet --args='-t 10 -n tf_resnet -c 500 -s 1,224,224,3 -o logs'

> Task :examples:benchmark
Warning: Could not load Loader: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
Warning: Could not load Pointer: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
Warning: Could not load BytePointer: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
[INFO ] - Load library 2.4.1 in 333.022 ms.
[INFO ] - Running MultithreadedBenchmark on: cpu().
Loading:     100% |████████████████████████████████████████|
Warning: Could not load PointerPointer: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
[INFO ] - Model tf_resnet loaded in: 2376.436 ms.
[INFO ] - Multithreaded inference with 10 threads.
Warning: Could not load IntPointer: java.lang.UnsatisfiedLinkError: no jnijavacpp in java.library.path: /usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
[INFO ] - Completed 1000 requests
[INFO ] - Completed 2000 requests
[INFO ] - Completed 3000 requests
[INFO ] - Completed 4000 requests
[INFO ] - Completed 5000 requests
[INFO ] - Inference result: [-5.5703076E-37, -8.1853876E14, -1.0163006E-4 ...]
[INFO ] - Throughput: 45.93, completed 5000 iteration in 108853 ms.
[INFO ] - Model loading time: 2376.436 ms.
[INFO ] - total P50: 161.868 ms, P90: 219.891 ms, P99: 295.199 ms
[INFO ] - inference P50: 159.389 ms, P90: 216.400 ms, P99: 292.355 ms
[INFO ] - preprocess P50: 1.756 ms, P90: 4.570 ms, P99: 29.588 ms
[INFO ] - postprocess P50: 0.047 ms, P90: 0.087 ms, P99: 0.545 ms

Deprecated Gradle features were used in this build, making it incompatible with Gradle 7.0.
Use '--warning-mode all' to show the individual deprecation warnings.
See https://docs.gradle.org/6.7.1/userguide/command_line_interface.html#sec:command_line_warnings

BUILD SUCCESSFUL in 1m 53s
20 actionable tasks: 1 executed, 19 up-to-date

The 1% difference doesn't seem to be significant given the amount of variance I saw (throughput for GC: 48.69, 44.48, 45.93, throughput for no GC: 47.38, 46.58, 45.30). Do you have a benchmark which does show a difference?

skirdey commented 3 years ago

./gradlew benchmark -Dorg.bytedeco.javacpp.nopointergc=true -Dai.djl.default_engine=TensorFlow -Dai.djl.repository.zoo.location="https://storage.googleapis.com/tfhub-modules/tensorflow/resnet_50/classification/1.tar.gz?artifact_id=tf_resnet" --args='-n tf_resnet -t 10 -c 1000000 -s 1,224,224,3'

OOMs for me in a few minutes on my laptop

Craigacp commented 3 years ago

./gradlew benchmark -Dorg.bytedeco.javacpp.nopointergc=true -Dai.djl.default_engine=TensorFlow -Dai.djl.repository.zoo.location="https://storage.googleapis.com/tfhub-modules/tensorflow/resnet_50/classification/1.tar.gz?artifact_id=tf_resnet" --args='-n tf_resnet -t 10 -c 1000000 -s 1,224,224,3' OOMs for me in a few minutes on my laptop

I think we'd recommend you leave the GC on unless it noticeably slows things down. I was wondering if there is a benchmark which shows a slowdown with the GC on.

karllessard commented 3 years ago

@skirdey, nopointergc must not be set to true. Also, you might need to comment out this line in the benchmark, as I did, since the metrics collected by DJL during the benchmark end up taking a lot of space; please read my previous comment about that.

Meanwhile I'll look to run a different test, thanks

skirdey commented 3 years ago

For my use case, I cannot really have a blocking thread listening for GC; I run Scala / Akka on my end and it somehow affects the performance of the entire ecosystem.

Craigacp commented 3 years ago

Do you have this problem with Java 9 cleaners too? They are increasingly used inside the JVM to deal with resource cleanup and it's basically the same idiom.

skirdey commented 3 years ago

Still on Java 8 :/

skirdey commented 3 years ago

I'll try out the new fix with nopointergc=false and report back.

karllessard commented 3 years ago

Thanks @skirdey , please don't forget to disable DJL's metrics since they can grow drastically during the benchmark and cause an OOM if you are running on a low-memory JVM.

stu1130 commented 3 years ago

@saudet I am testing my new refactoring that depends purely on the JavaCPP layer. What we do now is disable GC, track every native resource, and close it as soon as we no longer need it. But I found there is a small incremental leak in native memory. Is there a way to track it down? I tried the command you showed in another issue but it didn't print anything. Does it only work on Pointers attached to a PointerScope, or does it work on every Pointer we create?

./gradlew benchmark -Dorg.bytedeco.javacpp.nopointergc=true -Dai.djl.default_engine=TensorFlow -Dai.djl.repository.zoo.location="https://storage.googleapis.com/tfhub-modules/tensorflow/resnet_50/classification/1.tar.gz?artifact_id=tf_resnet" --args='-n tf_resnet -c 10000 -s 1,224,224,3' -Dorg.bytedeco.javacpp.logger.debug -DargLine=-Xmx200m 2>&1 | grep Collecting | grep -v 'ownerAddress=0x0'

I found we don't release the TF_Output pointer in session run. Could that be the reason?

saudet commented 3 years ago

When running the benchmark above I get a bunch of warnings out of JavaCPP. @saudet any idea what's causing that?

@Craigacp It just means that the jnijavacpp library itself is missing, but same as https://github.com/tensorflow/java/issues/189#issuecomment-760594392, we can ignore those warnings, unless we get some other loading error that is...

@saudet I am testing my new refactoring that depends purely on the JavaCPP layer. What we do now is disable GC, track every native resource, and close it as soon as we no longer need it. But I found there is a small incremental leak in native memory. Is there a way to track it down? I tried the command you showed in another issue but it didn't print anything. Does it only work on Pointers attached to a PointerScope, or does it work on every Pointer we create?

./gradlew benchmark -Dorg.bytedeco.javacpp.nopointergc=true -Dai.djl.default_engine=TensorFlow -Dai.djl.repository.zoo.location="https://storage.googleapis.com/tfhub-modules/tensorflow/resnet_50/classification/1.tar.gz?artifact_id=tf_resnet" --args='-n tf_resnet -c 10000 -s 1,224,224,3' -Dorg.bytedeco.javacpp.logger.debug -DargLine=-Xmx200m 2>&1 | grep Collecting | grep -v 'ownerAddress=0x0'

I found we don't release the TF_Output pointer in session run. Could that be the reason?

When "org.bytedeco.javacpp.nopointergc" is true, JavaCPP won't be trying to use GC, so we're not going to get any "Collecting ..." lines in the debug log. You'll need to enable GC temporarily to see if it can pick up something that way. It won't be able to track anything that isn't registered with it though, so if you're still not seeing anything with GC enabled, I would say it's something it doesn't know about that is leaking. However, native arrays such as the one that you're linking to for TF_Output get allocated with the new operator from C++, and JavaCPP automatically registers deallocators that call their destructors using the delete operator for those, so if they were "leaking", it would most likely show up in JavaCPP's debug log. Also, it's created inside a PointerScope, so it's almost certainly getting deallocated there. To make sure, you can check that outputs.isNull() is true after leaving the scope.
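
As a concrete sketch of that check (plain JavaCPP, with an IntPointer standing in for the natively allocated array):

import org.bytedeco.javacpp.IntPointer;
import org.bytedeco.javacpp.PointerScope;

public final class ScopeNullCheckSketch {
    public static void main(String[] args) {
        IntPointer outputs;
        try (PointerScope scope = new PointerScope()) {
            outputs = new IntPointer(16); // attached to the innermost scope on creation
            // ... pass outputs to native code ...
        }
        // The scope deallocated the native memory and zeroed the address on close.
        System.out.println("outputs.isNull() = " + outputs.isNull()); // expected: true
    }
}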

As for detecting memory leaks in native code in general, I find that Valgrind works the least badly with the JVM: https://github.com/deeplearning4j/libnd4j/wiki/Debugging-libnd4j Nevertheless, with a bit more hacking, it's also possible to use Address Sanitizer: http://btorpey.github.io/blog/2014/03/27/using-clangs-address-sanitizer/

stu1130 commented 3 years ago

So here are some interesting results to share from my side. As I suggested in my previous post, I explicitly disabled the collection of DJL metrics inside the benchmark and was able to successfully complete the task in 18 hours on my Ubuntu laptop with only 128 MB of JVM memory, without OOM (GC was enabled).

Now that does not necessarily mean the DJL metrics were the only thing leaking; maybe it just gave a bit more room for others, so it would be interesting to do the test with another scenario. @skirdey, didn't you have another application you wanted to test version 0.3.0 with?

BTW, there is a new leak that was introduced in 0.3.0 when upgrading to TF 2.4.1, but it is only related to String tensors (which were not used in the current benchmark, AFAIK). Please just make sure you don't use String tensors in your test applications until we push a fix in the upcoming 0.3.1 version.

I ran the same test again (GC enabled && setMetrics commented out && jvmArgs = ["-Xmx128", "-Xmx128m"]) but unfortunately got an OOM. Command:

./gradlew benchmark -Dai.djl.default_engine=TensorFlow -Dai.djl.repository.zoo.location=https://storage.googleapis.com/tfhub-modules/tensorflow/resnet_50/classification/1.tar.gz?artifact_id=tf_resnet --args='-t 10 -n tf_resnet -c 100000 -s 1,224,224,3'
[ERROR] -
java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
        at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:?]
        at java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:?]
        at ai.djl.examples.inference.benchmark.MultithreadedBenchmark.predict(MultithreadedBenchmark.java:86) [main/:?]
        at ai.djl.examples.inference.benchmark.util.AbstractBenchmark.runBenchmark(AbstractBenchmark.java:132) [main/:?]
        at ai.djl.examples.inference.benchmark.Benchmark.main(Benchmark.java:31) [main/:?]
Caused by: java.lang.OutOfMemoryError: Java heap space
[ERROR] - Only 0/10 threads finished.

> Task :examples:benchmark FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':examples:benchmark'.
> Process 'command '/usr/lib/jvm/java-12-openjdk-amd64/bin/java'' finished with non-zero exit value 255

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full
insights.

* Get more help at https://help.gradle.org

Deprecated Gradle features were used in this build, making it incompatible with Gradle 7.0.
Use '--warning-mode all' to show the individual deprecation warnings.
See https://docs.gradle.org/6.7.1/userguide/command_line_interface.html#sec:command_line_warnings

BUILD FAILED in 2h 1m 59s

Maybe I am using a more powerful CPU machine (c5.2xlarge), so small objects like Pointer accumulate faster and reach OOM.

karllessard commented 3 years ago

I ran the same test again (GC enabled && setMetrics commented out && jvmArgs = ["-Xmx128", "-Xmx128m"])

This is strange, indeed, and not easy to debug if I cannot reproduce it on my side. @Craigacp , do you still have access to these fast machines so we can run the same test on them?

BTW, TF Java 0.3.1 was released this morning; it not only fixes a leak with the new string tensor but also potentially another leak (never observed so far) that could happen when closing multiple graph sessions. It might be better to work with this version from now on.

saudet commented 3 years ago

@stu1130 128 MB is not a lot of memory for the JVM itself + a couple of things from TF and DJL. Make sure to try with something a bit larger like at least 256 or 512 MB as well.

stu1130 commented 3 years ago

@saudet I tested my new refactoring (1 GB JVM size && nopointergc=true) and it went OOM after 2 hours. I used jconsole and found that native non-heap memory is pretty stable, but the old gen kept growing until the OOM exception, with several old gen GC attempts. I will share my experiment with nopointergc=false (GC enabled) as a baseline.

stu1130 commented 3 years ago

@saudet Experiment 1:

./gradlew benchmark -Dai.djl.default_engine=TensorFlow -Dai.djl.repository.zoo.location=https://storage.googleapis.com/tfhub-modules/tensorflow/resnet_50/classification/1.tar.gz?artifact_id=tf_resnet --args='-t 10 -n tf_resnet -c 100000 -s 1,224,224,3' -Dorg.bytedeco.javacpp.nopointergc=true

Here is a JVM heap dump 3-4 minutes before OOM. I tried hitting the "perform GC" button several times. It looks like DeallocatorReference, NativeDeallocator, and AtomicInteger are the root cause of the leak.

[screenshot: heap dump before OOM]

Experiment 2:

./gradlew benchmark -Dai.djl.default_engine=TensorFlow -Dai.djl.repository.zoo.location=https://storage.googleapis.com/tfhub-modules/tensorflow/resnet_50/classification/1.tar.gz?artifact_id=tf_resnet --args='-t 10 -n tf_resnet -c 1000000 -s 1,224,224,3' -Dorg.bytedeco.javacpp.nopointergc=false

Here is the jconsole snapshot

[screenshot: jconsole memory view]

Here is the heap dump

[screenshot: heap dump]

It turns out the majority of memory usage is from Metric objects. So with GC enabled, the system is able to release the Pointer objects. @karllessard it also shows that removing the metrics helps a lot.

@saudet Is there a way to allow GC to release those pointer objects while -Dorg.bytedeco.javacpp.nopointergc=true?

I think I found the problem. When nopointergc is true, deallocators keep getting appended to a LinkedList-like structure via pointer.deallocator(new DeleteDeallocator(o)) or pointer.deallocator(new Pointer.NativeDeallocator(this, ownerAddress, deallocatorAddress)) when the pointer is a PointerPointer or IntPointer, but remove() never gets called. Each node in the list holds a strong reference to the next node, causing the memory leak.
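
To illustrate the pattern being described, here is a simplified sketch of the leak mechanism (not JavaCPP's actual code):

// Deallocator references are appended to an intrusive linked list on every allocation;
// with nopointergc=true nothing ever removes them, so the chain of strong references
// keeps every node reachable forever.
final class LeakyRefList {
    static final class Node {
        Node prev, next;
    }

    private static Node head;

    static synchronized void add(Node n) { // called on each allocation
        n.next = head;
        if (head != null) {
            head.prev = n;
        }
        head = n;
    }

    static synchronized void remove(Node n) { // never called when GC-driven cleanup is disabled
        if (n.prev != null) n.prev.next = n.next;
        if (n.next != null) n.next.prev = n.prev;
        if (head == n) head = n.next;
        n.prev = n.next = null;
    }
}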

saudet commented 3 years ago

Yes, I've noticed as well. I fixed this in commit https://github.com/bytedeco/javacpp/commit/7f27899578dfa18e22738a3dd49701e1806b464a along with proper testing for that case. Sorry about that. In practice, though, there is little to be gained by disabling GC entirely. As long as we make sure to explicitly deallocate everything, we're not going to lose any performance.

Incidentally, this gives us a benchmark about the kind of performance we can expect with and without relying on GC. On my machine (Fedora 32), PointerTest#testDeallocator produces these kinds of numbers:

(Windows gives me similar but slightly worse results.) The time for the latter explicit case doesn't change regardless of whether "org.bytedeco.javacpp.nopointergc" is true or not. Clearly, the problem isn't only with GC itself, but also with allocating native memory in one thread, and deallocating it in another thread...

stu1130 commented 3 years ago

@saudet Awesome! I was testing my new code with an explicit deallocate() call when I am done with a pointer. Looks like I don't need to create that PR! So can I now rely entirely on PointerScope without any explicit deallocate() call, or do you still recommend adding a deallocate call at the end of the PointerScope? The reason we would like nopointergc=true is not only about GC but also that we don't want the blocking deallocator thread that calls System.gc(). But thanks for that experiment; we are clearer now about which direction is right. Let me know when the 1.5.6-SNAPSHOT is out and when you are going to release 1.5.6 with that fix! Thanks!

saudet commented 3 years ago

PointerScope.close() ends up calling Pointer.deallocate() under the hood, so it's the same as calling it directly. In other words, it's explicit deallocation that doesn't rely on GC. About System.gc(), another way to prevent calls to that is by setting maxBytes to 0, but that's already being done in DJL by default: https://github.com/frankfliu/djl/blob/master/tensorflow/tensorflow-engine/src/main/java/ai/djl/tensorflow/engine/LibUtils.java#L52 So it should never get called anyway and you don't need to have a new version of JavaCPP for that. Are you concerned about anything else though?
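
A small sketch of that configuration (property names per JavaCPP; they must be set before any Pointer class is initialized, which DJL's LibUtils already takes care of according to the link above):

import org.bytedeco.javacpp.Pointer;

public final class JavaCppLimitsSketch {
    public static void main(String[] args) {
        // With both limits set to 0, JavaCPP never calls System.gc() to try to
        // reclaim native memory, while the deallocator thread can stay enabled.
        System.setProperty("org.bytedeco.javacpp.maxbytes", "0");
        System.setProperty("org.bytedeco.javacpp.maxphysicalbytes", "0");
        System.out.println("maxBytes = " + Pointer.maxBytes());
    }
}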

stu1130 commented 3 years ago

That's all I need. Thanks. I checked javacpp 1.6.0-SNAPSHOT but found the fix https://github.com/bytedeco/javacpp/commit/7f27899578dfa18e22738a3dd49701e1806b464a is not included yet. Let me know when it is available. I would like to try it out!

saudet commented 3 years ago

It's in 1.5.6-SNAPSHOT: https://github.com/bytedeco/javacpp/blob/7f27899578dfa18e22738a3dd49701e1806b464a/pom.xml#L7

1.6.0-SNAPSHOT is something else that's probably not going to happen.

stu1130 commented 3 years ago

@saudet So far I haven't seen DeallocatorReference & NativeReference accumulating any more with the new fix! One more question: what is the life cycle of the "Pointer" returned by TF_TensorData(TF_Tensor)? Should we call close() or deallocate()? I found it doesn't call setNull(), so isNull() is always false even if I close the TF_Tensor.

skirdey commented 3 years ago

Accumulating DeallocatorReference & NativeReference is what I also saw at the root of this thread, so it is awesome that it is fixed! I don't think there were any other issues when running DJL / TF / nopointergc=true in prod, but I need to try running inference again at scale.