deepjavalibrary / djl

An Engine-Agnostic Deep Learning Framework in Java
https://djl.ai
Apache License 2.0

PredictorContext performs too poorly in online prediction #899

Closed: qshian closed this issue 3 years ago

qshian commented 3 years ago

[screenshot: benchmark results]

We tested DJL 0.11.0-SNAPSHOT and tensorflow-core-platform 0.3.1 separately, both invoked via JNI, and found that DJL's performance is far worse than tensorflow-core-platform's.

[screenshot: benchmark results]
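For context, a minimal sketch of what a DJL-side benchmark loop of this kind could look like (the model URL, iteration count, and input shape are illustrative placeholders, not the reporter's actual script):

import ai.djl.inference.Predictor;
import ai.djl.ndarray.NDList;
import ai.djl.ndarray.NDManager;
import ai.djl.ndarray.types.Shape;
import ai.djl.repository.zoo.Criteria;
import ai.djl.repository.zoo.ModelZoo;
import ai.djl.repository.zoo.ZooModel;

public class PredictorBenchmark {
    public static void main(String[] args) throws Exception {
        // Hypothetical model location; substitute your own TensorFlow SavedModel.
        Criteria<NDList, NDList> criteria = Criteria.builder()
                .setTypes(NDList.class, NDList.class)
                .optEngine("TensorFlow")
                .optModelUrls("file:///path/to/saved_model")
                .build();

        try (ZooModel<NDList, NDList> model = ModelZoo.loadModel(criteria);
             Predictor<NDList, NDList> predictor = model.newPredictor();
             NDManager manager = NDManager.newBaseManager()) {
            // Placeholder input tensor of shape 1x224x224x3.
            NDList input = new NDList(manager.ones(new Shape(1, 224, 224, 3)));

            int iterations = 10_000;
            long start = System.nanoTime();
            for (int i = 0; i < iterations; i++) {
                predictor.predict(input).close(); // release native output tensors eagerly
            }
            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("Throughput: %.2f inferences/s%n", iterations / seconds);
        }
    }
}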

frankfliu commented 3 years ago

@qshian We benchmarked 0.10.0 against tf-java and didn't see any difference. We made significant changes in 0.11.0-SNAPSHOT due to a tf-java memory leak issue, and we are still working on it.

Would you please share your benchmark script so we can try to reproduce your test?

lanking520 commented 3 years ago

@qshian You can add me on WeChat and I'll help you look at the benchmark part: lankingsonic

lanking520 commented 3 years ago

We identified a recent change that was causing the performance issue and have already raised a PR to revert it:

https://github.com/deepjavalibrary/djl/pull/909

We will keep monitoring newer changes to avoid similar issues.

frankfliu commented 3 years ago

@qshian I ran a benchmark comparing tf-java and DJL, and I got the completely opposite result from your test.

Test configuration

Test script

DJL

git clone https://github.com/deepjavalibrary/djl.git
cd djl
./gradlew benchmark -Dai.djl.default_engine=TensorFlow --args='-n mobilenet -c 10000 -t 16 -s 1,224,224,3'

TF-Java

git clone https://github.com/frankfliu/djl.git -b tf-benchmark
cd djl
./gradlew benchmark -Dai.djl.default_engine=TensorFlow --args='-n mobilenet -c 10000 -t 16 -s 1,224,224,3'

Result

Engine    Throughput  Latency P50 (ms)  Latency P90 (ms)  Latency P99 (ms)
DJL       317.65      50.182            51.172            54.370
TF-java   310.59      51.829            52.822            55.172

Analysis

DJL and TF-java have almost identical performance. DJL uses TF-java 0.3.1 underneath, so we don't expect a significant performance difference. Overall, DJL has slightly higher throughput and better P90 and P99 latency.

Both DJL and TF-java can fully utilize system resources (100% of all CPUs) in the multithreaded inference case. The following environment variables must be configured to get the highest throughput; see: http://docs.djl.ai/docs/development/inference_performance_optimization.html#thread-configuration_1

export OMP_NUM_THREADS=1
export TF_NUM_INTEROP_THREADS=1
export TF_NUM_INTRAOP_THREADS=1

See: https://github.com/deepjavalibrary/djl/blob/master/examples/build.gradle#L78-L79

TF-java relies on the Java GC to release native tensor memory, while DJL releases the memory immediately in the inference thread. This implementation difference has the following impact (see the sketch after this list):

  1. DJL's single-inference time will be slightly longer, since we delete the memory inside the inference call.
  2. TF-java's GC overhead will be much higher and will introduce more GC pauses. This impacts the whole JVM, including P90 and P99 performance, and in extreme cases can cause a GC OOM.
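
To make the DJL side of this difference concrete, here is a minimal sketch of the NDManager scoping pattern (the shape and the arithmetic are placeholders; the try-with-resources lifecycle is the point):

import ai.djl.ndarray.NDArray;
import ai.djl.ndarray.NDManager;
import ai.djl.ndarray.types.Shape;

public class MemoryScopeDemo {
    public static void main(String[] args) {
        // An NDManager owns the native memory of every tensor created under it.
        try (NDManager manager = NDManager.newBaseManager()) {
            NDArray input = manager.ones(new Shape(1, 224, 224, 3)); // placeholder tensor
            NDArray scaled = input.mul(2); // also owned by `manager`
            System.out.println(scaled.getShape());
        } // close() frees all native tensors right here, on the calling thread,
          // rather than waiting for the Java GC to collect them
    }
}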
frankfliu commented 3 years ago

Feel free to reopen this issue if you have further questions.