PredictorContext performs very poorly in online prediction

qshian commented 3 years ago

We tested DJL 0.11.0-SNAPSHOT and tensorflow-core-platform 0.3.1 separately. Both make their calls through JNI, and we found that DJL's performance is far worse than that of tensorflow-core-platform.

frankfliu commented 3 years ago

@qshian We benchmarked 0.10.0 against tf-java and didn't see any difference. We made significant changes in 0.11.0-SNAPSHOT due to a tf-java memory leak issue, and we are still working on it.

Would you please share your benchmark script so we can try to reproduce your test?

lanking520 commented 3 years ago

@qshian You can add me on WeChat and I can help look at the benchmark part: lankingsonic

lanking520 commented 3 years ago

We identified a recent change that is causing the performance issue and have already raised a PR to revert that change:

https://github.com/deepjavalibrary/djl/pull/909

We will keep monitoring newer changes to avoid similar issues.

frankfliu commented 3 years ago

@qshian I ran a benchmark comparing tf-java and DJL, and I got the completely opposite result from your test.

Test configuration

Model: MobileNet V2 from the DJL model zoo. You should be able to benchmark your local model as well (only float32 is supported; the benchmark script needs changes to support other data types). See: http://docs.djl.ai/docs/development/benchmark_with_djl.html#local-directory
Machine: AWS EC2 c5.4xlarge (16 CPU)
threads: 16
iterations per thread: 10000
input data shape: 1,224,224,3
DJL version: 0.11.0-SNAPSHOT
TF-java version: 0.3.1
Tensorflow engine version: 2.4.1

Test script

DJL

git clone https://github.com/deepjavalibrary/djl.git
cd djl
./gradlew benchmark -Dai.djl.default_engine=TensorFlow --args='-n mobilenet -c 10000 -t 16 -s 1,224,224,3'

TF-Java

git clone https://github.com/frankfliu/djl.git -b tf-benchmark
cd djl
./gradlew benchmark -Dai.djl.default_engine=TensorFlow --args='-n mobilenet -c 10000 -t 16 -s 1,224,224,3'

Result

	Throughput	Latency P50 (ms)	Latency P90 (ms)	Latency P99 (ms)
DJL	317.65	50.182	51.172	54.370
TF-java	310.59	51.829	52.822	55.172
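
For readers who want to reproduce figures like these from their own runs, here is a minimal sketch of how throughput and latency percentiles can be derived from per-request latencies. `LatencyStats` is a hypothetical helper, not part of DJL or TF-java. The nearest-rank percentile method and the requests-per-second unit are assumptions; the DJL benchmark tool may compute its numbers differently.

```java
import java.util.Arrays;

/** Hypothetical helper (not part of DJL or TF-java) for summarizing benchmark latencies. */
final class LatencyStats {

    /**
     * Nearest-rank percentile: the smallest sample such that at least
     * p percent of all samples are less than or equal to it.
     */
    static double percentile(double[] latenciesMs, double p) {
        if (latenciesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        double[] sorted = latenciesMs.clone(); // do not reorder the caller's array
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Throughput as completed requests divided by wall-clock seconds (assumed unit). */
    static double throughput(long totalRequests, double wallClockSeconds) {
        if (wallClockSeconds <= 0) {
            throw new IllegalArgumentException("wallClockSeconds must be positive");
        }
        return totalRequests / wallClockSeconds;
    }

    public static void main(String[] args) {
        // Made-up sample latencies in milliseconds, for illustration only.
        double[] samples = {50.1, 50.3, 49.8, 51.2, 54.4, 50.0, 50.2, 50.9, 50.4, 50.6};
        System.out.printf("P50=%.3f P90=%.3f P99=%.3f%n",
                percentile(samples, 50), percentile(samples, 90), percentile(samples, 99));
    }
}
```

With 16 threads, the wall-clock time passed to `throughput` should cover the whole run across all threads, not the sum of per-thread times.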

Analysis

DJL and TF-java have almost identical performance. DJL uses TF-java 0.3.1 underneath, so we don't expect a significant performance difference. Overall, DJL has slightly higher throughput and better P90 and P99 latency.

Both DJL and TF-java can fully utilize system resources (100% of all CPUs) in the multithreaded inference case. The following environment variables must be configured to get the highest throughput, see: http://docs.djl.ai/docs/development/inference_performance_optimization.html#thread-configuration_1

export OMP_NUM_THREADS=1
export TF_NUM_INTEROP_THREADS=1
export TF_NUM_INTRAOP_THREADS=1

See: https://github.com/deepjavalibrary/djl/blob/master/examples/build.gradle#L78-L79
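
Since a missing variable silently changes the threading behavior, it can help to verify the settings before starting a benchmark. The following is a minimal sketch; `ThreadEnvCheck` is a hypothetical helper, not part of DJL, and it only checks that the three variables above are set to `1`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical pre-flight check (not part of DJL) for the thread settings listed above. */
final class ThreadEnvCheck {

    static final String[] REQUIRED = {
        "OMP_NUM_THREADS", "TF_NUM_INTEROP_THREADS", "TF_NUM_INTRAOP_THREADS"
    };

    /** Returns one "NAME=value" entry for each variable that is unset or not equal to "1". */
    static List<String> misconfigured(Map<String, String> env) {
        List<String> problems = new ArrayList<>();
        for (String name : REQUIRED) {
            String value = env.get(name);
            if (!"1".equals(value)) {
                problems.add(name + "=" + (value == null ? "<unset>" : value));
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        List<String> problems = misconfigured(System.getenv());
        if (problems.isEmpty()) {
            System.out.println("Thread configuration matches the recommended settings.");
        } else {
            System.out.println("Unexpected thread configuration: " + problems);
        }
    }
}
```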

TF-java relies on Java GC to release native tensor memory, while DJL releases the memory immediately in the inference thread. This implementation difference has the following impact:

A single DJL inference call takes slightly longer, since we free the memory inside the inference call.
TF-java's GC overhead is much higher and introduces more GC pauses. This affects the whole JVM, including P90 and P99 latency, and in extreme cases it can cause a GC OOM.
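
The trade-off between the two release strategies can be illustrated with a toy model. This is only a sketch: `ReleaseModel` and `FakeTensor` are hypothetical names, no native memory is allocated, and a periodic "collection" every fixed number of calls is an assumed stand-in for real GC behavior, which is far less regular. The per-call size of 602112 bytes used in the example corresponds to one float32 input of shape 1,224,224,3.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of native tensor memory: sizes are tracked in a counter, nothing is really allocated. */
final class ReleaseModel {

    /** Stand-in for a native tensor whose memory is returned when close() is called. */
    static final class FakeTensor implements AutoCloseable {
        private final long[] liveBytes; // shared counter of outstanding bytes
        private final long size;

        FakeTensor(long[] liveBytes, long size) {
            this.liveBytes = liveBytes;
            this.size = size;
            liveBytes[0] += size;
        }

        @Override
        public void close() {
            liveBytes[0] -= size;
        }
    }

    /** Eager style: each call frees its tensor before returning, at a small per-call cost. */
    static long peakWithEagerRelease(int calls, long bytesPerCall) {
        long[] live = {0};
        long peak = 0;
        for (int i = 0; i < calls; i++) {
            try (FakeTensor t = new FakeTensor(live, bytesPerCall)) {
                peak = Math.max(peak, live[0]);
            }
        }
        return peak;
    }

    /** Deferred style: tensors pile up until a simulated collection frees them in one batch. */
    static long peakWithDeferredRelease(int calls, long bytesPerCall, int callsPerCollection) {
        long[] live = {0};
        long peak = 0;
        Deque<FakeTensor> unreachable = new ArrayDeque<>();
        for (int i = 1; i <= calls; i++) {
            unreachable.add(new FakeTensor(live, bytesPerCall));
            peak = Math.max(peak, live[0]);
            if (i % callsPerCollection == 0) {
                // Simulated collection: all outstanding tensors are released at once.
                while (!unreachable.isEmpty()) {
                    unreachable.poll().close();
                }
            }
        }
        return peak;
    }

    public static void main(String[] args) {
        long bytesPerCall = 1L * 224 * 224 * 3 * 4; // one float32 tensor of shape 1,224,224,3
        System.out.println("eager peak bytes:    " + peakWithEagerRelease(1000, bytesPerCall));
        System.out.println("deferred peak bytes: " + peakWithDeferredRelease(1000, bytesPerCall, 100));
    }
}
```

In this model the eager variant never holds more than one tensor, while the deferred variant's peak grows with the number of calls between collections. The batch release is also where a real collector would spend its pause time.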

frankfliu commented 3 years ago

Feel free to reopen this issue if you have further questions.
