Closed qshian closed 3 years ago
@qshian We benchmarked 0.10.0 against tf-java and didn't see any difference. We made significant changes in 0.11.0-SNAPSHOT due to a tf-java memory leak issue, and we are still working on it.
Would you please share your benchmark script so we can try to reproduce your test?
@qshian You can add me on WeChat and I can help you look at the benchmark part: lankingsonic
We identified a recent change that is causing the performance issue and have already raised a PR to revert it:
https://github.com/deepjavalibrary/djl/pull/909
We will keep monitoring newer changes to avoid similar issues.
@qshian I ran a benchmark comparing tf-java and DJL, and I got the completely opposite result from your test.
DJL
git clone https://github.com/deepjavalibrary/djl.git
cd djl
./gradlew benchmark -Dai.djl.default_engine=TensorFlow --args='-n mobilenet -c 10000 -t 16 -s 1,224,224,3'
TF-Java
git clone https://github.com/frankfliu/djl.git -b tf-benchmark
cd djl
./gradlew benchmark -Dai.djl.default_engine=TensorFlow --args='-n mobilenet -c 10000 -t 16 -s 1,224,224,3'
| Engine | Throughput | Latency P50 (ms) | Latency P90 (ms) | Latency P99 (ms) |
|---|---|---|---|---|
| DJL | 317.65 | 50.182 | 51.172 | 54.370 |
| TF-java | 310.59 | 51.829 | 52.822 | 55.172 |
DJL and TF-java have almost identical performance. DJL uses TF-java 0.3.1 underneath, so we don't expect a significant performance difference. Overall, DJL has slightly higher throughput and lower P90 and P99 latency.
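For readers who want to check numbers like these against their own runs, here is a minimal sketch of how P50/P90/P99 and throughput can be derived from raw per-request latencies. It is not the DJL benchmark tool's code: the class name `LatencyPercentiles`, the nearest-rank percentile definition, and the requests-per-second interpretation of throughput are all assumptions made for illustration.

```java
import java.util.Arrays;

// Hypothetical helper, not part of DJL or TF-java.
final class LatencyPercentiles {

    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    static double percentile(double[] samplesMs, double p) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        double[] sorted = samplesMs.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    // Assumed definition: completed requests divided by wall-clock seconds.
    static double throughput(long requests, double elapsedSeconds) {
        return requests / elapsedSeconds;
    }

    public static void main(String[] args) {
        double[] latenciesMs = {5.0, 1.0, 3.0, 2.0, 4.0};
        System.out.println("P50 = " + percentile(latenciesMs, 50));
        System.out.println("P90 = " + percentile(latenciesMs, 90));
        System.out.println("P99 = " + percentile(latenciesMs, 99));
        System.out.println("throughput = " + throughput(10_000, 20.0));
    }
}
```

Other percentile definitions (for example, linear interpolation between ranks) give slightly different values on small sample sets, so compare numbers only when they come from the same tool.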
Both DJL and TF-java can fully utilize the system resources (100% of all CPUs) in the multithreaded inference case. The following environment variables must be set to get the highest throughput; see: http://docs.djl.ai/docs/development/inference_performance_optimization.html#thread-configuration_1
export OMP_NUM_THREADS=1
export TF_NUM_INTEROP_THREADS=1
export TF_NUM_INTRAOP_THREADS=1
See: https://github.com/deepjavalibrary/djl/blob/master/examples/build.gradle#L78-L79
TF-java relies on the Java GC to release native tensor memory, while DJL releases the memory immediately in the inference thread. This implementation difference has the following impact:
Feel free to reopen this issue if you have further questions.
We tested DJL 0.11.0-SNAPSHOT and tensorflow-core-platform 0.3.1 separately. Both are invoked via JNI, and we found that DJL's performance is far worse than that of tensorflow-core-platform.
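To make the two release strategies concrete, here is a toy sketch in plain Java. It uses neither the DJL nor the TF-java API: `FakeNativeTensor` is a hypothetical stand-in for a tensor backed by native memory, and the counter only simulates how many native buffers are still allocated when the loop ends.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy model only; FakeNativeTensor is hypothetical and allocates nothing native.
final class ReleaseStrategies {

    static final class FakeNativeTensor implements AutoCloseable {
        private final AtomicInteger live;
        private boolean closed;

        FakeNativeTensor(AtomicInteger live) {
            this.live = live;
            live.incrementAndGet(); // simulate a native allocation
        }

        void use() {
            // placeholder for running inference on the tensor
        }

        @Override
        public void close() {
            if (!closed) {
                closed = true;
                live.decrementAndGet(); // simulate freeing the native buffer
            }
        }
    }

    // Immediate release: each tensor is closed in the inference thread
    // as soon as the iteration finishes.
    static int liveAfterImmediateRelease(int iterations) {
        AtomicInteger live = new AtomicInteger();
        for (int i = 0; i < iterations; i++) {
            try (FakeNativeTensor t = new FakeNativeTensor(live)) {
                t.use();
            }
        }
        return live.get();
    }

    // GC-dependent release: nothing is closed explicitly. A real binding
    // would free the buffer only once the collector reclaims the wrapper,
    // so buffers stay allocated until then; this toy never frees them.
    static int liveWithoutExplicitRelease(int iterations) {
        AtomicInteger live = new AtomicInteger();
        for (int i = 0; i < iterations; i++) {
            new FakeNativeTensor(live).use();
        }
        return live.get();
    }

    public static void main(String[] args) {
        System.out.println("immediate release, still live: " + liveAfterImmediateRelease(1000));
        System.out.println("no explicit release, still live: " + liveWithoutExplicitRelease(1000));
    }
}
```

In the toy, the first loop never has more than one buffer allocated at a time, while the second accumulates one per iteration. In a real JVM, how long unreleased buffers linger depends on when the collector runs, which the Java heap size does not tie to native memory pressure.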