deeplearning4j / deeplearning4j


DL4J: Possible ResNet50 Inference Regression - beta3 to snapshots? (CPU) #7271

Open AlexDBlack opened 5 years ago

AlexDBlack commented 5 years ago

Unconfirmed/not yet reproduced, as reported in gitter:

dollarHome @dollarHome 11:07: It's a 24c Xeon. With the same config earlier, I got 38 ms/image with beta3; the very same code/config is resulting in 62 ms/image on the snapshot.

Earlier benchmarks we ran on ResNet50 (CPU) suggested that training performance is better on snapshots than it was on beta3. Some possibilities: (a) performance has regressed again; (b) training is faster but inference is slower; (c) relative performance (beta3 vs. snapshots) is hardware dependent.

Aha! Link: https://skymindai.aha.io/features/DL4J-6

dollarHome commented 5 years ago

Hi Alex @AlexDBlack

Attached are the test code and pom files I am using to test the performance of ResNet50. The test code is in PretrainedClassification.java. I did 2 things in this code, as you will see in the comments.

Approach 1: Verified that I was getting the same results with the pretrained SqueezeNet and ResNet50 zoo models, using only 1 static image (you can ignore this part of the code).

Approach 2: Measured the performance of the actual target under consideration, ResNet50. I ran warm-up cycles and discarded that data. After that, I took measurements per batch and reported (on gitter) the average time per image across the batches. I used a batch size of 512 and about 2500 images from the ImageNet data (link in the code itself).
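For anyone trying to reproduce this without the zip, the measurement scheme described in Approach 2 could be sketched roughly as below. This is not the attached code: `BatchTimer`, `timeBatches` and `msPerImage` are hypothetical names, the workload in `main` is a dummy loop standing in for the network's forward pass, and the batch sizes assume exactly 2500 images (the report only says "about 2500"). It also assumes the per-image average is total measured time divided by total images, which may differ from how the original numbers were averaged.

```java
import java.util.function.IntConsumer;

/** Hypothetical timing helper; not part of the attached test code. */
final class BatchTimer {

    /** Average milliseconds per image: total measured time / total images. */
    static double msPerImage(long[] batchNanos, int[] batchSizes) {
        long totalNanos = 0;
        long totalImages = 0;
        for (int i = 0; i < batchNanos.length; i++) {
            totalNanos += batchNanos[i];
            totalImages += batchSizes[i];
        }
        return totalNanos / 1e6 / totalImages;
    }

    /**
     * Runs warmupBatches untimed batches, then times each measured batch.
     * runBatch receives the batch size; in a real test it would run
     * inference on that many images.
     */
    static long[] timeBatches(IntConsumer runBatch, int[] batchSizes, int warmupBatches) {
        long[] nanos = new long[batchSizes.length];
        if (batchSizes.length == 0) {
            return nanos;
        }
        for (int i = 0; i < warmupBatches; i++) {
            runBatch.accept(batchSizes[0]); // warm-up: result and timing discarded
        }
        for (int i = 0; i < batchSizes.length; i++) {
            long start = System.nanoTime();
            runBatch.accept(batchSizes[i]);
            nanos[i] = System.nanoTime() - start;
        }
        return nanos;
    }

    private static volatile double sink; // keeps the dummy work from being optimized away

    public static void main(String[] args) {
        // Assumption: 2500 images at batch size 512 -> four full batches plus one of 452.
        int[] sizes = {512, 512, 512, 512, 452};
        IntConsumer dummyInference = n -> {
            double acc = 0;
            for (int i = 0; i < n * 10_000; i++) {
                acc += Math.sqrt(i);
            }
            sink = acc;
        };
        long[] nanos = timeBatches(dummyInference, sizes, 3);
        System.out.printf("avg ms/image: %.4f%n", msPerImage(nanos, sizes));
    }
}
```

Running the same harness against beta3 and the snapshot, with only the DL4J version changed in the pom, should make it easier to tell whether the 38 ms vs 62 ms gap comes from the forward pass itself.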

System Details:
1) Ubuntu 18.04
2) 24c Xeon system (2-socket system with 24c each, but pinned the JVM to 1 socket (24c) using numactl, i.e. numactl -m 0 -N 0; verified it is actually using 1 socket using htop)
3) Used out-of-box frequency
4) JVM options: -Xms29g -Xmx29g -XX:+UseG1GC -XX:ParallelGCThreads=1
5) Other environment options and perf top info: as in https://gist.github.com/dollarHome/b66dd82f5443ad9205abf901c6670dd2
KMP_BLOCK_TIME=0
OMP_WAIT_POLICY=PASSIVE
MKL_THREADING_LAYER=GNU
MKL_NUM_THREADS=24
OMP_DISPLAY_ENV=VERBOSE
OMP_NUM_THREADS=24
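Since the comparison depends on both runs seeing identical settings, it may help to log what the JVM actually sees at the start of each run. The following is a minimal sketch under that assumption; `BenchEnvReport` and `collect` are hypothetical names and are not part of the attached code. It only reads the variables listed above and does not set or validate them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that reports the threading-related settings of a benchmark run. */
final class BenchEnvReport {

    // The environment variables named in the system details above.
    static final String[] VARS = {
        "KMP_BLOCK_TIME", "OMP_WAIT_POLICY", "MKL_THREADING_LAYER",
        "MKL_NUM_THREADS", "OMP_DISPLAY_ENV", "OMP_NUM_THREADS"
    };

    /** Returns each variable with its value, or "<unset>" when it is missing. */
    static Map<String, String> collect(Map<String, String> env) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String name : VARS) {
            out.put(name, env.getOrDefault(name, "<unset>"));
        }
        return out;
    }

    public static void main(String[] args) {
        collect(System.getenv()).forEach((k, v) -> System.out.println(k + "=" + v));
        Runtime rt = Runtime.getRuntime();
        // Processor count and heap limit as reported by this JVM instance.
        System.out.println("availableProcessors=" + rt.availableProcessors());
        System.out.println("maxHeapMiB=" + rt.maxMemory() / (1024 * 1024));
    }
}
```

Comparing this output between the beta3 run and the snapshot run is a cheap way to rule out a configuration difference before looking at the library itself.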

PretrianedClassificationCode.zip