ctuning / ck-openvino

Collective Knowledge workflows for OpenVINO Toolkit (Deep Learning Deployment Toolkit)
BSD 3-Clause "New" or "Revised" License
7 stars 7 forks source link

Collective Knowledge workflows for OpenVINO toolkit

All CK components can be found at cKnowledge.io and in one GitHub repository!

This project is hosted by the cTuning foundation.

MLPerf Inference v0.5 demo

Download and run a Docker image built from reusable CK components:

$ docker pull ctuning/mlperf-inference-v0.5.openvino:ubuntu-18.04
$ docker run  ctuning/mlperf-inference-v0.5.openvino:ubuntu-18.04
...
accuracy=75.600%, good=378, total=500

The source Dockerfile should be self-explanatory.

Contact info@dividiti.com for more information.

Benchmarking quantized SSD-MobileNet

We compare performance for the symmetrically quantized SSD-MobileNet-v1 model on a HP Z640 G1X62EA workstation with ten-core Intel Xeon CPU E5-2650 v3 @ 2.30GHz using:

OpenVINO

$ docker pull ctuning/mlperf-inference-v0.5.openvino:ubuntu-18.04

Accuracy on the COCO 2017 validation set

$ export NPROCS=`grep -c processor /proc/cpuinfo` && docker run --rm ctuning/mlperf-inference-v0.5.openvino:ubuntu-18.04 \
  "ck run ck-openvino:program:mlperf-inference-v0.5 --cmd_key=object-detection --skip_print_timers \
  --env.CK_LOADGEN_SCENARIO=Offline --env.CK_LOADGEN_MODE=Accuracy --env.CK_LOADGEN_DATASET_SIZE=5000 \
  --env.CK_OPENVINO_NTHREADS=$NPROCS --env.CK_OPENVINO_NSTREAMS=$NPROCS --env.CK_OPENVINO_NIREQ=$NPROCS \
  && cat /home/dvdt/CK_REPOS/ck-openvino/program/mlperf-inference-v0.5/tmp/accuracy.txt"
...
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.226
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.347
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.246
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.018
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.158
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.520
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.205
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.257
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.258
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.023
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.183
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.594
mAP=22.617%

Performance (with all virtual CPU cores)

$ export NPROCS=`grep -c processor /proc/cpuinfo` && docker run --rm ctuning/mlperf-inference-v0.5.openvino:ubuntu-18.04 \
  "ck run ck-openvino:program:mlperf-inference-v0.5 --cmd_key=object-detection --skip_print_timers \
  --env.CK_LOADGEN_SCENARIO=Offline --env.CK_LOADGEN_MODE=Performance --env.CK_LOADGEN_TARGET_QPS=7 \
  --env.CK_OPENVINO_NTHREADS=$NPROCS --env.CK_OPENVINO_NSTREAMS=$NPROCS --env.CK_OPENVINO_NIREQ=$NPROCS \
  && echo '--------------------------------------------------------------------------------' \
  && echo 'mlperf_log_summary.txt' \
  && echo '--------------------------------------------------------------------------------' \
  && cat  /home/dvdt/CK_REPOS/ck-openvino/program/mlperf-inference-v0.5/tmp/mlperf_log_summary.txt \
  && echo '--------------------------------------------------------------------------------'" 
...
================================================
MLPerf Results Summary
================================================
SUT name : SUT
Scenario : Offline
Mode     : Performance
Samples per second: 260.554
Result is : INVALID
  Min duration satisfied : NO
  Min queries satisfied : Yes
Recommendations:
 * Increase expected QPS so the loadgen pre-generates a larger (coalesced) query.

(We set CK_LOADGEN_TARGET_QPS=7 to workaround a known issue with Intel's code leading to a segfault.)

Performance (with all physical CPU cores)

$ export NPROCS=$((`grep -c processor /proc/cpuinfo` / 2)) && docker run --rm ctuning/mlperf-inference-v0.5.openvino:ubuntu-18.04 \
  "ck run ck-openvino:program:mlperf-inference-v0.5 --cmd_key=object-detection --skip_print_timers \
  --env.CK_LOADGEN_SCENARIO=Offline --env.CK_LOADGEN_MODE=Performance --env.CK_LOADGEN_TARGET_QPS=7 \
  --env.CK_OPENVINO_NTHREADS=$NPROCS --env.CK_OPENVINO_NSTREAMS=$NPROCS --env.CK_OPENVINO_NIREQ=$NPROCS \
  && echo '--------------------------------------------------------------------------------' \
  && echo 'mlperf_log_summary.txt' \
  && echo '--------------------------------------------------------------------------------' \
  && cat  /home/dvdt/CK_REPOS/ck-openvino/program/mlperf-inference-v0.5/tmp/mlperf_log_summary.txt \
  && echo '--------------------------------------------------------------------------------'" 
...
================================================
MLPerf Results Summary
================================================
SUT name : SUT
Scenario : Offline
Mode     : Performance
Samples per second: 235.173
Result is : INVALID
  Min duration satisfied : NO
  Min queries satisfied : Yes
Recommendations:
 * Increase expected QPS so the loadgen pre-generates a larger (coalesced) query.

MKL-optimized TensorFlow 2.0

$ docker pull ctuning/mlperf-inference-vision-with-ck.intel.ubuntu-18.04:2.0.1-mkl-py3

Accuracy on the COCO 2017 validation set

$ export NPROCS=$((`grep -c processor /proc/cpuinfo` / 2)) && \
  docker run ctuning/mlperf-inference-vision-with-ck.intel.ubuntu-18.04:2.0.1-mkl-py3 \
  "ck run program:mlperf-inference-vision --cmd_key=direct --skip_print_timers \
  --env.CK_LOADGEN_SCENARIO=Offline --env.CK_LOADGEN_MODE='--accuracy' \
  --env.CK_LOADGEN_EXTRA_PARAMS='--count 5000 --max-batchsize 1' \
  --env.CK_LOADGEN_REF_PROFILE=default_tf_object_det_zoo \
  --dep_add_tags.weights=ssd,mobilenet-v1,mlperf,quantized"
  --env.KMP_SETTINGS=1 --env.KMP_BLOCKTIME=1 --env.KMP_AFFINITY='granularity=fine,compact,1,0' \
  --env.OMP_NUM_THREADS=$NPROCS"
...
INFO:main:Namespace(accuracy=True, backend='tensorflow', backend_params='BATCH_SIZE:1,TENSORRT_PRECISION:FP32,TENSORRT_DYNAMIC:0', cache=0, config='/home/dvdt/CK_TOOLS/mlperf-inference-upstream.master/inference/v0.5/mlperf.conf', count=5000, data_format=None, dataset='coco-standard', dataset_height=300, dataset_list=None, dataset_path='/home/dvdt/CK_TOOLS/dataset-coco-2017-val', dataset_width=300, find_peak_performance=False, inputs=['image_tensor:0'], max_batchsize=1, max_latency=None, model='/home/dvdt/CK_TOOLS/model-tf-mlperf-ssd-mobilenet-quantized-finetuned/graph.pb', model_name='${CK_ENV_TENSORFLOW_MODEL_MODEL_NAME}', output=None, outputs=['num_detections:0', 'detection_boxes:0', 'detection_scores:0', 'detection_classes:0'], profile='default_tf_object_det_zoo', qps=None, samples_per_query=None, scenario='Offline', threads=20, time=None)
INFO:coco:loaded 5000 images, cache=0, took=38.3sec
2020-06-04 12:18:32.395924: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  AVX2 FMA
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2020-06-04 12:18:32.402349: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2294550000 Hz
2020-06-04 12:18:32.403571: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5158100 executing computations on platform Host. Devices:
2020-06-04 12:18:32.403589: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
2020-06-04 12:18:32.404598: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
INFO:main:starting TestScenario.Offline
...
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.235
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.362
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.257
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.019
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.166
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.545
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.212
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.267
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.268
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.024
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.191
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.617
TestScenario.Offline qps=123.63, mean=4.3781, time=40.444, acc=94.0344%, mAP=23.538530905017076%, queries=5000, tiles=50.0:4.5576,80.0:5.4431,90.0:5.6411,95.0:5.7654,99.0:6.0230,99.9:6.2053

Performance with the default parameters

$ docker run ctuning/mlperf-inference-vision-with-ck.intel.ubuntu-18.04:2.0.1-mkl-py3 \
  "ck run program:mlperf-inference-vision --cmd_key=direct --skip_print_timers \
  --env.CK_LOADGEN_SCENARIO=Offline --env.CK_LOADGEN_MODE='' \
  --env.CK_LOADGEN_EXTRA_PARAMS='--count 462' \
  --env.CK_LOADGEN_REF_PROFILE=default_tf_object_det_zoo \
  --dep_add_tags.weights=ssd,mobilenet-v1,mlperf,quantized"
...
2020-06-04 11:44:44.287762: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  AVX2 FMA
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2020-06-04 11:44:44.294142: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2294550000 Hz
2020-06-04 11:44:44.295588: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x8474510 executing computations on platform Host. Devices:
2020-06-04 11:44:44.295606: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
2020-06-04 11:44:44.296235: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
INFO:main:starting TestScenario.Offline
TestScenario.Offline qps=26.31, mean=10.1521, time=17.561, queries=462, tiles=50.0:9.7158,80.0:16.2235,90.0:16.3415,95.0:17.5336,99.0:17.5336,99.9:17.5336

Performance with all virtual CPU cores

$ export NPROCS=`grep -c processor /proc/cpuinfo` && \
  docker run ctuning/mlperf-inference-vision-with-ck.intel.ubuntu-18.04:2.0.1-mkl-py3 \
  "ck run program:mlperf-inference-vision --cmd_key=direct --skip_print_timers \
  --env.CK_LOADGEN_SCENARIO=Offline --env.CK_LOADGEN_MODE='' \
  --env.CK_LOADGEN_EXTRA_PARAMS='--count 462' \
  --env.CK_LOADGEN_REF_PROFILE=default_tf_object_det_zoo \
  --dep_add_tags.weights=ssd,mobilenet-v1,mlperf,quantized \
  --env.KMP_SETTINGS=1 --env.KMP_BLOCKTIME=1 --env.KMP_AFFINITY='granularity=fine,compact,1,0' \
  --env.OMP_NUM_THREADS=$NPROCS"
...
2020-06-04 11:55:31.771622: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  AVX2 FMA
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.

User settings:

   KMP_AFFINITY=granularity=fine,compact,1,0
   KMP_BLOCKTIME=1
   KMP_SETTINGS=1
   OMP_NUM_THREADS=20
...
2020-06-04 11:55:31.778172: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2294550000 Hz
2020-06-04 11:55:31.779562: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x846b2b0 executing computations on platform Host. Devices:
2020-06-04 11:55:31.779580: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
2020-06-04 11:55:31.780533: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
INFO:main:starting TestScenario.Offline
TestScenario.Offline qps=34.74, mean=7.6564, time=13.300, queries=462, tiles=50.0:7.4600,80.0:11.9973,90.0:12.9363,95.0:13.2779,99.0:13.2779,99.9:13.2779

Performance with all physical CPU cores

$ export NPROCS=$((`grep -c processor /proc/cpuinfo` / 2)) && \
  docker run ctuning/mlperf-inference-vision-with-ck.intel.ubuntu-18.04:2.0.1-mkl-py3 \
  "ck run program:mlperf-inference-vision --cmd_key=direct --skip_print_timers \
  --env.CK_LOADGEN_SCENARIO=Offline --env.CK_LOADGEN_MODE='' \
  --env.CK_LOADGEN_EXTRA_PARAMS='--count 462' \
  --env.CK_LOADGEN_REF_PROFILE=default_tf_object_det_zoo \
  --dep_add_tags.weights=ssd,mobilenet-v1,mlperf,quantized \
  --env.KMP_SETTINGS=1 --env.KMP_BLOCKTIME=1 --env.KMP_AFFINITY='granularity=fine,compact,1,0' \
  --env.OMP_NUM_THREADS=$NPROCS"
...
2020-06-04 11:52:06.230481: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  AVX2 FMA
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.

User settings:

   KMP_AFFINITY=granularity=fine,compact,1,0
   KMP_BLOCKTIME=1
   KMP_SETTINGS=1
   OMP_NUM_THREADS=10
...
2020-06-04 11:52:06.237142: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2294550000 Hz
2020-06-04 11:52:06.238523: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f15070 executing computations on platform Host. Devices:
2020-06-04 11:52:06.238551: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
2020-06-04 11:52:06.239762: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
INFO:main:starting TestScenario.Offline
TestScenario.Offline qps=44.04, mean=6.1680, time=10.490, queries=462, tiles=50.0:6.0186,80.0:9.5817,90.0:10.2260,95.0:10.4660,99.0:10.4660,99.9:10.4660

MKL-optimized (?) TensorFlow 2.1

$ docker pull ctuning/mlperf-inference-vision-with-ck.intel.ubuntu-18.04:2.1.0-mkl-py3

Accuracy on the COCO 2017 validation set

$ export NPROCS=$((`grep -c processor /proc/cpuinfo` / 2)) && \
  docker run ctuning/mlperf-inference-vision-with-ck.intel.ubuntu-18.04:2.1.0-mkl-py3 \
  "ck run program:mlperf-inference-vision --cmd_key=direct --skip_print_timers \
  --env.CK_LOADGEN_SCENARIO=Offline --env.CK_LOADGEN_MODE='--accuracy' \
  --env.CK_LOADGEN_EXTRA_PARAMS='--count 5000 --max-batchsize 1' \
  --env.CK_LOADGEN_REF_PROFILE=default_tf_object_det_zoo \
  --dep_add_tags.weights=ssd,mobilenet-v1,mlperf,quantized"
  --env.KMP_SETTINGS=1 --env.KMP_BLOCKTIME=1 --env.KMP_AFFINITY='granularity=fine,compact,1,0' \
  --env.OMP_NUM_THREADS=$NPROCS"
...
INFO:main:Namespace(accuracy=True, backend='tensorflow', backend_params='BATCH_SIZE:1,TENSORRT_PRECISION:FP32,TENSORRT_DYNAMIC:0', cache=0, config='/home/dvdt/CK_TOOLS/mlperf-inference-upstream.master/inference/v0.5/mlperf.conf', count=5000, data_format=None, dataset='coco-standard', dataset_height=300, dataset_list=None, dataset_path='/home/dvdt/CK_TOOLS/dataset-coco-2017-val', dataset_width=300, find_peak_performance=False, inputs=['image_tensor:0'], max_batchsize=1, max_latency=None, model='/home/dvdt/CK_TOOLS/model-tf-mlperf-ssd-mobilenet-quantized-finetuned/graph.pb', model_name='${CK_ENV_TENSORFLOW_MODEL_MODEL_NAME}', output=None, outputs=['num_detections:0', 'detection_boxes:0', 'detection_scores:0', 'detection_classes:0'], profile='default_tf_object_det_zoo', qps=None, samples_per_query=None, scenario='Offline', threads=20, time=None)
INFO:coco:loaded 5000 images, cache=0, took=93.7sec
2020-06-04 13:07:12.428599: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-06-04 13:07:13.269997: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2294550000 Hz
2020-06-04 13:07:13.282636: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x8ec4570 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-06-04 13:07:13.282681: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-06-04 13:07:13.294754: I tensorflow/core/common_runtime/process_util.cc:147] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
INFO:main:starting TestScenario.Offline
...
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.236
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.362
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.258
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.019
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.166
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.544
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.212
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.267
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.268
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.025
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.191
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.618
TestScenario.Offline qps=129.41, mean=4.1776, time=38.637, acc=94.0231%, mAP=23.566332815709149%, queries=5000, tiles=50.0:4.2399,80.0:5.2161,90.0:5.4090,95.0:5.5132,99.0:5.6607,99.9:5.7809

Performance with the default parameters

$ docker run ctuning/mlperf-inference-vision-with-ck.intel.ubuntu-18.04:2.1.0-mkl-py3 \
  "ck run program:mlperf-inference-vision --cmd_key=direct --skip_print_timers \
  --env.CK_LOADGEN_SCENARIO=Offline --env.CK_LOADGEN_MODE='' \
  --env.CK_LOADGEN_EXTRA_PARAMS='--count 462' \
  --env.CK_LOADGEN_REF_PROFILE=default_tf_object_det_zoo \
  --dep_add_tags.weights=ssd,mobilenet-v1,mlperf,quantized"
...
2020-06-04 13:14:03.208913: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-06-04 13:14:03.215571: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2294550000 Hz
2020-06-04 13:14:03.216924: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x43f0270 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-06-04 13:14:03.216942: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-06-04 13:14:03.217047: I tensorflow/core/common_runtime/process_util.cc:147] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
INFO:main:starting TestScenario.Offline
TestScenario.Offline qps=26.05, mean=10.1722, time=17.736, queries=462, tiles=50.0:10.3504,80.0:15.4264,90.0:17.6162,95.0:17.7094,99.0:17.7094,99.9:17.7094

Performance with all virtual CPU cores

$ export NPROCS=`grep -c processor /proc/cpuinfo` && \
  docker run ctuning/mlperf-inference-vision-with-ck.intel.ubuntu-18.04:2.1.0-mkl-py3 \
  "ck run program:mlperf-inference-vision --cmd_key=direct --skip_print_timers \
  --env.CK_LOADGEN_SCENARIO=Offline --env.CK_LOADGEN_MODE='' \
  --env.CK_LOADGEN_EXTRA_PARAMS='--count 462' \
  --env.CK_LOADGEN_REF_PROFILE=default_tf_object_det_zoo \
  --dep_add_tags.weights=ssd,mobilenet-v1,mlperf,quantized \
  --env.KMP_SETTINGS=1 --env.KMP_BLOCKTIME=1 --env.KMP_AFFINITY='granularity=fine,compact,1,0' \
  --env.OMP_NUM_THREADS=$NPROCS"
...
2020-06-04 13:15:27.034484: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA

User settings:

   KMP_AFFINITY=granularity=fine,compact,1,0
   KMP_BLOCKTIME=1
   KMP_SETTINGS=1
   OMP_NUM_THREADS=20
...
2020-06-04 13:15:27.041504: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2294550000 Hz
2020-06-04 13:15:27.043613: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7a66ab0 initialized for platform Host (this does not guarantee tha
t XLA will be used). Devices:
2020-06-04 13:15:27.043649: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-06-04 13:15:27.043776: I tensorflow/core/common_runtime/process_util.cc:147] Creating new thread pool with default inter op setting: 2. Tune using inter_
op_parallelism_threads for best performance.
INFO:main:starting TestScenario.Offline
TestScenario.Offline qps=33.82, mean=8.0507, time=13.659, queries=462, tiles=50.0:7.9636,80.0:12.4265,90.0:13.3861,95.0:13.6354,99.0:13.6354,99.9:13.6354

Performance with all physical CPU cores

$ export NPROCS=$((`grep -c processor /proc/cpuinfo` / 2)) && \
  docker run ctuning/mlperf-inference-vision-with-ck.intel.ubuntu-18.04:2.1.0-mkl-py3 \
  "ck run program:mlperf-inference-vision --cmd_key=direct --skip_print_timers \
  --env.CK_LOADGEN_SCENARIO=Offline --env.CK_LOADGEN_MODE='' \
  --env.CK_LOADGEN_EXTRA_PARAMS='--count 462' \
  --env.CK_LOADGEN_REF_PROFILE=default_tf_object_det_zoo \
  --dep_add_tags.weights=ssd,mobilenet-v1,mlperf,quantized \
  --env.KMP_SETTINGS=1 --env.KMP_BLOCKTIME=1 --env.KMP_AFFINITY='granularity=fine,compact,1,0' \
  --env.OMP_NUM_THREADS=$NPROCS"
...
2020-06-04 13:16:54.080275: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA

User settings:

   KMP_AFFINITY=granularity=fine,compact,1,0
   KMP_BLOCKTIME=1
   KMP_SETTINGS=1
   OMP_NUM_THREADS=10
...
2020-06-04 13:16:54.087292: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2294550000 Hz
2020-06-04 13:16:54.088427: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x6765560 initialized for platform Host (this does not guarantee tha
t XLA will be used). Devices:
2020-06-04 13:16:54.088444: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-06-04 13:16:54.088520: I tensorflow/core/common_runtime/process_util.cc:147] Creating new thread pool with default inter op setting: 2. Tune using inter_
op_parallelism_threads for best performance.
INFO:main:starting TestScenario.Offline
TestScenario.Offline qps=44.26, mean=6.1474, time=10.438, queries=462, tiles=50.0:6.0423,80.0:9.4969,90.0:10.1914,95.0:10.4173,99.0:10.4173,99.9:10.4173

NB: Note that the performance with the 2.1.0-mkl-py3-based image is the same as with the 2.0.1-mkl-py3-based image despite the scary message:

Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA

MKL is actually enabled, as can be confirmed by running:

$ docker run ctuning/mlperf-inference-vision-with-ck.intel.ubuntu-18.04:2.1.0-mkl-py3 \
"python -c 'from tensorflow.python import _pywrap_util_port; print(_pywrap_util_port.IsMklEnabled())'"
True