apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

performance degradation in model inference from 1.3.1 to 1.4.0 #14569

Open apeforest opened 5 years ago

apeforest commented 5 years ago

There seems to be a regression in resnet-18 model inference time (when running on GPU) after this PR. This was caught in MMS nightly runs; the changes in this PR appear to be causing the issue.

Setup

We use MMS docker images to run load tests. A local container can be started using the following command:

nvidia-docker run --name mms_benchmark_gpu -p 8080:8080 -p 8081:8081  -itd awsdeeplearningteam/mxnet-model-server:nightly-mxnet-gpu
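
Before testing, you can confirm the container is healthy. A minimal sketch using Python requests against the MMS health-check endpoint (port mapping as in the command above):

# check_mms.py -- quick health check for the local MMS container
import requests

# Inference API is mapped to host port 8080 in the docker run command above
resp = requests.get("http://127.0.0.1:8080/ping", timeout=5)
resp.raise_for_status()  # a healthy server responds with HTTP 200
print(resp.text)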

OpenCV 3.2 and CUDA 9.2 were used to build MXNet.

Load testing was done using Locust. To install Locust:

pip install locust

Download the test image

curl -O https://s3.amazonaws.com/model-server/inputs/kitten.jpg

The Locust script for load testing:

# test_resnet_18.py
import os

from locust import HttpLocust, TaskSet, task

# Read the test image once at module load so every request reuses the same payload
with open(os.path.join(os.getcwd(), 'kitten.jpg'), 'rb') as f:
    data = f.read()

class PredictionTasks(TaskSet):
    @task
    def inference(self):
        self.client.post("/predictions/resnet-18", data=data,
                         headers={'Content-Type': 'image/jpeg'})

class Prediction(HttpLocust):
    task_set = PredictionTasks
    min_wait = 100
    max_wait = 100
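
As a sanity check independent of Locust, here is a minimal sketch (a hypothetical helper, not part of the original setup) that measures per-request latency for the same endpoint with the requests library, assuming the container is up and kitten.jpg is in the working directory:

# latency_check.py -- rough single-client latency measurement (sanity check only)
import statistics
import time

import requests

URL = "http://127.0.0.1:8080/predictions/resnet-18"
with open("kitten.jpg", "rb") as f:
    payload = f.read()

latencies = []
for _ in range(100):
    start = time.perf_counter()
    r = requests.post(URL, data=payload, headers={"Content-Type": "image/jpeg"})
    r.raise_for_status()
    latencies.append((time.perf_counter() - start) * 1000.0)  # ms

latencies.sort()
print("median %.1f ms, p95 %.1f ms" % (
    statistics.median(latencies), latencies[int(0.95 * len(latencies))]))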

Running the load test

Registering and loading the model

# Register and load the resnet-18 model archive
$ curl -X POST "http://127.0.0.1:8081/models?url=https://s3.amazonaws.com/model-server/model_archive_1.0/resnet-18.mar"

Start a single worker and run latency test

# Start worker and run the latency test
$ curl -X PUT 'http://127.0.0.1:8081/models/resnet-18?min_worker=1&synchronous=true'
$ locust -f test_resnet_18.py Prediction --host=http://127.0.0.1:8080 --no-web -c 1 -r 1 -t 20s --only-summary

To change the MXNet version/build in the docker image:

NOTE: by default, the most recent pip build is installed.

# Go into the docker container
$ nvidia-docker exec -u root -it mms_benchmark_gpu bash
$ pip uninstall mxnet-cu92mkl
$ pip install <new-build>.whl
# Press Ctrl+p then Ctrl+q to detach from the container
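
After swapping the wheel, it is worth confirming inside the container which build a fresh worker will actually load. A minimal sketch (the small GPU allocation just forces CUDA initialization):

# verify_build.py -- confirm the installed MXNet version and that the GPU works
import mxnet as mx

print(mx.__version__)                 # should match the newly installed wheel
x = mx.nd.zeros((1,), ctx=mx.gpu(0))  # raises MXNetError if CUDA is unusable
x.wait_to_read()                      # block until the async engine finishes
print("GPU context OK:", x.context)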

# Destroy the existing worker and create a new one; this loads the newly installed mxnet
$ curl -X PUT 'http://127.0.0.1:8081/models/resnet-18?min_worker=0&synchronous=true'
$ curl -X PUT 'http://127.0.0.1:8081/models/resnet-18?min_worker=1&synchronous=true'
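
The same worker recycle can also be scripted against the management API on port 8081, which is handy when bisecting builds. A sketch using the endpoints from the curl commands above:

# recycle_worker.py -- scale resnet-18 workers to 0 and back to 1 so a fresh
# worker process picks up the newly installed MXNet build
import requests

MGMT = "http://127.0.0.1:8081/models/resnet-18"

for n in (0, 1):
    r = requests.put(MGMT, params={"min_worker": n, "synchronous": "true"})
    r.raise_for_status()
    print("min_worker=%d -> HTTP %d" % (n, r.status_code))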

Results

On mxnet-cu92==1.3.0post0

# locust result
 Name                                                          # reqs      # fails     Avg     Min     Max  |  Median   req/s
--------------------------------------------------------------------------------------------------------------------------------------------
 POST /predictions/resnet-18                                      152     0(0.00%)      31      30      39  |      31    7.60
--------------------------------------------------------------------------------------------------------------------------------------------
 Total                                                            152     0(0.00%)                                       7.60

Percentage of the requests completed within given times
 Name                                                           # reqs    50%    66%    75%    80%    90%    95%    98%    99%   100%
--------------------------------------------------------------------------------------------------------------------------------------------
 POST /predictions/resnet-18                                       152     31     31     31     31     32     33     33     34     280
--------------------------------------------------------------------------------------------------------------------------------------------
 Total                                                             152     31     31     31     31     32     33     33     34     280

On mxnet-cu92 with commit https://github.com/apache/incubator-mxnet/commit/f9f74169bb05f85d85dec5991aa5fc9050dec9f6

 Name                                                          # reqs      # fails     Avg     Min     Max  |  Median   req/s
--------------------------------------------------------------------------------------------------------------------------------------------
 POST /predictions/resnet-18                                      141     0(0.00%)      41      37     337  |      38    7.20
--------------------------------------------------------------------------------------------------------------------------------------------
 Total                                                            141     0(0.00%)                                       7.20

Percentage of the requests completed within given times
 Name                                                           # reqs    50%    66%    75%    80%    90%    95%    98%    99%   100%
--------------------------------------------------------------------------------------------------------------------------------------------
 POST /predictions/resnet-18                                       141     38     39     39     40     40     42     49     49    340
--------------------------------------------------------------------------------------------------------------------------------------------
 Total                                                             141     38     39     39     40     40     42     49     49    340

This regression thus carries over to 1.3.1.

Based on the above results, there is roughly a 30% increase in latency/inference time for resnet-18 (average 31 ms → 41 ms; median 31 ms → 38 ms).
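
For reference, a quick check of those numbers against the two summaries above:

# Percentage increase, read straight off the two locust summaries above
old_avg, new_avg = 31, 41  # ms, mxnet-cu92==1.3.0post0 vs. patched build
old_med, new_med = 31, 38  # ms

print("avg    +%.0f%%" % (100.0 * (new_avg - old_avg) / old_avg))  # -> +32%
print("median +%.0f%%" % (100.0 * (new_med - old_med) / old_med))  # -> +23%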

mxnet-label-bot commented 5 years ago

Hey, this is the MXNet Label Bot. Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it. Here are my recommended labels: Performance