No speed increase when converting model to TVM

I am trying to convert an MXNet model to TVM in order to improve inference speed. The conversion succeeds, but I do not see the speed improvements advertised on this page.

I have followed the tutorial here, but I will go through the steps I took.

I first downloaded the Insightface model LResNet100E-IR,ArcFace@ms1m-refine-v2 which can be found here. Note that I am using the same model from the TVM benchmark.

Next, I use the following Python script to convert the model to the TVM-compatible models. Note that when I run the command llc --version I get the following output, which is why I set the target to skylake:

LLVM (http://llvm.org/):
  LLVM version 6.0.0

  Optimized build.
  Default target: x86_64-pc-linux-gnu
  Host CPU: skylake

Python conversion script

from tvm.contrib import graph_runtime
import mxnet as mx
from mxnet import ndarray as nd
import nnvm.compiler
import nnvm.testing
import tvm

prefix,epoch = "/home/nchafni/Cyrus/models/faceDetection/Insightface/model-r100-ii/model",0
sym, arg_params, aux_params = mx.model.load_checkpoint(prefix, epoch)
opt_level = 3

shape_dict = {'data': (1, 3, 112, 112)}
target = tvm.target.create("llvm -mcpu=skylake")
#target = tvm.target.intel_graphics()
nnvm_sym, nnvm_params = nnvm.frontend.from_mxnet(sym, arg_params, aux_params)
with nnvm.compiler.build_config(opt_level=opt_level):
   graph, lib, params = nnvm.compiler.build(nnvm_sym, target, shape_dict, params=nnvm_params)
lib.export_library("./deploy_lib.so")
print('lib export succeefully')
with open("./deploy_graph.json", "w") as fo:
   fo.write(graph.json())
with open("./deploy_param.params", "wb") as fo:
   fo.write(nnvm.compiler.save_param_dict(params))

When I run the script, I get the following warning messages:

Cannot find config for target=llvm -mcpu=skylake, workload=('conv2d', (1, 3, 112, 112, 'float32'), (64, 3, 3, 3, 'float32'), (1, 1), (1, 1), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -mcpu=skylake, workload=('conv2d', (1, 64, 112, 112, 'float32'), (64, 64, 3, 3, 'float32'), (1, 1), (1, 1), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -mcpu=skylake, workload=('conv2d', (1, 64, 112, 112, 'float32'), (64, 64, 3, 3, 'float32'), (2, 2), (1, 1), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -mcpu=skylake, workload=('conv2d', (1, 64, 112, 112, 'float32'), (64, 64, 1, 1, 'float32'), (2, 2), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -mcpu=skylake, workload=('conv2d', (1, 64, 56, 56, 'float32'), (128, 64, 3, 3, 'float32'), (1, 1), (1, 1), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -mcpu=skylake, workload=('conv2d', (1, 128, 56, 56, 'float32'), (128, 128, 3, 3, 'float32'), (2, 2), (1, 1), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -mcpu=skylake, workload=('conv2d', (1, 128, 28, 28, 'float32'), (256, 128, 3, 3, 'float32'), (1, 1), (1, 1), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -mcpu=skylake, workload=('conv2d', (1, 256, 28, 28, 'float32'), (256, 256, 3, 3, 'float32'), (2, 2), (1, 1), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -mcpu=skylake, workload=('conv2d', (1, 256, 14, 14, 'float32'), (512, 256, 3, 3, 'float32'), (1, 1), (1, 1), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -mcpu=skylake, workload=('conv2d', (1, 512, 14, 14, 'float32'), (512, 512, 3, 3, 'float32'), (2, 2), (1, 1), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=llvm -mcpu=skylake, workload=('dense', (1, 25088, 'float32'), (512, 25088, 'float32'), (512, 'float32'), 0). A fallback configuration is used, which may bring great performance regression.
lib export succeefully

Despite these warnings, the models are ultimately exported successfully.
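To see which layers fall back to untuned schedules, the warnings can be parsed mechanically. The following is only a rough sketch, not part of TVM: the helper names (parse_workloads, conv2d_macs, dense_macs) are hypothetical, and it assumes the warning text has exactly the layout shown above, with NCHW conv2d workloads. Each workload appears once in the warnings no matter how often the network uses it, so the multiply-accumulate counts only indicate the relative cost of a single call.

```python
import ast
import re

# Matches the workload tuple inside a "Cannot find config" warning line.
WORKLOAD_RE = re.compile(r"workload=(\(.*\))\. A fallback configuration")


def parse_workloads(log_text):
    """Return the workload tuples named in fallback warnings, in log order."""
    workloads = []
    for line in log_text.splitlines():
        match = WORKLOAD_RE.search(line)
        if match:
            # The tuple is printed as a Python literal, so literal_eval is enough.
            workloads.append(ast.literal_eval(match.group(1)))
    return workloads


def conv2d_macs(workload):
    """Multiply-accumulates for one call of an NCHW conv2d workload tuple."""
    _, data, kernel, strides, padding, dilation, _layout, _dtype = workload
    n, _c, h, w = data[:4]
    out_ch, in_ch, kh, kw = kernel[:4]
    out_h = (h + 2 * padding[0] - dilation[0] * (kh - 1) - 1) // strides[0] + 1
    out_w = (w + 2 * padding[1] - dilation[1] * (kw - 1) - 1) // strides[1] + 1
    return n * out_ch * out_h * out_w * in_ch * kh * kw


def dense_macs(workload):
    """Multiply-accumulates for one call of a dense workload tuple."""
    _, data, weight = workload[:3]
    return data[0] * weight[0] * weight[1]


if __name__ == "__main__":
    sample = (
        "Cannot find config for target=llvm -mcpu=skylake, "
        "workload=('conv2d', (1, 3, 112, 112, 'float32'), "
        "(64, 3, 3, 3, 'float32'), (1, 1), (1, 1), (1, 1), 'NCHW', 'float32'). "
        "A fallback configuration is used, which may bring great performance regression."
    )
    for wl in parse_workloads(sample):
        macs = conv2d_macs(wl) if wl[0] == "conv2d" else dense_macs(wl)
        print(wl[0], macs)
```

Every workload listed this way needs a matching entry in the tuning log before the fallback warning goes away for it.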

Next, I import the converted models and deploy_lib.so into my C++ project using the following code. Most of it is taken from the example on this page.

#include <chrono>
#include <cstring>   // memcpy
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
#include "opencv2/opencv.hpp"
#include "tvm/runtime/module.h"
#include "tvm/runtime/registry.h"
#include "tvm/runtime/packed_func.h"

typedef std::chrono::high_resolution_clock Clock;

class FR_MFN_Deploy{
private:
    void * handle;

public:
    FR_MFN_Deploy(std::string modelFolder)
    {
        tvm::runtime::Module mod_syslib = tvm::runtime::Module::LoadFromFile("/home/nchafni/Cyrus/tvm_test/lib/deploy_lib.so");
        //load graph
        std::string modelPath = modelFolder + "/deploy_graph.json";
        std::ifstream json_in(modelPath);
        std::string json_data((std::istreambuf_iterator<char>(json_in)), std::istreambuf_iterator<char>());
        json_in.close();
        int device_type = kDLCPU;
        int device_id = 0;
        // get global function module for graph runtime
        tvm::runtime::Module mod = (*tvm::runtime::Registry::Get("tvm.graph_runtime.create"))(json_data, mod_syslib, device_type, device_id);
        this->handle = new tvm::runtime::Module(mod);
        //load param
        std::ifstream params_in(modelFolder + "/deploy_param.params", std::ios::binary);
        std::string params_data((std::istreambuf_iterator<char>(params_in)), std::istreambuf_iterator<char>());
        params_in.close();
        TVMByteArray params_arr;
        params_arr.data = params_data.c_str();
        params_arr.size = params_data.length();
        tvm::runtime::PackedFunc load_params = mod.GetFunction("load_params");
        load_params(params_arr);
    }

    cv::Mat forward(cv::Mat inputImageAligned)
    {
        //mobilefacnet preprocess has been written in graph.
        cv::Mat tensor = cv::dnn::blobFromImage(inputImageAligned,1.0,cv::Size(112,112),cv::Scalar(0,0,0),true);
        //convert uint8 to float32 and convert to RGB via opencv dnn function
        DLTensor* input;
        constexpr int dtype_code = kDLFloat;
        constexpr int dtype_bits = 32;
        constexpr int dtype_lanes = 1;
        constexpr int device_type = kDLCPU;
        constexpr int device_id = 0;
        constexpr int in_ndim = 4;
        const int64_t in_shape[in_ndim] = {1, 3, 112, 112};
        TVMArrayAlloc(in_shape, in_ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &input);//
        TVMArrayCopyFromBytes(input,tensor.data,112*3*112*4);
        tvm::runtime::Module* mod = (tvm::runtime::Module*)handle;
        tvm::runtime::PackedFunc set_input = mod->GetFunction("set_input");
        set_input("data", input);
        tvm::runtime::PackedFunc run = mod->GetFunction("run");
        run();
        tvm::runtime::PackedFunc get_output = mod->GetFunction("get_output");
        tvm::runtime::NDArray res = get_output(0);
        cv::Mat vector(512,1,CV_32F);
        memcpy(vector.data,res->data,512*4);
        cv::Mat _l2;
        cv::multiply(vector,vector,_l2);
        float l2 =  cv::sqrt(cv::sum(_l2).val[0]);
        vector = vector / l2;
        TVMArrayFree(input);
        return vector;
    }

};

inline float CosineDistance(const cv::Mat &v1,const cv::Mat &v2){
    return static_cast<float>(v1.dot(v2));
}

cv::Mat getTemplate(const std::string& imagePath, FR_MFN_Deploy& deploy) {
    cv::Mat data = cv::imread(imagePath);
    auto time_1 = Clock::now();
    cv::Mat out = deploy.forward(data);
    auto time_2 = Clock::now();
    std::cout << std::to_string(std::chrono::duration_cast<std::chrono::milliseconds>(time_2 - time_1).count()) << std::endl;
    return out;
}

int main() {
    std::cout << "Loading the model" << std::endl;
    FR_MFN_Deploy deploy("../models");
    std::cout << "Loaded model" << std::endl;

    // Different People
//    std::vector<std::string> imagePaths = {
//            "../images/chip17.jpg",
//            "../images/chip18.jpg",
//            "../images/chip19.jpg",
//            "../images/chip20.jpg",
//            "../images/chip21.jpg",
//            "../images/chip22.jpg",
//            "../images/chip23.jpg",
//    };

// Same person
    std::vector<std::string> imagePaths = {
            "../images/chip1.jpg",
            "../images/chip2.jpg",
            "../images/chip3.jpg",
            "../images/chip4.jpg",
            "../images/chip5.jpg",
            "../images/chip6.jpg",
            "../images/chip7.jpg",
            "../images/chip8.jpg",
            "../images/chip9.jpg",
            "../images/chip10.jpg",
            "../images/chip11.jpg",
            "../images/chip12.jpg",
            "../images/chip13.jpg",
            "../images/chip14.jpg",
            "../images/chip15.jpg",
            "../images/chip16.jpg",
    };

    std::vector<cv::Mat> res;
    std::vector<float> scoresVec;

    for (const auto& path: imagePaths) {
        res.emplace_back(getTemplate(path, deploy));
    }

    for (size_t i = 0; i < res.size(); i++) {
        for (size_t k = i + 1; k < res.size(); k++) {
            auto score = CosineDistance(res[i],res[k]);
            if (score < 0) {
                score = 0;
            }
            scoresVec.emplace_back(score);
        }
    }

    double total = 0;
    for (size_t i = 0; i < scoresVec.size(); ++i) {
        total +=  scoresVec[i];
        std::cout << scoresVec[i] << std::endl;
    }

    std::cout << "Total score: " << total << "\n";

    return 0;
}

Note that the images I am providing are pre-aligned and cropped to 112x112.

On average, inference takes 360 ms, which is roughly the same as inference with MXNet (C++, MKL-DNN). I was expecting a significant decrease in inference time.
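The 360 ms figure comes from timing the whole forward() call in getTemplate, which also covers image preprocessing, tensor allocation and the very first invocation. A first call may carry one-time costs, so it can help to discard a few warm-up runs and report a median. Below is a minimal sketch of that idea in plain Python; benchmark and the fn argument are hypothetical names, and fn stands in for whatever callable wraps only the inference step.

```python
import statistics
import time


def benchmark(fn, warmup=3, repeat=20):
    """Time fn() in milliseconds, discarding the first `warmup` calls."""
    for _ in range(warmup):
        fn()  # warm-up runs are executed but not recorded

    samples = []
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)

    return {
        "runs": len(samples),
        "min_ms": min(samples),
        "median_ms": statistics.median(samples),
        "max_ms": max(samples),
    }


if __name__ == "__main__":
    # Stand-in workload; replace with a callable that runs one inference.
    print(benchmark(lambda: sum(range(100000))))
```

Timing the MXNet and TVM paths with the same helper, on the same input, keeps the comparison like for like.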

I am not sure whether the issue is related to the warnings during conversion. I followed the conversion tutorial exactly, and it did not mention needing to tune the model or anything similar.

Here is the output of cat /proc/cpuinfo, to show what hardware I am running the benchmark on:

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 158
model name  : Intel(R) Core(TM) i5-7500T CPU @ 2.70GHz
stepping    : 9
microcode   : 0xb4
cpu MHz     : 1407.008
cache size  : 6144 KB
physical id : 0
siblings    : 4
core id     : 0
cpu cores   : 4
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 22
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs
bogomips    : 5424.00
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:
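The flags line above lists avx2 and fma but not avx512f, which is consistent with the llvm -mcpu=skylake target (LLVM uses the separate name skylake-avx512 for the AVX-512 variant). As a quick sanity check, the relevant flags can be pulled out of the cpuinfo text with a few lines of Python. This is only a sketch: simd_features is a hypothetical helper, and the set of flags it looks for is my own choice.

```python
def simd_features(cpuinfo_text):
    """Report which of a few x86 SIMD flags appear in /proc/cpuinfo text."""
    wanted = ("sse4_2", "avx", "avx2", "fma", "avx512f")
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # Only the plain "flags" entry lists CPU features; other keys are skipped.
        if key.strip() == "flags":
            flags.update(value.split())
    return {name: name in flags for name in wanted}


# Usage on a Linux machine (not executed here):
#     with open("/proc/cpuinfo") as f:
#         print(simd_features(f.read()))
```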

Even after tuning the model with the following script, I still see no improvement in performance.

import os
import numpy as np

import nnvm.testing
import nnvm.compiler
import tvm
import mxnet as mx
from tvm import autotvm
import tvm.relay as relay
from tvm.autotvm.tuner import XGBTuner, GATuner, RandomTuner, GridSearchTuner
import tvm.contrib.graph_runtime as runtime

def get_network(name, batch_size):
    prefix,epoch = "/home/models/faceDetection/model",0
    sym, arg_params, aux_params = mx.model.load_checkpoint(prefix, epoch)
    opt_level = 3
    shape_dict = {'data': (1, 3, 112, 112)}
    nnvm_sym, nnvm_params = nnvm.frontend.from_mxnet(sym, arg_params, aux_params)
    input_shape = (batch_size, 3, 112, 112)
    output_shape = (batch_size, 512)
    return nnvm_sym, nnvm_params, input_shape, output_shape

target = "llvm -mcpu=skylake"

batch_size = 1
dtype = "float32"
model_name = "resnet-18"
log_file = "%s.log" % model_name

num_threads = 1
os.environ["TVM_NUM_THREADS"] = str(num_threads)

tuning_option = {
    'log_filename': log_file,
    'tuner': 'random',
    'early_stopping': None,

    'measure_option': autotvm.measure_option(
        builder=autotvm.LocalBuilder(),
        runner=autotvm.LocalRunner(number=10, repeat=1,
                                   min_repeat_ms=1000),
    ),
}

# You can skip the implementation of this function for this tutorial.
def tune_kernels(tasks,
                 measure_option,
                 tuner='gridsearch',
                 early_stopping=None,
                 log_filename='tuning.log'):

    for i, tsk in enumerate(tasks):
        prefix = "[Task %2d/%2d] " % (i+1, len(tasks))

        # converting conv2d tasks to conv2d_NCHWc tasks
        op_name = tsk.workload[0]
        if op_name == 'conv2d':
            func_create = 'topi_x86_conv2d_NCHWc'
        elif op_name == 'depthwise_conv2d_nchw':
            func_create = 'topi_x86_depthwise_conv2d_NCHWc_from_nchw'
        else:
            raise ValueError("Tuning {} is not supported on x86".format(op_name))

        task = autotvm.task.create(func_create, args=tsk.args,
                                   target=target, template_key='direct')
        task.workload = tsk.workload

        # create tuner
        if tuner == 'xgb' or tuner == 'xgb-rank':
            tuner_obj = XGBTuner(task, loss_type='rank')
        elif tuner == 'ga':
            tuner_obj = GATuner(task, pop_size=50)
        elif tuner == 'random':
            tuner_obj = RandomTuner(task)
        elif tuner == 'gridsearch':
            tuner_obj = GridSearchTuner(task)
        else:
            raise ValueError("Invalid tuner: " + tuner)

        # do tuning
        n_trial=50#len(task.config_space)

        tuner_obj.tune(n_trial=n_trial,
                       early_stopping=early_stopping,
                       measure_option=measure_option,
                       callbacks=[
                           autotvm.callback.progress_bar(n_trial, prefix=prefix),
                           autotvm.callback.log_to_file(log_filename)])

########################################################################
# Finally, we launch tuning jobs and evaluate the end-to-end performance.

def tune_and_evaluate(tuning_opt):
    # extract workloads from nnvm graph
    print("Extract tasks...")
    net, params, data_shape, out_shape = get_network(model_name, batch_size)
    tasks = autotvm.task.extract_from_graph(net, target=target,
                                            shape={'data': data_shape}, dtype=dtype,
                                            symbols=(nnvm.sym.conv2d,))

    # run tuning tasks
    print("Tuning...")
    tune_kernels(tasks, **tuning_opt)

    # compile kernels with history best records
    with autotvm.apply_history_best(log_file):
        print("Compile...")
        with nnvm.compiler.build_config(opt_level=3):
            graph, lib, params = nnvm.compiler.build(
                net, target=target, shape={'data': data_shape}, params=params, dtype=dtype)

        # upload parameters to device
        ctx = tvm.cpu()
        data_tvm = tvm.nd.array((np.random.uniform(size=data_shape)).astype(dtype))
        module = runtime.create(graph, lib, ctx)
        module.set_input('data', data_tvm)
        module.set_input(**params)

        # evaluate
        print("Evaluate inference time cost...")
        ftimer = module.module.time_evaluator("run", ctx, number=100, repeat=3)
        prof_res = np.array(ftimer().results) * 1000  # convert to millisecond
        print("Mean inference time (std dev): %.2f ms (%.2f ms)" %
              (np.mean(prof_res), np.std(prof_res)))

        lib.export_library("./deploy_tuned_lib.so")
        print('lib export succeefully')
        with open("./deploy_tuned_graph.json", "w") as fo:
            fo.write(graph.json())
        with open("./deploy_tuned_param.params", "wb") as fo:
            fo.write(nnvm.compiler.save_param_dict(params))

# We do not run the tuning in our webpage server since it takes too long.
# Uncomment the following line to run it by yourself.

tune_and_evaluate(tuning_option)

######################################################################
# Sample Output
# -------------
# The tuning needs to compile many programs and extract feature from them.
# So a high performance CPU is recommended.
# One sample output is listed below.
#
# .. code-block:: bash
#
#    Extract tasks...
#    Tuning...
#    [Task  1/12]  Current/Best:  598.05/2497.63 GFLOPS | Progress: (252/252) | 1357.95 s Done.
#    [Task  2/12]  Current/Best:  522.63/2279.24 GFLOPS | Progress: (784/784) | 3989.60 s Done.
#    [Task  3/12]  Current/Best:  447.33/1927.69 GFLOPS | Progress: (784/784) | 3869.14 s Done.
#    [Task  4/12]  Current/Best:  481.11/1912.34 GFLOPS | Progress: (672/672) | 3274.25 s Done.
#    [Task  5/12]  Current/Best:  414.09/1598.45 GFLOPS | Progress: (672/672) | 2720.78 s Done.
#    [Task  6/12]  Current/Best:  508.96/2273.20 GFLOPS | Progress: (768/768) | 3718.75 s Done.
#    [Task  7/12]  Current/Best:  469.14/1955.79 GFLOPS | Progress: (576/576) | 2665.67 s Done.
#    [Task  8/12]  Current/Best:  230.91/1658.97 GFLOPS | Progress: (576/576) | 2435.01 s Done.
#    [Task  9/12]  Current/Best:  487.75/2295.19 GFLOPS | Progress: (648/648) | 3009.95 s Done.
#    [Task 10/12]  Current/Best:  182.33/1734.45 GFLOPS | Progress: (360/360) | 1755.06 s Done.
#    [Task 11/12]  Current/Best:  372.18/1745.15 GFLOPS | Progress: (360/360) | 1684.50 s Done.
#    [Task 12/12]  Current/Best:  215.34/2271.11 GFLOPS | Progress: (400/400) | 2128.74 s Done.
#    Compile...
#    Evaluate inference time cost...
#    Mean inference time (std dev): 3.16 ms (0.03 ms)

deepinsight / insightface

No speed increase when converting model to TVM #897