Slow CNN on CPU

gsutra commented 3 years ago

Hi,

I am writing code for image segmentation, based on sample code from https://github.com/deephealthproject/use_case_pipeline/blob/master/cpp/skin_lesion_segmentation_inference.cpp

I am using the trained model provided on https://github.com/deephealthproject/use_case_pipeline : isic_segm_segnet_adam_lr_0.0001_loss_ce_size_192_epoch_24.bin. This model is composed of 5 downsampling blocks (convs+relu+maxpool) and 5 upsampling blocks (upsamp+convs+relu), for a total of 29,427,713 parameters.

The segmentation results are good 👍, but inference is slow. Calling forward on CPU takes approximately 30 seconds on my laptop. That seems slow to me, given that input images are resized to 192x192, but maybe I am missing something important that would speed up inference?

Here are the elapsed times for my code. Note the time for forward: 31490 milliseconds.

ImRead = 15 msec
Resize = 53 msec
Generating Random Table
Building model
Load model = 887 msec
Preproc = 1 msec
forward = 31490 msec
postproc = 13 msec
imwrite = 31 msec
total = 32492 msec
done

Below is my C++ code:

bool segementation_skin_lesion(filesystem::path filename, filesystem::path filename_output, filesystem::path filename_model)
{
    auto t0 = steady_clock::now();

    Image image;
    if (!ImRead(filename, image))
    {
        cout << "Cannot read " << filename << endl;
        return false;
    }

    auto t1 = steady_clock::now();
    cout << "ImRead = " << duration_cast<milliseconds>(t1 - t0).count() << " msec" << endl;

    // resize image
    vector<int> original_size{image.Width(), image.Height()};
    vector<int> net_size{192, 192};
    AugResizeDim(net_size).Apply(image);

    auto t2 = steady_clock::now();
    cout << "Resize = " << duration_cast<milliseconds>(t2 - t1).count() << " msec" << endl;

    // build model
    int num_classes = 1;
    layer in = Input({3, net_size[0], net_size[1]});
    layer out = SegNet(in, num_classes);
    layer out_sigm = Sigmoid(out);
    model net = Model({in}, {out_sigm});
    net->verbosity_level = 0;
    build(net, adam(0.0001f), {"cross_entropy"}, {"mean_squared_error"});
    toCPU(net);

    // load trained model weights
    try
    {
        load(net, filename_model.string());
    }
    catch (...)
    {
        cout << "Cannot load " << filename_model << endl;
        return false;
    }

    auto t3 = steady_clock::now();
    cout << "Load model = " << duration_cast<milliseconds>(t3 - t2).count() << " msec" << endl;

    // prepare input tensor
    Tensor *tensor_input = nullptr; // ImageToTensor is expected to allocate it
    ImageToTensor(image, tensor_input);
    tensor_input->unsqueeze_();
    tensor_input->div_(255.);

    auto t4 = steady_clock::now();
    cout << "Preproc = " << duration_cast<milliseconds>(t4 - t3).count() << " msec" << endl;

    // run network
    forward(net, {tensor_input});

    auto t5 = steady_clock::now();
    cout << "forward = " << duration_cast<milliseconds>(t5 - t4).count() << " msec" << endl;

    // get output
    Tensor *tensor_output = getOutput(out_sigm);
    Tensor *tensor_result = tensor_output->select({"0"});
    tensor_result->mult_(255.);

    Image image_result;
    TensorToImage(tensor_result, image_result);

    delete tensor_input;
    delete tensor_output;
    delete tensor_result;

    // threshold and extract biggest component
    Threshold(image_result, image_result, 127, 255);

    image_result.colortype_ = ColorType::GRAY;
    image_result.channels_ = "xyc";
    ResizeDim(image_result, image_result, original_size, InterpolationType::nearest);

    auto t6 = steady_clock::now();
    cout << "postproc = " << duration_cast<milliseconds>(t6 - t5).count() << " msec" << endl;

    ImWrite(filename_output, image_result);

    auto t7 = steady_clock::now();
    cout << "imwrite = " << duration_cast<milliseconds>(t7 - t6).count() << " msec" << endl;

    cout << "total = " << duration_cast<milliseconds>(t7 - t0).count() << " msec" << endl;

    return true;
}
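
The function above repeats the same steady_clock / duration_cast pattern for every stage. A minimal sketch of a helper that factors this out follows; `time_stage` is a hypothetical name, not part of ECVL or EDDL, and it uses only the C++ standard library.

```cpp
#include <chrono>
#include <iostream>
#include <string>
#include <utility>

// Hypothetical helper (not an ECVL/EDDL function): runs `stage`, prints
// "<label> = <N> msec" in the same format as the log above, and returns N.
template <typename Stage>
long long time_stage(const std::string& label, Stage&& stage)
{
    using namespace std::chrono;
    const auto start = steady_clock::now();
    std::forward<Stage>(stage)();  // run the measured code
    const auto elapsed =
        duration_cast<milliseconds>(steady_clock::now() - start).count();
    std::cout << label << " = " << elapsed << " msec" << std::endl;
    return elapsed;
}
```

Assuming the same net and tensor_input variables as in the function above, the forward measurement could then be written as time_stage("forward", [&] { forward(net, {tensor_input}); });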

RParedesPalacios commented 3 years ago

Hi, I will check this on CPU as well and report back ASAP.

RParedesPalacios commented 3 years ago

Hi, I have checked SegNet on CPU and got very similar results: 43 seconds for the forward pass. I sped it up 4x by compiling with the -march=native flag:

cmake .. -D CMAKE_CXX_FLAGS="-march=native"

Unfortunately, this compilation flag leads to segfaults on some architectures. For this reason, there are several issues with the CPU implementation, reported by the BSC group, that we still have to address. The fixes are not applied yet, but we are working on them.

Regards
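
One way to see whether a flag such as -march=native actually took effect is to check the compiler's predefined macros. The sketch below assumes GCC, Clang or MSVC targeting x86; `enabled_simd_level` is a hypothetical helper, not an EDDL function. It reports only what the compiler was allowed to emit for the file it is compiled in, so it reflects the EDDL build only if it is compiled with the same flags.

```cpp
#include <string>

// Hypothetical helper: returns the widest x86 SIMD level enabled at compile
// time for this translation unit, judged from predefined compiler macros.
// GCC/Clang define these under -march=native or -mavx2 etc.; MSVC defines
// __AVX__, __AVX2__ and __AVX512F__ under the matching /arch option.
inline std::string enabled_simd_level()
{
#if defined(__AVX512F__)
    return "AVX-512F";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    return "SSE2";  // MSVC x64 implies SSE2 without defining __SSE2__
#else
    return "baseline (no SSE2/AVX macro defined)";
#endif
}
```

Printing the returned string from a small test program built with the same CMAKE_CXX_FLAGS gives a quick sanity check before timing anything.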

RParedesPalacios commented 3 years ago

Linux, Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz. (12 threads)

I have run a SegNet with batch_size=1 to simulate the forward pass of a single image, which I think is exactly what you are doing in your example. I obtained 2.9 seconds.

CMake: cmake .. -D CMAKE_CXX_FLAGS="-march=native"

In any case this is my code:

#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <chrono>

#include "eddl/apis/eddl.h"

using namespace eddl;
using namespace std::chrono;

layer SegNet(layer x, const int& num_classes)
{
    x = ReLu(Conv(x, 64, { 3,3 }, { 1, 1 }, "same"));
    x = ReLu(Conv(x, 64, { 3,3 }, { 1, 1 }, "same"));
    x = MaxPool(x, { 2,2 }, { 2,2 });
    x = ReLu(Conv(x, 128, { 3,3 }, { 1, 1 }, "same"));
    x = ReLu(Conv(x, 128, { 3,3 }, { 1, 1 }, "same"));
    x = MaxPool(x, { 2,2 }, { 2,2 });
    x = ReLu(Conv(x, 256, { 3,3 }, { 1, 1 }, "same"));
    x = ReLu(Conv(x, 256, { 3,3 }, { 1, 1 }, "same"));
    x = ReLu(Conv(x, 256, { 3,3 }, { 1, 1 }, "same"));
    x = MaxPool(x, { 2,2 }, { 2,2 });
    x = ReLu(Conv(x, 512, { 3,3 }, { 1, 1 }, "same"));
    x = ReLu(Conv(x, 512, { 3,3 }, { 1, 1 }, "same"));
    x = ReLu(Conv(x, 512, { 3,3 }, { 1, 1 }, "same"));
    x = MaxPool(x, { 2,2 }, { 2,2 });
    x = ReLu(Conv(x, 512, { 3,3 }, { 1, 1 }, "same"));
    x = ReLu(Conv(x, 512, { 3,3 }, { 1, 1 }, "same"));
    x = ReLu(Conv(x, 512, { 3,3 }, { 1, 1 }, "same"));
    x = MaxPool(x, { 2,2 }, { 2,2 });

    x = UpSampling(x, { 2,2 });
    x = ReLu(Conv(x, 512, { 3,3 }, { 1, 1 }, "same"));
    x = ReLu(Conv(x, 512, { 3,3 }, { 1, 1 }, "same"));
    x = ReLu(Conv(x, 512, { 3,3 }, { 1, 1 }, "same"));
    x = UpSampling(x, { 2,2 });
    x = ReLu(Conv(x, 512, { 3,3 }, { 1, 1 }, "same"));
    x = ReLu(Conv(x, 512, { 3,3 }, { 1, 1 }, "same"));
    x = ReLu(Conv(x, 256, { 3,3 }, { 1, 1 }, "same"));
    x = UpSampling(x, { 2,2 });
    x = ReLu(Conv(x, 256, { 3,3 }, { 1, 1 }, "same"));
    x = ReLu(Conv(x, 256, { 3,3 }, { 1, 1 }, "same"));
    x = ReLu(Conv(x, 128, { 3,3 }, { 1, 1 }, "same"));
    x = UpSampling(x, { 2,2 });
    x = ReLu(Conv(x, 128, { 3,3 }, { 1, 1 }, "same"));
    x = ReLu(Conv(x, 64, { 3,3 }, { 1, 1 }, "same"));
    x = UpSampling(x, { 2,2 });
    x = ReLu(Conv(x, 64, { 3,3 }, { 1, 1 }, "same"));
    x = Conv(x, num_classes, { 3,3 }, { 1,1 }, "same");

    return x;
}

int main()
{
    // Settings
    int batch_size = 1;
    int num_classes = 1;
    std::vector<int> size{ 192, 192 }; // Size of images

    // Define network
    layer in = Input({ 3, size[0], size[1] });
    layer out = SegNet(in, num_classes);
    layer out_sigm = Sigmoid(out);
    model net = Model({ in }, { out_sigm });

    // Build model
    build(net,
        adam(0.0001f), //Optimizer
        { "cross_entropy" }, // Losses
        { "mean_squared_error" } // Metrics
    );

    //toGPU(net);

    // View model
    summary(net);
    plot(net, "model.pdf");

    // Prepare tensors which store batch
    Tensor* x = new Tensor({ batch_size, 3 , size[0], size[1] });
    Tensor* y = new Tensor({ batch_size, 1 , size[0], size[1] });

    cout<<"Forward start"<<endl;

    auto t1 = steady_clock::now();

    forward(net, { x });

    auto t2 = steady_clock::now();

    cout << "forward = " << duration_cast<milliseconds>(t2 - t1).count() << " msec" << endl;
}

OUTPUT:

Generating Random Table
CS with full memory setup
Building model

model

input1 | (3, 192, 192)=> (3, 192, 192)
conv1 | (3, 192, 192)=> (64, 192, 192)
relu1 | (64, 192, 192)=> (64, 192, 192)
conv2 | (64, 192, 192)=> (64, 192, 192)
relu2 | (64, 192, 192)=> (64, 192, 192)
maxpool2 | (64, 192, 192)=> (64, 96, 96)
conv3 | (64, 96, 96)=> (128, 96, 96)
relu3 | (128, 96, 96)=> (128, 96, 96)
conv4 | (128, 96, 96)=> (128, 96, 96)
relu4 | (128, 96, 96)=> (128, 96, 96)
maxpool4 | (128, 96, 96)=> (128, 48, 48)
conv5 | (128, 48, 48)=> (256, 48, 48)
relu5 | (256, 48, 48)=> (256, 48, 48)
conv6 | (256, 48, 48)=> (256, 48, 48)
relu6 | (256, 48, 48)=> (256, 48, 48)
conv7 | (256, 48, 48)=> (256, 48, 48)
relu7 | (256, 48, 48)=> (256, 48, 48)
maxpool6 | (256, 48, 48)=> (256, 24, 24)
conv8 | (256, 24, 24)=> (512, 24, 24)
relu8 | (512, 24, 24)=> (512, 24, 24)
conv9 | (512, 24, 24)=> (512, 24, 24)
relu9 | (512, 24, 24)=> (512, 24, 24)
conv10 | (512, 24, 24)=> (512, 24, 24)
relu10 | (512, 24, 24)=> (512, 24, 24)
maxpool8 | (512, 24, 24)=> (512, 12, 12)
conv11 | (512, 12, 12)=> (512, 12, 12)
relu11 | (512, 12, 12)=> (512, 12, 12)
conv12 | (512, 12, 12)=> (512, 12, 12)
relu12 | (512, 12, 12)=> (512, 12, 12)
conv13 | (512, 12, 12)=> (512, 12, 12)
relu13 | (512, 12, 12)=> (512, 12, 12)
maxpool10 | (512, 12, 12)=> (512, 6, 6)
upsampling1 | (512, 6, 6)=> (512, 12, 12)
conv14 | (512, 12, 12)=> (512, 12, 12)
relu14 | (512, 12, 12)=> (512, 12, 12)
conv15 | (512, 12, 12)=> (512, 12, 12)
relu15 | (512, 12, 12)=> (512, 12, 12)
conv16 | (512, 12, 12)=> (512, 12, 12)
relu16 | (512, 12, 12)=> (512, 12, 12)
upsampling2 | (512, 12, 12)=> (512, 24, 24)
conv17 | (512, 24, 24)=> (512, 24, 24)
relu17 | (512, 24, 24)=> (512, 24, 24)
conv18 | (512, 24, 24)=> (512, 24, 24)
relu18 | (512, 24, 24)=> (512, 24, 24)
conv19 | (512, 24, 24)=> (256, 24, 24)
relu19 | (256, 24, 24)=> (256, 24, 24)
upsampling3 | (256, 24, 24)=> (256, 48, 48)
conv20 | (256, 48, 48)=> (256, 48, 48)
relu20 | (256, 48, 48)=> (256, 48, 48)
conv21 | (256, 48, 48)=> (256, 48, 48)
relu21 | (256, 48, 48)=> (256, 48, 48)
conv22 | (256, 48, 48)=> (128, 48, 48)
relu22 | (128, 48, 48)=> (128, 48, 48)
upsampling4 | (128, 48, 48)=> (128, 96, 96)
conv23 | (128, 96, 96)=> (128, 96, 96)
relu23 | (128, 96, 96)=> (128, 96, 96)
conv24 | (128, 96, 96)=> (64, 96, 96)
relu24 | (64, 96, 96)=> (64, 96, 96)
upsampling5 | (64, 96, 96)=> (64, 192, 192)
conv25 | (64, 192, 192)=> (64, 192, 192)
relu25 | (64, 192, 192)=> (64, 192, 192)
conv26 | (64, 192, 192)=> (1, 192, 192)
sigmoid26 | (1, 192, 192)=> (1, 192, 192)

Forward start
forward = 2984 msec
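
As a rough cross-check of the numbers in this thread, the cost of the 26 convolutions can be tallied from the summary above. The sketch below is not EDDL code: `ConvSpec`, `Cost`, `segnet_convs` and `conv_cost` are hypothetical names, and it assumes 3x3 kernels with one bias per output channel. ReLU, pooling, upsampling and memory traffic are ignored.

```cpp
#include <cstdint>
#include <vector>

// One 3x3 "same" convolution: input channels, output channels, and the side
// of its square output feature map, as listed in the summary above.
struct ConvSpec { int in_ch; int out_ch; int side; };

struct Cost { std::uint64_t params; std::uint64_t macs; };

// The 26 convolutions of the SegNet above, in order (conv1 .. conv26).
inline std::vector<ConvSpec> segnet_convs()
{
    return {
        {3, 64, 192},   {64, 64, 192},
        {64, 128, 96},  {128, 128, 96},
        {128, 256, 48}, {256, 256, 48}, {256, 256, 48},
        {256, 512, 24}, {512, 512, 24}, {512, 512, 24},
        {512, 512, 12}, {512, 512, 12}, {512, 512, 12},
        {512, 512, 12}, {512, 512, 12}, {512, 512, 12},
        {512, 512, 24}, {512, 512, 24}, {512, 256, 24},
        {256, 256, 48}, {256, 256, 48}, {256, 128, 48},
        {128, 128, 96}, {128, 64, 96},
        {64, 64, 192},  {64, 1, 192},
    };
}

// Sums parameters and multiply-accumulates (MACs) for one input image.
inline Cost conv_cost(const std::vector<ConvSpec>& convs)
{
    Cost total{0, 0};
    for (const ConvSpec& c : convs) {
        const std::uint64_t weights = 9ULL * c.in_ch * c.out_ch;  // 3x3 kernel
        total.params += weights + c.out_ch;            // plus one bias per output channel
        total.macs   += weights * c.side * c.side;     // one MAC per weight per output pixel
    }
    return total;
}
```

Under these assumptions the parameter total is 29,427,713, which matches the figure quoted in the first post, and one 192x192 forward pass comes to roughly 22.5 billion multiply-accumulates. Dividing by the measured times gives about 0.7 billion MAC/s for the 31.5 second run and about 7.5 billion MAC/s for the 2.98 second run.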

gsutra commented 3 years ago

Hi @RParedesPalacios,

Thanks a lot for your time. I am certainly missing a compile flag. I am building eddl on Windows, and the -march=native option is not available with the MSVC compiler. I tried some options, but they did not improve the timing 😢

For Windows users, here are my attempts:

/favor:INTEL64 => build ok, run ok, but no speedup
/arch:AVX512 => build ok, run NOT ok (some unit tests fail, others crash, and calls from ecvl crash)

gsutra commented 3 years ago

The flag /arch:AVX2 instead of /arch:AVX512 seems to work: I now get about 5 seconds instead of 30 seconds 👍

Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
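
A plausible explanation for the crashes with the AVX512 build, not confirmed anywhere in this thread, is that the CPU does not implement AVX-512 (public specifications for the i7-6820HQ list AVX2 as its widest extension). A minimal sketch of a runtime check follows; `SimdSupport`, `query_simd_support` and `describe` are hypothetical names, and the MSVC branch reads only the CPUID feature bits without verifying that the operating system has enabled the wider registers.

```cpp
#include <string>

#if defined(_MSC_VER) && !defined(__clang__) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>  // __cpuid, __cpuidex
#endif

// `known` is false when this sketch has no detection path for the platform.
struct SimdSupport { bool known; bool avx2; bool avx512f; };

inline SimdSupport query_simd_support()
{
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  // harmless if the runtime already initialised it
    return { true,
             __builtin_cpu_supports("avx2") != 0,
             __builtin_cpu_supports("avx512f") != 0 };
#elif defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    int regs[4] = {0, 0, 0, 0};
    __cpuid(regs, 0);                        // EAX = highest standard CPUID leaf
    if (regs[0] < 7) return { true, false, false };
    __cpuidex(regs, 7, 0);                   // leaf 7, subleaf 0: extended features
    const bool avx2    = (regs[1] & (1 << 5))  != 0;  // EBX bit 5
    const bool avx512f = (regs[1] & (1 << 16)) != 0;  // EBX bit 16
    return { true, avx2, avx512f };
#else
    return { false, false, false };
#endif
}

// Formats the result for logging.
inline std::string describe(const SimdSupport& s)
{
    if (!s.known) return "SIMD support: unknown on this platform";
    return std::string("AVX2: ") + (s.avx2 ? "yes" : "no") +
           ", AVX-512F: " + (s.avx512f ? "yes" : "no");
}
```

Running describe(query_simd_support()) on the target machine before picking an /arch option shows whether the wider instruction set is available at all.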

deephealthproject / eddl

Slow CNN on CPU #201
