Closed: gsutra closed this issue 3 years ago
Hi, I will check that with the CPU as well and report back as soon as possible.
Hi, I have checked SegNet on CPU and got very similar results: 43 seconds for the forward pass. I sped it up 4x using the -march=native flag at compile time:
cmake .. -D CMAKE_CXX_FLAGS="-march=native"
Unfortunately, on some architectures this compilation flag leads to segfaults. For this reason there are several issues affecting the CPU implementation, reported by the BSC group, whose fixes have not yet been applied, but we are working on them.
Regards
Linux, Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz (12 threads)
I have run SegNet with batch_size=1 to simulate the forward pass of a single image, which I believe is exactly what you are doing in your example. I obtained 2.9 seconds.
CMake:
cmake .. -D CMAKE_CXX_FLAGS="-march=native"
In any case, this is my code:
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <chrono>
#include "eddl/apis/eddl.h"
using namespace eddl;
using namespace std::chrono;
layer SegNet(layer x, const int& num_classes)
{
    // Encoder: 5 blocks of convs + relu + maxpool
    x = ReLu(Conv(x, 64, { 3, 3 }, { 1, 1 }, "same"));
    x = ReLu(Conv(x, 64, { 3, 3 }, { 1, 1 }, "same"));
    x = MaxPool(x, { 2, 2 }, { 2, 2 });
    x = ReLu(Conv(x, 128, { 3, 3 }, { 1, 1 }, "same"));
    x = ReLu(Conv(x, 128, { 3, 3 }, { 1, 1 }, "same"));
    x = MaxPool(x, { 2, 2 }, { 2, 2 });
    x = ReLu(Conv(x, 256, { 3, 3 }, { 1, 1 }, "same"));
    x = ReLu(Conv(x, 256, { 3, 3 }, { 1, 1 }, "same"));
    x = ReLu(Conv(x, 256, { 3, 3 }, { 1, 1 }, "same"));
    x = MaxPool(x, { 2, 2 }, { 2, 2 });
    x = ReLu(Conv(x, 512, { 3, 3 }, { 1, 1 }, "same"));
    x = ReLu(Conv(x, 512, { 3, 3 }, { 1, 1 }, "same"));
    x = ReLu(Conv(x, 512, { 3, 3 }, { 1, 1 }, "same"));
    x = MaxPool(x, { 2, 2 }, { 2, 2 });
    x = ReLu(Conv(x, 512, { 3, 3 }, { 1, 1 }, "same"));
    x = ReLu(Conv(x, 512, { 3, 3 }, { 1, 1 }, "same"));
    x = ReLu(Conv(x, 512, { 3, 3 }, { 1, 1 }, "same"));
    x = MaxPool(x, { 2, 2 }, { 2, 2 });
    // Decoder: 5 blocks of upsampling + convs + relu
    x = UpSampling(x, { 2, 2 });
    x = ReLu(Conv(x, 512, { 3, 3 }, { 1, 1 }, "same"));
    x = ReLu(Conv(x, 512, { 3, 3 }, { 1, 1 }, "same"));
    x = ReLu(Conv(x, 512, { 3, 3 }, { 1, 1 }, "same"));
    x = UpSampling(x, { 2, 2 });
    x = ReLu(Conv(x, 512, { 3, 3 }, { 1, 1 }, "same"));
    x = ReLu(Conv(x, 512, { 3, 3 }, { 1, 1 }, "same"));
    x = ReLu(Conv(x, 256, { 3, 3 }, { 1, 1 }, "same"));
    x = UpSampling(x, { 2, 2 });
    x = ReLu(Conv(x, 256, { 3, 3 }, { 1, 1 }, "same"));
    x = ReLu(Conv(x, 256, { 3, 3 }, { 1, 1 }, "same"));
    x = ReLu(Conv(x, 128, { 3, 3 }, { 1, 1 }, "same"));
    x = UpSampling(x, { 2, 2 });
    x = ReLu(Conv(x, 128, { 3, 3 }, { 1, 1 }, "same"));
    x = ReLu(Conv(x, 64, { 3, 3 }, { 1, 1 }, "same"));
    x = UpSampling(x, { 2, 2 });
    x = ReLu(Conv(x, 64, { 3, 3 }, { 1, 1 }, "same"));
    x = Conv(x, num_classes, { 3, 3 }, { 1, 1 }, "same");
    return x;
}
int main()
{
    // Settings
    int batch_size = 1;
    int num_classes = 1;
    std::vector<int> size{ 192, 192 }; // Size of images

    // Define network
    layer in = Input({ 3, size[0], size[1] });
    layer out = SegNet(in, num_classes);
    layer out_sigm = Sigmoid(out);
    model net = Model({ in }, { out_sigm });

    // Build model
    build(net,
        adam(0.0001f),            // Optimizer
        { "cross_entropy" },      // Losses
        { "mean_squared_error" }  // Metrics
    );
    //toGPU(net);

    // View model
    summary(net);
    plot(net, "model.pdf");

    // Prepare tensors which store the batch
    Tensor* x = new Tensor({ batch_size, 3, size[0], size[1] });
    Tensor* y = new Tensor({ batch_size, 1, size[0], size[1] });

    std::cout << "Forward start" << std::endl;
    auto t1 = steady_clock::now();
    forward(net, { x });
    auto t2 = steady_clock::now();
    std::cout << "forward = " << duration_cast<milliseconds>(t2 - t1).count() << " msec" << std::endl;
}
OUTPUT:
Generating Random Table
CS with full memory setup
Building model
model
input1      | (3, 192, 192)  => (3, 192, 192)
conv1       | (3, 192, 192)  => (64, 192, 192)
relu1       | (64, 192, 192) => (64, 192, 192)
conv2       | (64, 192, 192) => (64, 192, 192)
relu2       | (64, 192, 192) => (64, 192, 192)
maxpool2    | (64, 192, 192) => (64, 96, 96)
conv3       | (64, 96, 96)   => (128, 96, 96)
relu3       | (128, 96, 96)  => (128, 96, 96)
conv4       | (128, 96, 96)  => (128, 96, 96)
relu4       | (128, 96, 96)  => (128, 96, 96)
maxpool4    | (128, 96, 96)  => (128, 48, 48)
conv5       | (128, 48, 48)  => (256, 48, 48)
relu5       | (256, 48, 48)  => (256, 48, 48)
conv6       | (256, 48, 48)  => (256, 48, 48)
relu6       | (256, 48, 48)  => (256, 48, 48)
conv7       | (256, 48, 48)  => (256, 48, 48)
relu7       | (256, 48, 48)  => (256, 48, 48)
maxpool6    | (256, 48, 48)  => (256, 24, 24)
conv8       | (256, 24, 24)  => (512, 24, 24)
relu8       | (512, 24, 24)  => (512, 24, 24)
conv9       | (512, 24, 24)  => (512, 24, 24)
relu9       | (512, 24, 24)  => (512, 24, 24)
conv10      | (512, 24, 24)  => (512, 24, 24)
relu10      | (512, 24, 24)  => (512, 24, 24)
maxpool8    | (512, 24, 24)  => (512, 12, 12)
conv11      | (512, 12, 12)  => (512, 12, 12)
relu11      | (512, 12, 12)  => (512, 12, 12)
conv12      | (512, 12, 12)  => (512, 12, 12)
relu12      | (512, 12, 12)  => (512, 12, 12)
conv13      | (512, 12, 12)  => (512, 12, 12)
relu13      | (512, 12, 12)  => (512, 12, 12)
maxpool10   | (512, 12, 12)  => (512, 6, 6)
upsampling1 | (512, 6, 6)    => (512, 12, 12)
conv14      | (512, 12, 12)  => (512, 12, 12)
relu14      | (512, 12, 12)  => (512, 12, 12)
conv15      | (512, 12, 12)  => (512, 12, 12)
relu15      | (512, 12, 12)  => (512, 12, 12)
conv16      | (512, 12, 12)  => (512, 12, 12)
relu16      | (512, 12, 12)  => (512, 12, 12)
upsampling2 | (512, 12, 12)  => (512, 24, 24)
conv17      | (512, 24, 24)  => (512, 24, 24)
relu17      | (512, 24, 24)  => (512, 24, 24)
conv18      | (512, 24, 24)  => (512, 24, 24)
relu18      | (512, 24, 24)  => (512, 24, 24)
conv19      | (512, 24, 24)  => (256, 24, 24)
relu19      | (256, 24, 24)  => (256, 24, 24)
upsampling3 | (256, 24, 24)  => (256, 48, 48)
conv20      | (256, 48, 48)  => (256, 48, 48)
relu20      | (256, 48, 48)  => (256, 48, 48)
conv21      | (256, 48, 48)  => (256, 48, 48)
relu21      | (256, 48, 48)  => (256, 48, 48)
conv22      | (256, 48, 48)  => (128, 48, 48)
relu22      | (128, 48, 48)  => (128, 48, 48)
upsampling4 | (128, 48, 48)  => (128, 96, 96)
conv23      | (128, 96, 96)  => (128, 96, 96)
relu23      | (128, 96, 96)  => (128, 96, 96)
conv24      | (128, 96, 96)  => (64, 96, 96)
relu24      | (64, 96, 96)   => (64, 96, 96)
upsampling5 | (64, 96, 96)   => (64, 192, 192)
conv25      | (64, 192, 192) => (64, 192, 192)
relu25      | (64, 192, 192) => (64, 192, 192)
conv26      | (64, 192, 192) => (1, 192, 192)
sigmoid26   | (1, 192, 192)  => (1, 192, 192)
Forward start
forward = 2984 msec
Hi @RParedesPalacios,
Thanks a lot for your time. I am certainly missing a compile flag. I am building eddl on Windows, and the option -march=native is not available with the MSVC compiler. I tried some options but they did not improve things 😢
For Windows users, here are my attempts:
/favor=INTEL64 => build ok, run ok, but no speedup
/arch=AVX512 => build ok, run NOT ok (some unit tests fail, others crash, and calling from ecvl crashes)
/arch=AVX2 instead of /arch=AVX512 seems to work; I now have about 5 seconds instead of 30 seconds 👍
Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
Hi,
I am writing code for image segmentation, based on the sample code from https://github.com/deephealthproject/use_case_pipeline/blob/master/cpp/skin_lesion_segmentation_inference.cpp
I am using the trained model provided at https://github.com/deephealthproject/use_case_pipeline: isic_segm_segnet_adam_lr_0.0001_loss_ce_size_192_epoch_24.bin. This model is composed of 5 downsampling blocks (convs+relu+maxpool) and 5 upsampling blocks (upsamp+convs+relu), giving 29,427,713 parameters.
The segmentation results are good 👍, but inference is slow. Calling the forward method on CPU takes approximately 30 seconds on my laptop. I find this slow, given that the input images are resized to 192x192, but maybe I am missing some important point to speed up the inference?
Here are the elapsed times for my code. Note the time for forward: 31490 milliseconds.
Below is my cpp code: