davisking / dlib

A toolkit for making real world machine learning and data analysis applications in C++
http://dlib.net

DNN CPU speed #517

Closed gnawice closed 6 years ago

gnawice commented 7 years ago

Has anyone benchmarked DNN speeds on the CPU? With the lack of a model zoo and the inability to convert models, I don't have a good feel for any direct comparison (say, on VGG16) against other packages. I'm trying to decide whether it is worth the effort to write a converter from dlib to another package for comparison.

davisking commented 7 years ago

I don't think anyone has spent much time running on the CPU. So I'm sure there is room to speed things up. All you need to do is build VGG16 and run some forward/backward passes to see how fast it is, though.

e-fominov commented 7 years ago

I use the CPU a lot: training is done on the GPU, and production networks run on the CPU. The main hint is to use a BLAS library (OpenBLAS/MKL); the CPU DNN module's performance increases by about 20x. Most of the time in a convolution layer (about 90%) is spent in GEMM, and that operation is very slow without a BLAS library.
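To make the GEMM point concrete: the usual trick is to unroll input patches into a matrix ("im2col") and hand the multiplication to BLAS. Below is a minimal single-channel sketch of that idea, assuming an OpenBLAS-style cblas.h; the helper name and layout are illustrative, not dlib's actual internals:

#include <cblas.h>
#include <vector>

// Correlation-style "convolution" (no kernel flip, as in DNN layers) of a
// single-channel H x W image with an R x S filter, done as im2col + SGEMM.
void conv_via_gemm(const float* in, int H, int W,
                   const float* k, int R, int S, float* out)
{
    const int OH = H - R + 1, OW = W - S + 1;   // "valid" output size
    // im2col: each output pixel becomes one column of R*S input samples.
    std::vector<float> cols((size_t)R * S * OH * OW);
    for (int y = 0; y < OH; ++y)
        for (int x = 0; x < OW; ++x)
            for (int r = 0; r < R; ++r)
                for (int s = 0; s < S; ++s)
                    cols[((size_t)(r * S + s) * OH + y) * OW + x] =
                        in[(y + r) * W + (x + s)];
    // out(1 x OH*OW) = k(1 x R*S) * cols(R*S x OH*OW): one big GEMM call,
    // which is why a fast BLAS dominates CPU convolution time.
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                1, OH * OW, R * S,
                1.0f, k, R * S, cols.data(), OH * OW,
                0.0f, out, OH * OW);
}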

davisking commented 7 years ago

Yeah, you definitely need to enable BLAS. But even so, I suspect very much that things can be made faster. The whole "use BLAS to do convolutions" thing bothers me and can't be the fastest way to do it. I would be very surprised if something like the SIMD-enabled dlib::spatially_filter_image() isn't better than BLAS for this.
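For reference, calling that function looks roughly like the sketch below; the image and filter values are made up for illustration, and the exact overloads should be checked against the current dlib headers:

#include <dlib/image_transforms.h>
#include <dlib/matrix.h>

int main()
{
    dlib::matrix<float> img(480, 640);
    img = 1;                          // dummy image contents
    dlib::matrix<float> filt(5, 5);
    filt = 1.0f / 25;                 // 5x5 box filter (odd-sized, as required)
    dlib::matrix<float> out;
    // SIMD-enabled spatial filtering from dlib's image processing module.
    dlib::spatially_filter_image(img, out, filt);
}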

OranjeeGeneral commented 7 years ago

I was wondering if it is worth looking into adding support for Intel MKL-DNN (the free one, not the one that comes with MKL; I know there is confusion there, thanks Intel). Surely the other low-hanging fruit would be to add OpenMP support to some of the computation layers and loss functions.

davisking commented 7 years ago

Oh cool. I didn't know there was a DNN library from Intel. Yeah, making dlib use that is probably a good idea when it's not using CUDA. It would have to be done as an optional switch, just like the BLAS support is done now, so that everything still works if it's not there. But that's not a big deal to set up. So if you want to look into that, I'm all for it.

OranjeeGeneral commented 7 years ago

Yep, check this out:

https://01.org/mkl-dnn

It went public around the end of last year. For me personally it is not super urgent (got lots of other things first), but it's certainly something that crossed my mind, especially if you're considering using Xeon Phi as a platform. So if someone else is keen on doing it, I certainly won't hold them back :)

edmBernard commented 7 years ago

For CPU speed-ups there is also NNPACK (https://github.com/Maratyszcza/NNPACK). Unlike MKL-DNN, you can run it on all platforms.

davisking commented 7 years ago

Cool, NNPACK seems nice too. The build system looks kinda wack though.

edmBernard commented 7 years ago

yeah :( Ninja works fine, but it's another dependency. I don't know if we can use make.

OranjeeGeneral commented 7 years ago

Hmm, interesting, I'd never heard of this one before. It's nice that it also supports ARM, but I wonder how it stacks up against Intel MKL-DNN. Surely the Intel boys know best how to squeeze performance out of their CPUs.

edmBernard commented 7 years ago

This thread in mxnet shows how they added MKL-DNN and NNPACK support: https://github.com/dmlc/mxnet/issues/2986, and there is a small benchmark.

davisking commented 7 years ago

Yeah, it seems like the Intel one will be faster. But the ARM support is nice. Like they are saying in the mxnet thread, maybe support for both would be nice.

headupinclouds commented 7 years ago

👍 for NEON support. Here are another couple of options:

travek commented 7 years ago

Hi!

I'm using MS VS 2015. I installed the Intel MKL library. I compiled my project, which uses dlib's deep learning, with #DLIB_USEB_BLAS. I'm using an inception architecture in my project, similar to the one in the dlib example. I noticed that using BLAS is not improving the speed of deep net (inception) training. Is this an expected result, or have others sped up training with BLAS on CPU?

e-fominov commented 7 years ago

MKL will improve DNN performance on CPU very much (how much depends on the network structure and data dimensions). There is no #DLIB_USEB_BLAS option when compiling dlib. MKL support is enabled automatically when building dlib, and you should see the message "Found MKL" in the CMake output. If you don't see any messages about MKL, it means MKL was not found and is not being used.

It looks like you are building some custom project without CMake, and that can be the reason for this problem. The CMake scripts automatically manage MKL linking.

The other possible reason: you are training on the GPU.

I can recommend following this tutorial.

travek commented 7 years ago

@e-fominov 1) Yes, I didn't use CMake. To link with Intel MKL I followed the "Compiling and Linking Intel® Math Kernel Library with Microsoft Visual C++" document. So in MS VS I chose: Properties » Configuration Properties » Intel Performance Libraries » Use Intel® MKL -> parallel, and Properties » Configuration Properties » C/C++ » Code Generation -> Multi-threaded (/MT). Yes, I used #define DLIB_USE_BLAS. I'm not training on GPU.

How can I check that the compiled application is using MKL functionality?

e-fominov commented 7 years ago

If you are building with the MSVC IDE, you should be linking against the dlib.lib library. MKL has to be used at the dlib.lib build stage, not when building the final application. Try building dlib.lib with CMake (and make sure you see that MKL is found and used) and then use it with MSVC.

Anyway, CMake is the only officially supported build tool.
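For anyone following along, the usual route is a small CMakeLists.txt that pulls dlib in as a subdirectory, so dlib's own scripts handle finding and linking MKL/BLAS. A minimal sketch (the path and target name here are made up):

cmake_minimum_required(VERSION 3.8)
project(myapp)

# dlib's own CMake scripts detect MKL/BLAS and wire up the linking.
add_subdirectory(/path/to/dlib dlib_build)

add_executable(myapp main.cpp)
target_link_libraries(myapp dlib::dlib)

Watch the configure output for the "Found MKL"-style message quoted above; if it doesn't appear, MKL isn't being used.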

travek commented 7 years ago

@e-fominov you're writing about building with dlib.lib. Why can't I just add source.cpp to the project, as described in dlib's "How to build with MSVC" document?

e-fominov commented 7 years ago

You can use source.cpp; it should work (and, as I understand it, it does). Source.cpp is the simplest way to use dlib. It has some limitations, and one of them is that managing dependencies is up to you.

Also, please check that DLIB_USE_BLAS is defined globally as a compiler switch, not as a "#define" in some source/header file.

Anyway, you were asking about performance. A good way to check it: use CMake to compile the examples with MKL and without it, and compare against your timings.
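To make that comparison concrete, a small benchmark like the sketch below can be compiled once against an MKL-enabled dlib build and once without, and the two timings compared. The network is a toy one invented just for the measurement, not anything from this thread:

#include <dlib/dnn.h>
#include <chrono>
#include <iostream>

using namespace dlib;

// A small throwaway network: one conv layer plus a classifier head.
using net_type = loss_multiclass_log<fc<10, relu<con<16, 5, 5, 1, 1,
                 input<matrix<unsigned char>>>>>>;

int main()
{
    net_type net;
    matrix<unsigned char> img(227, 227);
    img = 128;                                  // flat dummy image
    std::vector<matrix<unsigned char>> batch(32, img);

    auto t0 = std::chrono::steady_clock::now();
    net(batch);                                 // forward pass over the batch
    auto t1 = std::chrono::steady_clock::now();
    std::cout << std::chrono::duration<double>(t1 - t0).count() << " s\n";
}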

MyraBaba commented 6 years ago

Hi,

I'm an early flying baby C++ learner. I'm trying to make web_cam_face_detection multithreaded.

I wrote 👍

void dot_product(int i, std::vector<std::vector<rectangle>> &d3_faces, shape_predictor &pose_model,
                 frontal_face_detector &detector, cv_image<bgr_pixel> &cimg_small) {

    std::thread::id this_id = std::this_thread::get_id();
    std::vector<rectangle> tmpFaces = detector(cimg_small);
    std::cout << "thread  " << this_id << "  found " << tmpFaces.size() << endl;
    if (!tmpFaces.empty()) {
        for (std::vector<rectangle>::const_iterator it = tmpFaces.begin(); it != tmpFaces.end(); ++it)
            std::cout << *it << ' ' << endl;
        d3_faces.push_back(tmpFaces);   // store this frame's detections once
    } else {
        cout << "    found nothing   " << tmpFaces.size() << endl;
        // no detections: store a single empty placeholder rectangle
        d3_faces.push_back(std::vector<rectangle>(1, rectangle(0, 0, 0, 0)));
        cout << endl;
    }
}

in the main:

for (int i = 0; i <= nr_threads; ++i) {

    cap >> im;

    cv::resize(im, im_small, cv::Size(), 1.0 * FACE_DOWNSAMPLE_RATIO, 1.0 * FACE_DOWNSAMPLE_RATIO);

    cv_image<bgr_pixel> cimg_small(im_small);
    converted_frames.push_back(cimg_small);

    threads.push_back(std::thread(
            dot_product, i, std::ref(d3_faces), std::ref(pose_model), std::ref(detector), std::ref(cimg_small)));
}

I have an 8-core i7 MacBook Pro.

I set the number of threads to 10.

But I didn't see a performance improvement: still delayed and slow, slower than the original :)) What am I missing? Is the cv capture blocking, or is the threading concept wrong?

How can I get the benefit of my CPUs?

thx
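(For context on the snippet above, a few things stand out: all the worker threads push into the shared d3_faces vector with no lock, each thread holds a reference to a cv_image wrapping the im_small buffer the main loop keeps overwriting, and a single dlib detector instance is generally not safe to call from several threads at once. A minimal sketch of just the locking part, keeping the original signature:)

#include <dlib/image_processing/frontal_face_detector.h>
#include <dlib/image_processing.h>
#include <dlib/opencv.h>
#include <mutex>
#include <thread>
#include <vector>

using namespace dlib;

std::mutex faces_mutex;   // guards the shared d3_faces vector

void dot_product(int i, std::vector<std::vector<rectangle>>& d3_faces,
                 shape_predictor& pose_model, frontal_face_detector& detector,
                 cv_image<bgr_pixel>& cimg_small)
{
    // NOTE: each thread should really have its own detector copy and its own
    // image buffer; they are shared here only to match the original signature.
    std::vector<rectangle> tmp_faces = detector(cimg_small);

    std::lock_guard<std::mutex> lock(faces_mutex);  // serialize the shared write
    d3_faces.push_back(tmp_faces);
}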

MyraBaba commented 6 years ago

This is my cmake .. result:

alpullu-2:build alpullu$ cmake .. -DUSE_AVX_INSTRUCTIONS=1
-- The C compiler identification is AppleClang 9.0.0.9000038
-- The CXX compiler identification is AppleClang 9.0.0.9000038
-- Check for working C compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc
-- Check for working C compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
-- Check for working CXX compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Enabling AVX instructions
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - found
-- Found Threads: TRUE
-- Looking for png_create_read_struct
-- Looking for png_create_read_struct - found
-- Looking for jpeg_read_header
-- Looking for jpeg_read_header - found
-- Searching for BLAS and LAPACK
-- Searching for BLAS and LAPACK
-- Found PkgConfig: /usr/local/bin/pkg-config (found version "0.29.2")
-- Checking for module 'cblas'
--   No package 'cblas' found
-- Checking for module 'lapack'
--   No package 'lapack' found
-- Looking for sys/types.h
-- Looking for sys/types.h - found
-- Looking for stdint.h
-- Looking for stdint.h - found
-- Looking for stddef.h
-- Looking for stddef.h - found
-- Check size of void*
-- Check size of void* - done
-- Found LAPACK library
-- Found CBLAS library
-- Looking for cblas_ddot
-- Looking for cblas_ddot - found
-- Looking for sgesv
-- Looking for sgesv - found
-- Looking for sgesv_
-- Looking for sgesv_ - found
CUDA_TOOLKIT_ROOT_DIR not found or specified
-- Could NOT find CUDA (missing: CUDA_TOOLKIT_ROOT_DIR CUDA_NVCC_EXECUTABLE CUDA_INCLUDE_DIRS CUDA_CUDART_LIBRARY) (Required is at least version "7.5")
-- Disabling CUDA support for dlib. DLIB WILL NOT USE CUDA
-- Building a C++11 test project to see if your compiler supports C++11
-- C++11 activated.
-- Building a C++11 test project to see if your compiler supports C++11
-- C++11 activated.
-- Configuring done
-- Generating done
-- Build files have been written to: /Users/alpullu/Downloads/dlib-19.8/build

edmBernard commented 6 years ago

@MyraBaba Can you open a new issue? I think you will get more help with a better title that fits your problem.

MyraBaba commented 6 years ago

Hi, I installed an Ubuntu VM and started everything over, including the MKL installation.

The Ubuntu VM worked like a charm, no problem.

The problem is Mac OS X. I installed MKL and all the required libs, but dlib couldn't pick up the lib. Something is missing. I spent almost a whole day before deciding to move to Ubuntu.

But dlib plus Intel MKL on Mac OS X is really a pain in the ass..

If there is a solution I would really appreciate it..

Do we need to open a new issue for this?

thx

davisking commented 6 years ago

You need to fill out the issue template if you want help. Also, http://dlib.net/faq.html#Itdoesntwork

dlib-issue-bot commented 6 years ago

Warning: this issue has been inactive for 252 days and will be automatically closed on 2018-09-07 if there is no further activity.

If you are waiting for a response but haven't received one it's likely your question is somehow inappropriate. E.g. you didn't follow the issue submission instructions, or your question is easily answerable by reading the FAQ, dlib's documentation, or a Google search.

dlib-issue-bot commented 6 years ago

Notice: this issue has been closed because it has been inactive for 257 days. You may reopen this issue if it has been closed in error.

pfeatherstone commented 4 years ago

CPU speed is mentioned here: https://github.com/davisking/dlib/issues/2211#issuecomment-708704174

arrufat commented 4 years ago

@pfeatherstone you might want to have a look at this. The interesting part is that they have a unified backend, which means you implement your algorithms with ArrayFire, and you can choose at runtime whether to use CPU, CUDA, or OpenCL: https://arrayfire.com/why-arrayfire/

pfeatherstone commented 4 years ago

@arrufat Yeah, I've seen ArrayFire before. It looks really good. It's used by https://github.com/facebookresearch/flashlight. However, my motivation for using dlib for DNN stuff is to have a highly portable library with very minimal dependencies. ArrayFire looks like a very heavy dependency... If I'm willing to link to large dependencies, the best I've used so far is onnxruntime.

pfeatherstone commented 4 years ago

Just noticed that the CPU backend just uses OpenBLAS and FFTW. So in terms of CPU performance, I imagine convolutions are just as fast as calling OpenBLAS directly, which is what dlib does in theory.

pfeatherstone commented 4 years ago

For GPU performance, both ArrayFire and dlib call cuDNN. So again, convolutions should be nearly identical.

pfeatherstone commented 4 years ago

So I don't know where the bottlenecks are in either the CPU or GPU code paths.

davisking commented 4 years ago

Honestly, using a BLAS library to implement convolution is just an easy hack. Anything that does it that way will be kinda slow. It's just like that in dlib (and some other libraries) because it's super easy and people care more about GPU performance.

For instance, the image processing part of dlib has several convolution functions, and they are implemented using different, much faster methods. But they don't support all the various stride and padding options.