DNN CPU speed - Githubissues

gnawice commented 7 years ago

Has anyone benchmarked DNN speeds for CPU? With the lack of a model zoo and inability to convert models, I don't have a good feel for any direct comparisons (say on VGG16) to any other packages. Trying to decide if it is worth the effort to make a converter from dlib to another package for comparison.

davisking commented 7 years ago

I don't think anyone has spent much time running on the CPU. So I'm sure there is room to speed things up. All you need to do is to make VGG16 and run some forward backward passes to see how fast it is though.
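A minimal timing sketch for the "run some passes and see how fast it is" suggestion above. It uses only the C++ standard library; `average_ms` is a hypothetical helper name, and the callable passed to it would wrap whatever forward or forward/backward pass is being measured. Nothing here is dlib API.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Calls `fn` `warmup` times untimed, then `iters` times timed, and returns
// the average wall-clock time per timed call in milliseconds.
template <typename Fn>
double average_ms(Fn&& fn, std::size_t iters, std::size_t warmup = 1) {
    assert(iters > 0);
    for (std::size_t i = 0; i < warmup; ++i) fn();  // warm caches, lazy init
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iters; ++i) fn();
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iters);
}
```

Averaging over several iterations after a warm-up call keeps one-time setup costs out of the number, which matters when comparing a build with BLAS against one without it.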

e-fominov commented 7 years ago

I use the CPU a lot: training is done on the GPU, and production networks run on the CPU. The main hint is to use a BLAS library (OpenBLAS/MKL); CPU DNN module performance increases about 20x. Most of the time in a convolution layer (about 90%) is spent in GEMM, and this operation is very slow without a BLAS library.
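To illustrate why convolution time ends up in GEMM, here is a small sketch of the common "im2col + matrix multiply" formulation. It is an assumption that this mirrors what the DNN module does internally; the function names are hypothetical, and the naive `gemm` below stands in for the call a BLAS library would accelerate.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unrolls a C x H x W input into a (C*R*S) x (OH*OW) row-major matrix with
// one column per output position (stride 1, no padding).
std::vector<float> im2col(const std::vector<float>& in, std::size_t C,
                          std::size_t H, std::size_t W,
                          std::size_t R, std::size_t S) {
    assert(in.size() == C * H * W && R >= 1 && S >= 1 && R <= H && S <= W);
    const std::size_t OH = H - R + 1, OW = W - S + 1;
    std::vector<float> cols(C * R * S * OH * OW);
    std::size_t row = 0;
    for (std::size_t c = 0; c < C; ++c)
        for (std::size_t r = 0; r < R; ++r)
            for (std::size_t s = 0; s < S; ++s, ++row)
                for (std::size_t y = 0; y < OH; ++y)
                    for (std::size_t x = 0; x < OW; ++x)
                        cols[row * OH * OW + y * OW + x] =
                            in[(c * H + y + r) * W + x + s];
    return cols;
}

// Naive row-major product: (rows x inner) * (inner x cols).
// This is the part a BLAS sgemm would replace.
std::vector<float> gemm(const std::vector<float>& a, const std::vector<float>& b,
                        std::size_t rows, std::size_t inner, std::size_t cols) {
    assert(a.size() == rows * inner && b.size() == inner * cols);
    std::vector<float> out(rows * cols, 0.0f);
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t k = 0; k < inner; ++k)
            for (std::size_t j = 0; j < cols; ++j)
                out[i * cols + j] += a[i * inner + k] * b[k * cols + j];
    return out;
}

// Convolution (cross-correlation) of K filters, each C x R x S, with the
// input. The result is K x OH x OW, stored row-major.
std::vector<float> conv_via_gemm(const std::vector<float>& in,
                                 const std::vector<float>& filters,
                                 std::size_t C, std::size_t H, std::size_t W,
                                 std::size_t K, std::size_t R, std::size_t S) {
    const std::vector<float> cols = im2col(in, C, H, W, R, S);
    return gemm(filters, cols, K, C * R * S, (H - R + 1) * (W - S + 1));
}
```

After the unrolling step, all the arithmetic is one large matrix product, so the quality of the GEMM implementation dominates the layer's run time.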

davisking commented 7 years ago

Yeah, you definitely need to enable BLAS. But even so, I suspect very much that things can be made faster. The whole "use BLAS to do convolutions" thing bothers me and can't be the fastest way to do it. I would be very surprised if something like the SIMD enabled dlib::spatially_filter_image() isn't better than BLAS for this.
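For comparison, a scalar sketch of the direct (sliding-window) approach to filtering. This is not the implementation of dlib::spatially_filter_image() and contains no SIMD; `filter_direct` is a hypothetical name, and the sketch only shows the loop structure that a vectorized routine would optimize.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Direct "valid" 2D filtering (cross-correlation, stride 1, no padding) of a
// single-channel H x W image with an R x S kernel. The result is
// (H-R+1) x (W-S+1), stored row-major.
std::vector<float> filter_direct(const std::vector<float>& img,
                                 std::size_t H, std::size_t W,
                                 const std::vector<float>& kernel,
                                 std::size_t R, std::size_t S) {
    assert(img.size() == H * W && kernel.size() == R * S);
    assert(R >= 1 && S >= 1 && R <= H && S <= W);
    const std::size_t OH = H - R + 1, OW = W - S + 1;
    std::vector<float> out(OH * OW, 0.0f);
    for (std::size_t y = 0; y < OH; ++y) {
        for (std::size_t x = 0; x < OW; ++x) {
            float acc = 0.0f;
            // Accumulate the kernel window anchored at (y, x).
            for (std::size_t r = 0; r < R; ++r)
                for (std::size_t s = 0; s < S; ++s)
                    acc += img[(y + r) * W + x + s] * kernel[r * S + s];
            out[y * OW + x] = acc;
        }
    }
    return out;
}
```

Unlike the GEMM route, this reads the image in place and needs no unrolled copy of the input, which is one reason a well-vectorized direct filter could plausibly compete.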

OranjeeGeneral commented 7 years ago

I was wondering if it is worth looking into adding support for Intel MKL-DNN (the free one, not the one that comes with MKL; I know there is confusion there, thanks Intel). Surely the other low-hanging fruit would be to add OMP support to some of the computation layers and loss functions.

davisking commented 7 years ago

Oh cool. I didn't know there was a DNN library from Intel. Yeah, making dlib use that is probably a good idea when it's not using CUDA. It would have to be done as an optional switch just like the BLAS support is done now so that if it's not there everything works anyway. But that's not a big deal to setup. So if you want to look into that I'm all for it.
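A rough sketch of what an "optional switch" can look like at the source level. The macro `HYPOTHETICAL_USE_FAST_DNN` and the function `fast_relu` are invented for illustration; they are not dlib or MKL-DNN names. The idea is that a build system defines the macro only when the optional library was found, and the portable code path is used otherwise.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#ifdef HYPOTHETICAL_USE_FAST_DNN
// Hypothetical accelerated routine that the optional library would provide.
void fast_relu(const float* src, float* dst, std::size_t n);
#endif

// Portable fallback that is always available.
inline void portable_relu(const float* src, float* dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) dst[i] = src[i] > 0.0f ? src[i] : 0.0f;
}

// Callers use this one entry point; the backend is chosen at compile time.
std::vector<float> relu(const std::vector<float>& x) {
    std::vector<float> y(x.size());
    assert(y.size() == x.size());
#ifdef HYPOTHETICAL_USE_FAST_DNN
    fast_relu(x.data(), y.data(), x.size());
#else
    portable_relu(x.data(), y.data(), x.size());
#endif
    return y;
}
```

Because the choice is made by the preprocessor, a build without the optional library never references its symbols, so everything still compiles and links.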

OranjeeGeneral commented 7 years ago

Yep check this out:

https://01.org/mkl-dnn

It went public around the end of last year. For me personally it is not super urgent (I've got lots of other things first), but it is certainly something that crossed my mind, especially if you're considering using Xeon Phi as a platform. So if someone else is keen on doing it, I certainly will not hold them back :)

edmBernard commented 7 years ago

For CPU speedup there is also NNPACK (https://github.com/Maratyszcza/NNPACK). Unlike MKL-DNN, you can run it on all platforms.

davisking commented 7 years ago

Cool, NNPACK seems nice too. The build system looks kinda wack though.

edmBernard commented 7 years ago

yeah :( ninja works fine, but it's another dependency. I don't know if we can use make.

OranjeeGeneral commented 7 years ago

Hmm, interesting, I had never heard of this one before. It is nice that it also supports ARM, but I wonder how it stacks up against Intel MKL-DNN. Surely the Intel boys know best how to squeeze performance out of their CPUs.

edmBernard commented 7 years ago

This thread in mxnet shows how they added MKL-DNN and NNPACK support: https://github.com/dmlc/mxnet/issues/2986 and there is a small benchmark.

davisking commented 7 years ago

Yeah, it seems like the intel one will be faster. But the ARM support is nice. Like they are saying in the mxnet thread, maybe support for both would be nice.

headupinclouds commented 7 years ago

👍 for Neon support. Here are another couple of options:

The SIMD (SSE*, NEON, and Altivec) project seems to have an active CNN module: https://github.com/ermig1979/Simd/blob/master/src/Simd/SimdNeural.hpp
Apple BNNS (Accelerate/CPU) and MPSCNN (Metal/GPU) http://machinethink.net/blog/apple-deep-learning-bnns-versus-metal-cnn/ (Apple only, but probably hard to beat for iOS)

travek commented 7 years ago

Hi!

I'm using MS VS 2015 and have installed the Intel MKL library. I compiled my project, which uses dlib deep learning, with #DLIB_USEB_BLAS. My project uses an inception architecture similar to the one in the dlib example. I noticed that using BLAS does not improve the speed of deep net (inception) training. Is this the expected result, or have others sped up training with BLAS on the CPU?

e-fominov commented 7 years ago

MKL will improve DNN performance on the CPU very much (depending on network structure and data dimensions). There is no option #DLIB_USEB_BLAS when compiling dlib. MKL support is enabled automatically when building dlib, and you should see the message "Found MKL" in the CMake output. If you don't see any messages about MKL, it was not found and is not used.

It looks like you are building a custom project without CMake, and this can be the reason for the problem. The CMake scripts should manage MKL linking automatically.

The other possible reason - you are training on GPU

I can recommend you to follow this tutorial

travek commented 7 years ago

@e-fominov 1) Yes, I didn't use CMake. To link with Intel MKL I followed the "Compiling and Linking Intel® Math Kernel Library with Microsoft* Visual C++" document. So in MS VS I chose: Properties » Configuration properties » Intel Performance Libraries » Use Intel® MKL -> parallel, and Properties » Configuration properties » C/C++ » Code Generation -> Multi-threaded (/MT). Yes, I used #define DLIB_USE_BLAS. I'm not training on GPU.

How can I check that the compiled application is using MKL functionality?

e-fominov commented 7 years ago

If you are building with the MSVC IDE, you should be linking with the dlib.lib library. MKL should be used at the dlib.lib building stage, not when building the final application. Try building dlib.lib with CMake (and make sure you see that MKL is found and used) and then use it with MSVC.

Anyway, CMake is the only officially supported build tool.

travek commented 7 years ago

@e-fominov you're writing about building with dlib.lib. Why can't I just add source.cpp to the project, as it is described in the How to build with MSVC document of dlib?

e-fominov commented 7 years ago

You can use source.cpp; it should work (and as I understand, it works). source.cpp is the simplest way to use dlib, but it has some limitations, and one of them is that managing dependencies has to be done by you.

Also, please check that DLIB_USE_BLAS is defined globally as a compiler switch, not as "#define" in some source/header file.
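One way to see whether a global definition actually reaches a given source file is a tiny compile-time probe such as the sketch below. The helper names are hypothetical. Note the limits: it only reports whether DLIB_USE_BLAS is visible in the translation unit that contains it, and it says nothing about whether a BLAS or MKL library was linked.

```cpp
#include <cassert>  // not used by the helpers; handy when asserting on the result
#include <string>

// True if DLIB_USE_BLAS was defined when this translation unit was compiled.
inline bool blas_macro_defined() {
#ifdef DLIB_USE_BLAS
    return true;
#else
    return false;
#endif
}

// Human-readable form of the same check, e.g. for printing at startup.
inline std::string blas_macro_status() {
    return blas_macro_defined() ? "DLIB_USE_BLAS is defined"
                                : "DLIB_USE_BLAS is NOT defined";
}
```

If the macro is set as a project-wide compiler switch, every file that includes this probe reports the same answer; if it was only written as a "#define" in one file, other files report that it is not defined.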

Anyway, you were talking about performance. A good way to check it: use CMake to compile the examples with and without MKL, and compare with your timings.

MyraBaba commented 6 years ago

Hi,

I am an early-stage C++ learner, trying to make web_cam_face_detection multithreaded.

I wrote:

void dot_product(int i,  std::vector<std::vector<rectangle>> &d3_faces , shape_predictor &pose_model,
                 frontal_face_detector &detector, cv_image<bgr_pixel> &cimg_small) {

    std::thread::id this_id = std::this_thread::get_id();
    std::vector<rectangle> tmpFaces;
    rectangle rect_1 (0,0,0,0);
    std::vector<rectangle> rect ;
    rect.push_back(rect_1);
    tmpFaces = detector(cimg_small);
    std::cout << "thread  " << this_id << "  FOUND..." << tmpFaces.size() << endl;
    if (!tmpFaces.empty()) {
        for (std::vector<rectangle>::const_iterator i = tmpFaces.begin(); i != tmpFaces.end(); ++i) {
            std::cout << *i << ' ' << endl;
            d3_faces.push_back(tmpFaces);
        }
    } else{
        cout << "    not found   "  << tmpFaces.size() << endl;
        d3_faces.push_back(rect);
        cout<<endl;
    }
}

in the main:

for (int i = 0; i <= nr_threads; ++i) {

                cap >> im;

                cv::resize(im, im_small, cv::Size(), 1.0 * FACE_DOWNSAMPLE_RATIO, 1.0 * FACE_DOWNSAMPLE_RATIO);

                cv_image<bgr_pixel> cimg_small(im_small);
                converted_frames.push_back(cimg_small);

                threads.push_back(std::thread(
                        dot_product, i , std::ref(d3_faces) , std::ref(pose_model), std::ref(detector), std::ref(cimg_small))
                );;

            }

I have an 8-core i7 MacBook Pro.

I set the number of threads to 10.

But I didn't see a performance improvement; there is still delay and it is slow, slower than the original :)) What am I missing? Is cv::cap blocking, or is the threading concept wrong?

How can I benefit from my CPUs?

thx
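A hedged sketch of one way to structure the per-frame threading. In the snippet above, all threads push into the same d3_faces vector without a lock, and cimg_small is a loop-local variable passed by reference to a thread that may outlive it. The sketch avoids both by giving each thread its own result slot and its own copy of the worker, and by reading inputs from a container that outlives the threads. `process_in_parallel` is a hypothetical helper, and it is an assumption here that a detector instance should not be called from several threads at once. Whether this removes the slowdown is not something the sketch can show.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Runs work(frames[i]) on one thread per frame and returns the results in
// frame order. Each thread gets its own copy of `work` and writes only to
// its own slot, so no mutex is needed. Result must be default-constructible
// and must not be bool (std::vector<bool> packs elements together).
template <typename Frame, typename Fn>
auto process_in_parallel(const std::vector<Frame>& frames, Fn work)
    -> std::vector<decltype(work(frames[0]))> {
    using Result = decltype(work(frames[0]));
    std::vector<Result> results(frames.size());
    assert(results.size() == frames.size());
    std::vector<std::thread> threads;
    threads.reserve(frames.size());
    for (std::size_t i = 0; i < frames.size(); ++i) {
        // `work` is captured by value: a private copy per thread.
        threads.emplace_back([&results, &frames, work, i]() mutable {
            results[i] = work(frames[i]);
        });
    }
    for (std::thread& t : threads) t.join();  // wait before reading results
    return results;
}
```

In the face detection case, `frames` would hold images that own their pixel data and `work` would be a callable wrapping a detector; both of those are left out here because they depend on dlib and OpenCV types.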

MyraBaba commented 6 years ago

This is my cmake .. result:

alpullu-2:build alpullu$ cmake .. -DUSE_AVX_INSTRUCTIONS=1
-- The C compiler identification is AppleClang 9.0.0.9000038
-- The CXX compiler identification is AppleClang 9.0.0.9000038
-- Check for working C compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc
-- Check for working C compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
-- Check for working CXX compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Enabling AVX instructions
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - found
-- Found Threads: TRUE
-- Looking for png_create_read_struct
-- Looking for png_create_read_struct - found
-- Looking for jpeg_read_header
-- Looking for jpeg_read_header - found
-- Searching for BLAS and LAPACK
-- Searching for BLAS and LAPACK
-- Found PkgConfig: /usr/local/bin/pkg-config (found version "0.29.2")
-- Checking for module 'cblas'
-- No package 'cblas' found
-- Checking for module 'lapack'
-- No package 'lapack' found
-- Looking for sys/types.h
-- Looking for sys/types.h - found
-- Looking for stdint.h
-- Looking for stdint.h - found
-- Looking for stddef.h
-- Looking for stddef.h - found
-- Check size of void
-- Check size of void - done
-- Found LAPACK library
-- Found CBLAS library
-- Looking for cblas_ddot
-- Looking for cblas_ddot - found
-- Looking for sgesv
-- Looking for sgesv - found
-- Looking for sgesv_
-- Looking for sgesv_ - found
CUDA_TOOLKIT_ROOT_DIR not found or specified
-- Could NOT find CUDA (missing: CUDA_TOOLKIT_ROOT_DIR CUDA_NVCC_EXECUTABLE CUDA_INCLUDE_DIRS CUDA_CUDART_LIBRARY) (Required is at least version "7.5")
-- Disabling CUDA support for dlib. DLIB WILL NOT USE CUDA
-- Building a C++11 test project to see if your compiler supports C++11
-- C++11 activated.
-- Building a C++11 test project to see if your compiler supports C++11
-- C++11 activated.
-- Configuring done
-- Generating done
-- Build files have been written to: /Users/alpullu/Downloads/dlib-19.8/build

edmBernard commented 6 years ago

@MyraBaba Can you open a new issue? I think you will get more help with a better title that fits your problem.

MyraBaba commented 6 years ago

Hi, I installed an Ubuntu VM and started over, including the MKL installation.

The Ubuntu VM worked like a charm, no problem.

The problem is Mac OS X. I installed MKL and all the required libs, but dlib couldn't pick up the libs. Something is missing. I spent almost a whole day on it before deciding to move to Ubuntu.

But dlib with Intel MKL on Mac OS X is really a pain in the ass..

If there is a solution, I would really appreciate it.

Do we need to open a new issue for this?

thx

davisking commented 6 years ago

You need to fill out the issue template if you want help. Also, http://dlib.net/faq.html#Itdoesntwork

dlib-issue-bot commented 6 years ago

Warning: this issue has been inactive for 252 days and will be automatically closed on 2018-09-07 if there is no further activity.

If you are waiting for a response but haven't received one it's likely your question is somehow inappropriate. E.g. you didn't follow the issue submission instructions, or your question is easily answerable by reading the FAQ, dlib's documentation, or a Google search.

dlib-issue-bot commented 6 years ago

Notice: this issue has been closed because it has been inactive for 257 days. You may reopen this issue if it has been closed in error.