Dobiasd / frugally-deep

A lightweight header-only library for using Keras (TensorFlow) models in C++.
MIT License

Using Eigen Unsupported modules to improve convolutions #167

Closed pfeatherstone closed 5 years ago

pfeatherstone commented 5 years ago

I noticed that Eigen 3.3 has unsupported modules, including modules for tensors and GEMM operations.

https://bitbucket.org/eigen/eigen/src/9b065de03d016d802a25366ff5f0055df6318121/unsupported/Eigen/CXX11/src/Tensor/README.md?at=default#markdown-header-convolutions

I noticed you implement your own GEMM operation in fdeep/convolution.hpp in the function convolve_im2col. This could be improved by using the GEMM functions from the Eigen unsupported modules.

I ran a test by inferring the UNet model from pix2pix in frugally deep. It took 18s compared to a model converted from onnx and inferred in OpenCV which took 3s. I think this shows that convolutions in frugally could be improved.

Thanks

pfeatherstone commented 5 years ago

In fact, using the Eigen::Tensor class might be a more efficient container than Eigen::Matrix, since Tensor expressions can keep track of operations without evaluating them right away. ArrayFire uses a similar concept. Just a thought.
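
For illustration, a minimal sketch of that lazy-evaluation behavior with Eigen's unsupported Tensor module (file name and values are made up):

// eigen_tensor_lazy_sketch.cpp
#include <iostream>
#include <eigen3/unsupported/Eigen/CXX11/Tensor>

int main() {
    Eigen::Tensor<float, 2> a(2, 2);
    Eigen::Tensor<float, 2> b(2, 2);
    a.setConstant(1.0f);
    b.setConstant(2.0f);

    // Building the expression only records the operation tree;
    // no arithmetic is performed yet.
    const auto expr = a + b * 3.0f;

    // The work happens when the expression is assigned to a concrete Tensor.
    const Eigen::Tensor<float, 2> result = expr;
    std::cout << result(0, 0) << std::endl; // prints 7
}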

Dobiasd commented 5 years ago

I noticed you implement your own GEMM operation in fdeep/convolution.hpp in the function convolve_im2col.

No, I don't. :) The loops in this function are just the im2col conversion. The actual GEMM is a call to Eigen: https://github.com/Dobiasd/frugally-deep/blob/3eafef23d63594049e4975106798e31342f22c96/include/fdeep/convolution.hpp#L127
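
In rough terms, the idea is this (a sketch only, not the actual frugally-deep code; names and shapes are made up for illustration):

// im2col_gemm_sketch.cpp
#include <eigen3/Eigen/Dense>

// filters: one row per output channel, each row a flattened kh*kw*d filter
// patches: one column per output pixel, each column a flattened receptive field
//          (this rearrangement is what the im2col loops produce)
Eigen::MatrixXf conv_as_gemm(const Eigen::MatrixXf& filters,
                             const Eigen::MatrixXf& patches)
{
    // After im2col, the whole convolution collapses into one matrix product,
    // which Eigen executes with its GEMM.
    return filters * patches; // out_channels x (out_height * out_width)
}

int main() {
    const int out_channels = 4, kh = 3, kw = 3, d = 2, out_pixels = 10;
    const Eigen::MatrixXf filters = Eigen::MatrixXf::Random(out_channels, kh * kw * d);
    const Eigen::MatrixXf patches = Eigen::MatrixXf::Random(kh * kw * d, out_pixels);
    return conv_as_gemm(filters, patches).rows() == out_channels ? 0 : 1;
}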


I ran a test by inferring the UNet model from pix2pix in frugally deep. It took 18s compared to a model converted from onnx and inferred in OpenCV which took 3s. I think this shows that convolutions in frugally could be improved.

18 s vs. 3 s is quite a big difference.

Did you enable all compiler optimizations for speed (-O3)? Did you allow your compiler to fully use the latest vectorizing instructions your CPU provides (-march=native)? Did OpenCV utilize multiple CPU cores? Did OpenCV utilize a GPU?

I'm asking all this because Eigen's GEMM operations are usually very fast.


Nevertheless, in two weeks I'll be able to have a deeper look at your recent suggestions.

pfeatherstone commented 5 years ago

Oops. I misread the code. Regarding OpenCV, it was a CPU-only build. It's possible it was using OpenMP though. Regarding fdeep, I always use "-Ofast -march=native" as compiler options. So the best it can be. The OpenCV DNN module is insanely fast to be fair though.

pfeatherstone commented 5 years ago

I noticed https://github.com/bcaine/nn_cpp is using Eigen::Tensor as a backend container. Maybe TensorFlow is using it too. It would be interesting to see how it fares against Eigen::Matrix. Maybe I shouldn't be so lazy and should do some tests myself.

Dobiasd commented 5 years ago

I noticed https://github.com/bcaine/nn_cpp is using Eigen::Tensor as a backend container. Maybe TensorFlow is using it too.

It looks like it, yes.

It would be interesting to see how it fares against Eigen::Matrix. Maybe I shouldn't be so lazy and should do some tests myself.

Maybe the mini benchmark from issue 166 can serve as a starting point? :slightly_smiling_face:

pfeatherstone commented 5 years ago

As per my comment in issue 166, I will test Eigen::Tensor and a few other libraries with chained operations when I get back from holiday. The basic building block in neural networks is (conv, batchnorm, relu), so I will start with repeated blocks like that.

Dobiasd commented 5 years ago

It would also be very interesting to learn whether these pre-implemented algorithms can provide results that are per-pixel identical to what Keras produces.

pfeatherstone commented 5 years ago

I'm getting the feeling that you require the exact same results as Keras down to the precision of floating-point numbers, i.e., 10^-34. I've noticed that some frameworks don't always give the exact same results. I don't know if that is simply due to floating-point precision, fast-math approximations, or something else, but I do sympathise when you're trying to get the exact same results as your training framework.

Dobiasd commented 5 years ago

I'm getting the feeling that you require the exact same results as Keras down to the precision of floating-point numbers, i.e., 10^-34.

No, it's not that bad. :)

When you load a frugally-deep model, the tests check whether the prediction in C++ gives the same results that Keras did for some inputs that were persisted with the model during conversion. Currently, the default epsilon for this comparison is 0.0001.
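
Conceptually, the check is just an element-wise comparison with that tolerance (a sketch of the idea with made-up names, not the actual frugally-deep test code):

// epsilon_check_sketch.cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// True if every element of the C++ prediction is within eps of the value
// Keras produced for the same input during model conversion.
bool outputs_match(const std::vector<float>& keras_out,
                   const std::vector<float>& cpp_out,
                   float eps = 0.0001f)
{
    assert(keras_out.size() == cpp_out.size());
    for (std::size_t i = 0; i < keras_out.size(); ++i)
        if (std::fabs(keras_out[i] - cpp_out[i]) > eps)
            return false;
    return true;
}

int main() {
    const std::vector<float> keras{0.10001f, 0.5f}; // toy stand-ins for flattened tensors
    const std::vector<float> cpp{0.10005f, 0.5f};
    return outputs_match(keras, cpp) ? 0 : 1;
}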

So I'm not worried about floating-point precision. Instead, I guess ready-made convolution libraries will not adhere to the same idiosyncratic padding rules (see here and here) that Keras uses. It was quite a pain to reverse-engineer and emulate those. They can even differ between TensorFlow versions and between running the Python scripts on CPU and GPU. To capture those machine differences, these wild checks are done during model conversion.
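
For reference, the commonly documented TensorFlow rule for padding="same" looks like this (a sketch of what has to be emulated, illustrative only, not frugally-deep's implementation):

// same_padding_sketch.cpp
#include <algorithm>
#include <iostream>

int main() {
    const long input = 56, filter = 3, stride = 1;                 // example sizes
    const long output = (input + stride - 1) / stride;             // ceil(input / stride)
    const long pad_total = std::max(0L, (output - 1) * stride + filter - input);
    const long pad_before = pad_total / 2;                         // the smaller half goes in front
    const long pad_after = pad_total - pad_before;
    // When pad_total is odd, which side gets the extra pixel is exactly the
    // kind of detail that has to match Keras to stay per-element identical.
    std::cout << pad_before << " " << pad_after << std::endl;      // prints: 1 1
}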

Thus, if we find a GEMM lib that is faster than Eigen, we can just replace that one call. However, I expect replacing the whole convolution (or even chains of convolutions) would break the compatibility with Keras. And I'm not willing to sacrifice this. The issue section here would be flooded with "fdeep gives the wrong results" posts. :grimacing:

Dobiasd commented 5 years ago

Thus it might make sense to have a closer look at Eigen Tensors. :slightly_smiling_face:

Dobiasd commented 5 years ago

Playing around with Eigen::Tensor::convolve, it seems to be slower than what TensorFlow is doing:

# conv2d_performance_test.py
import datetime

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Conv2D
from tensorflow.keras.models import Model

inputs = Input(shape=(1024, 1024, 128))
x = Conv2D(1, (3, 3))(inputs)
model = Model(inputs=inputs, outputs=x)
model.compile(loss='categorical_crossentropy', optimizer='nadam')

data_in = np.random.normal(size=(1, 1024, 1024, 128))
model.predict(data_in)
print(f'tensorflow=={tf.__version__}')
for _ in range(5):
    start_time = datetime.datetime.now()
    model.predict(data_in)
    duration = datetime.datetime.now() - start_time
    print('Forward pass took {} s.'.format(duration.total_seconds()))

CUDA_VISIBLE_DEVICES='' taskset --cpu-list 1 python3 conv2d_performance_test.py

Forward pass took 2.512061 s.
Forward pass took 2.435108 s.
Forward pass took 2.427209 s.
Forward pass took 2.502358 s.
Forward pass took 2.577842 s.

vs.

// eigen_tensor_convolve_multiple_filters.cpp
#include <chrono>
#include <iostream>
#include <eigen3/unsupported/Eigen/CXX11/Tensor>

int main() {
    Eigen::Tensor<float, 3> input(128, 1024, 1024);
    Eigen::Tensor<float, 3> filter(128, 3, 3);

    using namespace std::chrono;
    for (std::size_t run = 0; run < 5; ++run) {
        const auto start_time_ns = high_resolution_clock::now().time_since_epoch().count();

        Eigen::array<ptrdiff_t, 3> dims({0, 1, 2});
        Eigen::Tensor<float, 3> output = input.convolve(filter, dims);

        const auto end_time_ns = high_resolution_clock::now().time_since_epoch().count();
        const auto elapsed_s = ((end_time_ns - start_time_ns) / 1000000) / 1000.0;
        std::cout << "Convolution took " << elapsed_s << " s." << std::endl;
    }
}
g++ -std=c++14 -O3 -mavx eigen_tensor_convolve_multiple_filters.cpp -o eigen_tensor_convolve_multiple_filters
./eigen_tensor_convolve_multiple_filters
Convolution took 3.736 s.
Convolution took 3.633 s.
Convolution took 3.751 s.
Convolution took 3.629 s.
Convolution took 3.756 s.

TensorFlow is using Eigen::SpatialConvolution instead of Eigen::Tensor::convolve. Sadly, this Eigen::SpatialConvolution seems not to be part of the official Eigen lib; it's something custom in the TensorFlow codebase. They are just using the Eigen namespace for that.

Dobiasd commented 5 years ago

Quick status update:

Dobiasd commented 5 years ago

OK, so I just compared the performance of single convolutions (typical VGG19-layer config) in frugally-deep with TensorFlow (Eigen::SpatialConvolution):

//spatial_convolution_test.cpp
#include <chrono>
#include <iostream>

#include <eigen3/Eigen/Core>
#include <eigen3/unsupported/Eigen/CXX11/Tensor>

// https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/eigen_spatial_convolutions-inl.h
// https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/eigen_convolution_helpers.h
#include "tensorflow/core/kernels/eigen_spatial_convolutions-inl.h"
#include <fdeep/fdeep.hpp>

// Like in a typical VGG19 layer
const std::size_t k = 512;
const std::size_t x_width = 56;
const std::size_t x_height = 56;
const std::size_t x_depth = 256;
const std::size_t filter_height = 3;
const std::size_t filter_width = 3;
const std::size_t filter_depth = x_depth;

fdeep::internal::conv_2d_layer fdeep_conv_layer(
        "test_conv_layer",
        fdeep::shape5(1, 1, filter_height, filter_width, x_depth),
        k,
        fdeep::internal::shape2(1, 1),
        fdeep::internal::padding::same,
        fdeep::internal::shape2(1, 1),
        fdeep::float_vec(filter_height * filter_width * x_depth * k, 0),
        fdeep::float_vec(k, 0));

const fdeep::tensor5 x_fdeep(fdeep::shape5(1, 1, x_height, x_width, x_depth), 0);

Eigen::Tensor<float, 3> x_spatial_conv(x_depth, x_height, x_width);
Eigen::Tensor<float, 4> filters_spatial_conv(k, filter_depth, filter_height, filter_width);

float fdeep_im2col_conv()
{
    const auto result = fdeep_conv_layer.apply({x_fdeep});
    return result.front().get(0, 0, 0, 0, 0);
}

float eigen_spatial_conv()
{
    const Eigen::Tensor<float, 3> dest = SpatialConvolution(
        x_spatial_conv, filters_spatial_conv);
    return dest(0, 0, 0);
}

template <typename Func>
void measure(const std::string& name, const Func f)
{
    using namespace std::chrono;
    float checksum = 0.0f; // to prevent compiler from optimizing everything away
    const auto start_time_ns = high_resolution_clock::now().time_since_epoch().count();
    const std::size_t runs = 10;
    for (size_t i = 0; i < runs; ++i)
    {
        checksum += f();
    }
    const auto end_time_ns = high_resolution_clock::now().time_since_epoch().count();
    const auto elapsed_ms = (end_time_ns - start_time_ns) / (runs * 1000000);
    std::cout << name << " (checksum: " << checksum << ") elapsed_ms: " << elapsed_ms << std::endl;
}

int main()
{
    measure("frugally-deep convolution (im2col + GEMM)      ", fdeep_im2col_conv);
    measure("TensorFlow Eigen::SpatialConvolution           ", eigen_spatial_conv);
}

Output:

frugally-deep convolution (im2col + GEMM)       (checksum: 0) elapsed_ms: 143
TensorFlow Eigen::SpatialConvolution            (checksum: 0) elapsed_ms: 136

So the difference is marginal. It does not explain the huge difference of 0.93 s vs. 0.48 s of a forward pass on a VGG19 model. And it's not that frugally-deep spends all this time outside of the convolution code. At least according to my profiler, it's spending the vast majority of the time there:

(screenshot: profiling of forward passes on a VGG19 model)

So my conclusion up to now is:

  • TensorFlow's single convolution is not significantly faster than the one in frugally-deep.
  • TensorFlow nevertheless is roughly twice as fast on a convolution-heavy model.
  • Thus they do something else, e.g., fusing consecutive convolutions or something. I don't know yet. I'm trying to understand their code, but it's not easy for me.

Dobiasd commented 5 years ago

OK, something fishy is going on here. :fish:

The following minimal Python benchmark (just one convolution layer)

# conv2d_performance_vgg19_layer.py
import datetime

import numpy as np
from tensorflow.keras.layers import Input, Conv2D
from tensorflow.keras.models import Model

# Like in a typical VGG19 layer
k = 512
x_width = 56
x_height = 56
x_depth = 256
filter_height = 3
filter_width = 3
filter_depth = x_depth

inputs = Input(shape=(x_height, x_width, x_depth))
x = Conv2D(k, (filter_height, filter_width))(inputs)
model = Model(inputs=inputs, outputs=x)
model.compile(loss='categorical_crossentropy', optimizer='nadam')

data_in = np.random.normal(size=(1, x_height, x_width, x_depth))
model.predict(data_in)
duration_s = 0.0
runs = 10
for _ in range(runs):
    start_time = datetime.datetime.now()
    model.predict(data_in)
    duration_s += (datetime.datetime.now() - start_time).total_seconds()
print('Average forward-pass time in seconds: {}'.format(duration_s / runs))

ran like that:

CUDA_VISIBLE_DEVICES='' taskset --cpu-list 1 python3 conv2d_performance_vgg19_layer.py

prints (using the default wheel of TensorFlow 2.0.0)

Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA

and results in:

Average forward-pass time in seconds: 0.0983517

According to htop it's only using one CPU core.

When allowed to use multiple cores like that

CUDA_VISIBLE_DEVICES='' python3 conv2d_performance_vgg19_layer.py

it becomes even faster.

Average forward-pass time in seconds: 0.05571

So it really is only using one CPU core in the test that results in 0.0983517 seconds.


The same convolution in C++, using the original SpatialConvolution code from the TensorFlow repository

//conv2d_performance_vgg19_layer.cpp
#include <chrono>
#include <iostream>

#include <eigen3/Eigen/Core>
#include <eigen3/unsupported/Eigen/CXX11/Tensor>

// https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/eigen_spatial_convolutions-inl.h
// https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/eigen_convolution_helpers.h
#include "tensorflow/core/kernels/eigen_spatial_convolutions-inl.h"

// Like in a typical VGG19 layer
const std::size_t k = 512;
const std::size_t x_width = 56;
const std::size_t x_height = 56;
const std::size_t x_depth = 256;
const std::size_t filter_height = 3;
const std::size_t filter_width = 3;
const std::size_t filter_depth = x_depth;

Eigen::Tensor<float, 3> x_spatial_conv(x_depth, x_height, x_width);
Eigen::Tensor<float, 4> filters_spatial_conv(k, filter_depth, filter_height, filter_width);

int main()
{
    using namespace std::chrono;
    float checksum = 0.0f; // to prevent compiler from optimizing everything away
    const auto start_time_ns = high_resolution_clock::now().time_since_epoch().count();
    const std::size_t runs = 10;
    for (size_t i = 0; i < runs; ++i)
    {
        const Eigen::Tensor<float, 3> dest = SpatialConvolution(
            x_spatial_conv, filters_spatial_conv);
        checksum += dest(0, 0, 0); // use the result so the work cannot be optimized away
    }
    const auto end_time_ns = high_resolution_clock::now().time_since_epoch().count();
    const auto mean_elapsed_s = static_cast<double>(end_time_ns - start_time_ns) / (runs * 1000000000);
    std::cout << "Average convolution time in seconds: " << mean_elapsed_s << std::endl;
}

compiled and ran like that:

g++ -std=c++14 -w -mavx -O3 conv2d_performance_vgg19_layer.cpp -o conv2d_performance_vgg19_layer_avx
./conv2d_performance_vgg19_layer_avx

results in:

Average convolution time in seconds: 0.134459

So just the convolution alone in C++ takes more time than the whole forward pass in Python.

But I can make it faster by allowing more advanced SIMD instructions:

g++ -std=c++14 -w -march=native -O3 conv2d_performance_vgg19_layer.cpp -o conv2d_performance_vgg19_layer_native
./conv2d_performance_vgg19_layer_native

(I'm using an Intel Core i5-6600 CPU @ 3.30GHz for all these tests.)

The result then is:

Average convolution time in seconds: 0.0744446

So my suspicion is that the TensorFlow wheel somehow uses AVX2 (or whatever) despite claiming that it cannot. Maybe some internal CPU detection and function-pointer bending is happening.
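
Just to illustrate what such runtime dispatching could look like in principle (a toy sketch made up for this discussion, not TensorFlow's actual mechanism):

// cpu_dispatch_sketch.cpp
#include <iostream>

// Stand-ins for two kernel variants; in a real library the AVX2 one would be
// compiled with -mavx2 in its own translation unit.
void conv_generic() { std::cout << "generic kernel" << std::endl; }
void conv_avx2() { std::cout << "AVX2 kernel" << std::endl; }

int main() {
    // GCC/Clang builtin that queries CPUID at runtime, so one binary can
    // contain AVX2 code paths and still run (more slowly) on older CPUs.
    void (*conv)() = __builtin_cpu_supports("avx2") ? conv_avx2 : conv_generic;
    conv();
}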

Dobiasd commented 5 years ago

So, I've built TensorFlow from source (with g++ 7.1, using -march=native, which is the default in the Bazel-based build, according to the documentation) and installed the resulting wheel.

Now

CUDA_VISIBLE_DEVICES='' taskset --cpu-list 1 python3 conv2d_performance_vgg19_layer.py

results in

Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
[...]
Average forward-pass time in seconds: 0.096195

It has not become significantly faster. :point_up: (The remark about SSE, etc., again, seems to be incorrect.)

Thus, I, for now, "accuse" the default TensorFlow wheel, which states to not use AVX2/FMA, of "cheating". :grin:

Based on that, it seems to me the only fair comparison between TensorFlow and frugally-deep is to use -march=native for both. And then it looks much better for frugally-deep. The performance on VGG19 actually is quite similar! (0.48 s vs. 0.54 s)

Conclusion (up to now): TensorFlow likely did not actually improve its convolution-on-CPU performance significantly between versions 1.13.2 and 1.15.0, as I initially thought. The devs probably just allowed more advanced SIMD instructions in their default binary (by whatever means). Frugally-deep is similarly fast.


For completeness: the spatial_convolution_test.cpp from above, compiled with -march=native, gives the following output:

frugally-deep convolution (im2col + GEMM)       (checksum: 0) elapsed_ms: 80
TensorFlow Eigen::SpatialConvolution            (checksum: 0) elapsed_ms: 74

So it confirms the recent observations.

pfeatherstone commented 5 years ago

We can use objdump against the TensorFlow library and search for AVX2 instructions.

pfeatherstone commented 5 years ago

Also have you tried compiling your code with different compilers? I find I get varying performance with gcc and clang. Even between versions of the same compiler actually.

Dobiasd commented 5 years ago

We can use objdump against the tensorflow library and search for avx2 instructions

Sounds interesting. In case you already know how to do it: could you check the default pip wheel of TensorFlow 2.0.0?

Also have you tried compiling your code with different compilers? I find I get varying performance with gcc and clang. Even between versions of the same compiler actually.

Using spatial_convolution_test.cpp from above, it's quite consistent:

g++-5 -std=c++14 -w -mavx -O3 spatial_convolution_test.cpp -o spatial_convolution_test && ./spatial_convolution_test

frugally-deep convolution (im2col + GEMM)       (checksum: 0) elapsed_ms: 145
TensorFlow Eigen::SpatialConvolution            (checksum: 0) elapsed_ms: 137
g++-5 -std=c++14 -w -march=native -O3 spatial_convolution_test.cpp -o spatial_convolution_test && ./spatial_convolution_test

frugally-deep convolution (im2col + GEMM)       (checksum: 0) elapsed_ms: 83
TensorFlow Eigen::SpatialConvolution            (checksum: 0) elapsed_ms: 76
g++-7 -std=c++14 -w -mavx -O3 spatial_convolution_test.cpp -o spatial_convolution_test && ./spatial_convolution_test

frugally-deep convolution (im2col + GEMM)       (checksum: 0) elapsed_ms: 144
TensorFlow Eigen::SpatialConvolution            (checksum: 0) elapsed_ms: 137
g++-7 -std=c++14 -w -march=native -O3 spatial_convolution_test.cpp -o spatial_convolution_test && ./spatial_convolution_test

frugally-deep convolution (im2col + GEMM)       (checksum: 0) elapsed_ms: 82
TensorFlow Eigen::SpatialConvolution            (checksum: 0) elapsed_ms: 75
clang++-3.8 -std=c++14 -w -mavx -O3 spatial_convolution_test.cpp -o spatial_convolution_test && ./spatial_convolution_test

frugally-deep convolution (im2col + GEMM)       (checksum: 0) elapsed_ms: 151
TensorFlow Eigen::SpatialConvolution            (checksum: 0) elapsed_ms: 143
clang++-3.9 -std=c++14 -w -march=native -O3 spatial_convolution_test.cpp -o spatial_convolution_test && ./spatial_convolution_test

frugally-deep convolution (im2col + GEMM)       (checksum: 0) elapsed_ms: 85
TensorFlow Eigen::SpatialConvolution            (checksum: 0) elapsed_ms: 75
clang++-6.0 -std=c++14 -w -mavx -O3 spatial_convolution_test.cpp -o spatial_convolution_test && ./spatial_convolution_test

frugally-deep convolution (im2col + GEMM)       (checksum: 0) elapsed_ms: 151
TensorFlow Eigen::SpatialConvolution            (checksum: 0) elapsed_ms: 143
clang++-6.0 -std=c++14 -w -march=native -O3 spatial_convolution_test.cpp -o spatial_convolution_test && ./spatial_convolution_test

frugally-deep convolution (im2col + GEMM)       (checksum: 0) elapsed_ms: 84
TensorFlow Eigen::SpatialConvolution            (checksum: 0) elapsed_ms: 75

pfeatherstone commented 5 years ago

I used

objdump -d binary > binary.asm
awk '/ \t[ \t]/' binary.asm

from https://superuser.com/questions/726395/how-to-check-if-a-binary-requires-sse4-or-avx-on-linux

against /usr/local/lib/python3.6/dist-packages/tensorflow/libtensorflow_framework.so.1 and got quite a few hits ...

pfeatherstone commented 5 years ago

Sorry the command is:

awk '/[ \t](vmovapd|vmulpd|vaddpd|vsubpd|vfmadd213pd|vfmadd231pd|vfmadd132pd|vmulsd|vaddsd|vmosd|vsubsd|vbroadcastss|vbroadcastsd|vblendpd|vshufpd|vroundpd|vroundsd|vxorpd|vfnmadd231pd|vfnmadd213pd|vfnmadd132pd|vandpd|vmaxpd|vmovmskpd|vcmppd|vpaddd|vbroadcastf128|vinsertf128|vextractf128|vfmsub231pd|vfmsub132pd|vfmsub213pd|vmaskmovps|vmaskmovpd|vpermilps|vpermilpd|vperm2f128|vzeroall|vzeroupper|vpbroadcastb|vpbroadcastw|vpbroadcastd|vpbroadcastq|vbroadcasti128|vinserti128|vextracti128|vpminud|vpmuludq|vgatherdpd|vgatherqpd|vgatherdps|vgatherqps|vpgatherdd|vpgatherdq|vpgatherqd|vpgatherqq|vpmaskmovd|vpmaskmovq|vpermps|vpermd|vpermpd|vpermq|vperm2i128|vpblendd|vpsllvd|vpsllvq|vpsrlvd|vpsrlvq|vpsravd|vblendmpd|vblendmps|vpblendmd|vpblendmq|vpblendmb|vpblendmw|vpcmpd|vpcmpud|vpcmpq|vpcmpuq|vpcmpb|vpcmpub|vpcmpw|vpcmpuw|vptestmd|vptestmq|vptestnmd|vptestnmq|vptestmb|vptestmw|vptestnmb|vptestnmw|vcompresspd|vcompressps|vpcompressd|vpcompressq|vexpandpd|vexpandps|vpexpandd|vpexpandq|vpermb|vpermw|vpermt2b|vpermt2w|vpermi2pd|vpermi2ps|vpermi2d|vpermi2q|vpermi2b|vpermi2w|vpermt2ps|vpermt2pd|vpermt2d|vpermt2q|vshuff32x4|vshuff64x2|vshuffi32x4|vshuffi64x2|vpmultishiftqb|vpternlogd|vpternlogq|vpmovqd|vpmovsqd|vpmovusqd|vpmovqw|vpmovsqw|vpmovusqw|vpmovqb|vpmovsqb|vpmovusqb|vpmovdw|vpmovsdw|vpmovusdw|vpmovdb|vpmovsdb|vpmovusdb|vpmovwb|vpmovswb|vpmovuswb|vcvtps2udq|vcvtpd2udq|vcvttps2udq|vcvttpd2udq|vcvtss2usi|vcvtsd2usi|vcvttss2usi|vcvttsd2usi|vcvtps2qq|vcvtpd2qq|vcvtps2uqq|vcvtpd2uqq|vcvttps2qq|vcvttpd2qq|vcvttps2uqq|vcvttpd2uqq|vcvtudq2ps|vcvtudq2pd|vcvtusi2ps|vcvtusi2pd|vcvtusi2sd|vcvtusi2ss|vcvtuqq2ps|vcvtuqq2pd|vcvtqq2pd|vcvtqq2ps|vgetexppd|vgetexpps|vgetexpsd|vgetexpss|vgetmantpd|vgetmantps|vgetmantsd|vgetmantss|vfixupimmpd|vfixupimmps|vfixupimmsd|vfixupimmss|vrcp14pd|vrcp14ps|vrcp14sd|vrcp14ss|vrndscaleps|vrndscalepd|vrndscaless|vrndscalesd|vrsqrt14pd|vrsqrt14ps|vrsqrt14sd|vrsqrt14ss|vscalefps|vscalefpd|vscalefss|vscalefsd|valignd|valignq|vdbpsadbw|vpabsq|vpmaxsq|vpmaxuq|vpminsq|vpminuq|vprold|vprolvd|vprolq|vprolvq|vprord|vprorvd|vprorq|vprorvq|vpscatterdd|vpscatterdq|vpscatterqd|vpscatterqq|vscatterdps|vscatterdpd|vscatterqps|vscatterqpd|vpconflictd|vpconflictq|vplzcntd|vplzcntq|vpbroadcastmb2q|vpbroadcastmw2d|vexp2pd|vexp2ps|vrcp28pd|vrcp28ps|vrcp28sd|vrcp28ss|vrsqrt28pd|vrsqrt28ps|vrsqrt28sd|vrsqrt28ss|vgatherpf0dps|vgatherpf0qps|vgatherpf0dpd|vgatherpf0qpd|vgatherpf1dps|vgatherpf1qps|vgatherpf1dpd|vgatherpf1qpd|vscatterpf0dps|vscatterpf0qps|vscatterpf0dpd|vscatterpf0qpd|vscatterpf1dps|vscatterpf1qps|vscatterpf1dpd|vscatterpf1qpd|vfpclassps|vfpclasspd|vfpclassss|vfpclasssd|vrangeps|vrangepd|vrangess|vrangesd|vreduceps|vreducepd|vreducess|vreducesd|vpmovm2d|vpmovm2q|vpmovm2b|vpmovm2w|vpmovd2m|vpmovq2m|vpmovb2m|vpmovw2m|vpmullq|vpmadd52luq|vpmadd52huq|v4fmaddps|v4fmaddss|v4fnmaddps|v4fnmaddss|vp4dpwssd|vp4dpwssds|vpdpbusd|vpdpbusds|vpdpwssd|vpdpwssds|vpcompressb|vpcompressw|vpexpandb|vpexpandw|vpshld|vpshldv|vpshrd|vpshrdv|vpopcntd|vpopcntq|vpopcntb|vpopcntw|vpshufbitqmb|gf2p8affineinvqb|gf2p8affineqb|gf2p8mulb|vpclmulqdq|vaesdec|vaesdeclast|vaesenc|vaesenclast)[ \t]/' binary.asm

Dobiasd commented 5 years ago

Thanks. Looking explicitly for AVX2 instructions only, I can confirm: TensorFlow has AVX2 instructions in its non-AVX2 binary.

sudo pip3 uninstall -y tensorflow==2.0.0
sudo pip3 install tensorflow==2.0.0

python3 -c "import tensorflow as tf; print(tf.__version__); a = tf.constant([1, 2]);"

# Look for AVX2 instructions
objdump -d /usr/local/lib/python3.7/dist-packages/tensorflow_core/libtensorflow_framework.so.2 | egrep -i "VBROADCASTSS|VBROADCASTSD|VPBROADCASTB|VPBROADCASTW|VPBROADCASTD|VPBROADCASTQ|VBROADCASTI128|VINSERTI128|VEXTRACTI128|VGATHERDPD|VGATHERQPD|VGATHERDPS|VGATHERQPS|VPGATHERDD|VPGATHERDQ|VPGATHERQD|VPGATHERQQ|VPMASKMOVD|VPMASKMOVQ|VPERMPS|VPERMD|VPERMPD|VPERMQ|VPERM2I128|VPBLENDD|VPSLLVD|VPSLLVQ|VPSRLVD|VPSRLVQ|VPSRAV" | wc -l

Output:

2.0.0

2019-11-18 11:16:29.205307: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-11-18 11:16:29.230405: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3312000000 Hz
2019-11-18 11:16:29.230657: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x52b77b0 executing computations on platform Host. Devices:
2019-11-18 11:16:29.230671: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version

424

Dobiasd commented 5 years ago

So, I guess from this we can conclude that there is not much to gain for us from switching to Eigen::Tensor right now.

And regarding the original post:

I ran a test by inferring the UNet model from pix2pix in frugally deep. It took 18s compared to a model converted from onnx and inferred in OpenCV which took 3s. I think this shows that convolutions in frugally could be improved.

This difference was probably caused by OpenCV using multiple CPU cores, right? If so, @pfeatherstone, would you consider this issue as solved?

pfeatherstone commented 5 years ago

Yeah, I think you've given this an incredibly thorough look. I would say this is resolved. Regarding OpenCV, yes, it was using all CPU cores. Thank you very much for your time!

Dobiasd commented 5 years ago

Thank you for the nice words and also the original input. I still have the gut feeling that a convolution implementation that is significantly faster than im2col+GEMM must be possible somehow, mainly because of the duplication of values in memory from the overlapping receptive fields. It just seems that nobody has found it yet.
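
To put a rough number on that duplication, a quick back-of-the-envelope sketch (sizes taken from the VGG19-like layer used above; illustration only):

// im2col_memory_sketch.cpp
#include <cstddef>
#include <iostream>

int main() {
    const std::size_t h = 56, w = 56, d = 256;   // input height, width, depth
    const std::size_t kh = 3, kw = 3;            // 3x3 filter, stride 1, same padding
    const std::size_t input_floats = h * w * d;
    const std::size_t im2col_floats = h * w * (kh * kw * d); // one column per output pixel
    const double to_mib = sizeof(float) / (1024.0 * 1024.0);
    std::cout << "input:  " << input_floats * to_mib << " MiB" << std::endl;  // ~3 MiB
    std::cout << "im2col: " << im2col_floats * to_mib << " MiB" << std::endl; // ~27 MiB
    // With a 3x3 kernel the im2col buffer holds roughly 9x the input data,
    // because the overlapping receptive fields copy each value up to 9 times.
}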

But this is an endeavor I can silently continue to explore in the customconvolution branch when I feel particularly smart. :grin:

Dobiasd commented 5 years ago

For our information: according to this answer on Stack Overflow, Eigen::SpatialConvolution, as implemented in the TensorFlow code, is, at its core, just a normal im2col convolution that uses Eigen's GEMM implementation. So it's no big surprise that it shows a performance profile similar to fdeep's convolution (also im2col). :smile:

pfeatherstone commented 4 years ago

OK, so I just compared the performance of single convolutions (typical VGG19-layer config) in frugally-deep with TensorFlow (Eigen::SpatialConvolution): [...] At least according to my profiler, it's spending the vast majority of the time there. [...]

Hi @Dobiasd! Which profiler did you use?

Dobiasd commented 4 years ago

sysprof

pfeatherstone commented 4 years ago

cheers

pfeatherstone commented 4 years ago

@Dobiasd sorry to ask this again; it's totally unrelated to the issue. Can you get sysprof to work with a single application? If I hook to 1 or 2 processes, the number of samples never exceeds 0...

Dobiasd commented 4 years ago

I usually just run sudo sysprof and let it profile my system globally. The results are mostly good enough ™️. 🙂

pfeatherstone commented 4 years ago

Hmm. OKidok. Don't really understand why it doesn't work with a single application. It seems there aren't that many good C++ profilers out there, which is surprising. I found google/orbit, but can't get it to build.

pfeatherstone commented 4 years ago

FYI, using perf with https://github.com/KDAB/hotspot works really well