Closed: pfeatherstone closed this issue 5 years ago
In fact, using the Eigen::Tensor class might be a more efficient container than Eigen::Matrix, since Tensor can keep track of operations lazily without doing any evaluations until they are needed. ArrayFire uses a similar concept. Just a thought.
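A small illustration of that lazy-evaluation idea (a sketch using Eigen's unsupported Tensor module; the shapes and operations are arbitrary examples, not anything from frugally-deep):
// eigen_tensor_lazy_eval_sketch.cpp (illustrative only)
#include <eigen3/unsupported/Eigen/CXX11/Tensor>
#include <iostream>
int main() {
    Eigen::Tensor<float, 2> a(256, 256);
    Eigen::Tensor<float, 2> b(256, 256);
    a.setRandom();
    b.setRandom();
    // This only builds an expression object; no arithmetic has happened yet.
    const auto expr = (a + b) * 0.5f;
    // Evaluation happens when the expression is assigned to a concrete tensor
    // (or when .eval() is called), so chained operations can be evaluated in one pass.
    const Eigen::Tensor<float, 2> c = expr;
    std::cout << c(0, 0) << std::endl;
}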
I noticed you implement your own GEMM operation in fdeep/convolution.hpp in function convolve_im2col.
No, I don't. :) The loops in this function are just the im2col conversion. The actual GEMM is a call to Eigen: https://github.com/Dobiasd/frugally-deep/blob/3eafef23d63594049e4975106798e31342f22c96/include/fdeep/convolution.hpp#L127
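For context, here is a minimal sketch of the im2col + GEMM idea using plain Eigen matrices. It is illustrative only, not frugally-deep's actual implementation; the single-channel, single-filter, valid-padding, stride-1 setup is a simplifying assumption:
// im2col_gemm_sketch.cpp (illustrative only)
#include <eigen3/Eigen/Dense>
#include <iostream>
int main() {
    const int in_h = 6, in_w = 6, k_h = 3, k_w = 3;
    const int out_h = in_h - k_h + 1, out_w = in_w - k_w + 1;
    const Eigen::MatrixXf input = Eigen::MatrixXf::Random(in_h, in_w);
    Eigen::MatrixXf filter = Eigen::MatrixXf::Random(k_h, k_w);
    // im2col: every output position becomes one column holding its receptive field
    // (flattened in Eigen's column-major order, to match the flattened filter below).
    Eigen::MatrixXf cols(k_h * k_w, out_h * out_w);
    for (int y = 0; y < out_h; ++y)
        for (int x = 0; x < out_w; ++x)
            for (int ky = 0; ky < k_h; ++ky)
                for (int kx = 0; kx < k_w; ++kx)
                    cols(kx * k_h + ky, y * out_w + x) = input(y + ky, x + kx);
    // GEMM: multiplying the flattened filter with the column matrix yields all output
    // pixels in one matrix product (with many filters the filter rows stack into a
    // matrix and this becomes a proper GEMM; with one filter it degenerates to a GEMV).
    const Eigen::Map<const Eigen::RowVectorXf> filter_row(filter.data(), k_h * k_w);
    const Eigen::RowVectorXf out = filter_row * cols;
    std::cout << "output at (0, 0): " << out(0) << std::endl;
}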
I ran a test by inferring the UNet model from pix2pix in frugally-deep. It took 18 s, compared to a model converted from ONNX and inferred in OpenCV, which took 3 s. I think this shows that convolutions in frugally-deep could be improved.
18 s vs. 3 s is quite a big difference.
Did you enable all compiler optimizations for speed (-O3)?
Did you allow your compiler to fully use the latest vectorizing instructions your CPU provides (-march=native)?
Did OpenCV utilize multiple CPU cores?
Did OpenCV utilize a GPU?
I'm asking all this, because usually Eigen's GEMM operations are very fast.
Nevertheless, in two weeks I'll be able to have a deeper look at your recent suggestions.
Oops. I misread the code. Regarding OpenCV, it was a CPU-only build. It's possible it was using OpenMP though. Regarding fdeep, I always use "-Ofast -march=native" as compiler options. So the best it can be. The OpenCV DNN module is insanely fast to be fair though.
I noticed https://github.com/bcaine/nn_cpp is using Eigen::Tensor as a backend container. Maybe TensorFlow is using it too. It would be interesting to see how it fares against Eigen::Matrix. Maybe I shouldn't be so lazy and should do some tests myself.
I noticed https://github.com/bcaine/nn_cpp is using Eigen::Tensor as a backend container. Maybe TensorFlow is using it too.
It looks like it, yes.
It would be interesting to see how it fares against Eigen::Matrix. Maybe I shouldn't be so lazy and should do some tests myself.
Maybe the mini benchmark from issue 166 can serve as a starting point? :slightly_smiling_face:
As per my comment in issue 166, I will test Eigen::Tensor and a few other libraries with chained operations when I get back from holiday. The basic building block in neural networks is (conv, batchnorm, relu), so I will start with repeated blocks like that.
It would also be very interesting to learn if these pre-implemented algorithms can provide results which are per-pixel identical to what Keras produces.
I'm getting the feeling that you require the exact same results as Keras down to the precision of floating-point numbers, i.e., 10^-34. I've noticed that some frameworks don't always give the exact same results. I don't know if that is simply due to floating-point precision, fast-math approximations or what, but I do sympathise when you're trying to get the exact same results as your training framework.
I'm getting the feeling that you require the exact same results as Keras down to the precision of floating-point numbers, i.e., 10^-34.
No, it's not that bad. :)
When you load a frugally-deep model, the tests check if the prediction in C++ gives the same results as Keras did for some inputs that were persisted with the model during conversion. Currently the default epsilon for this comparison is 0.0001.
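Conceptually, the check amounts to something like the following sketch (not the actual fdeep test code; the function name and the flat-vector interface are just assumptions for illustration):
// epsilon_check_sketch.cpp (illustrative only)
#include <cmath>
#include <cstddef>
#include <vector>
// Compare a C++ prediction against the Keras reference values that were
// stored alongside the converted model.
bool outputs_match(const std::vector<float>& cpp_out,
                   const std::vector<float>& keras_out,
                   float epsilon = 0.0001f)
{
    if (cpp_out.size() != keras_out.size())
        return false;
    for (std::size_t i = 0; i < cpp_out.size(); ++i)
        if (std::fabs(cpp_out[i] - keras_out[i]) > epsilon)
            return false;
    return true;
}
int main()
{
    const std::vector<float> cpp_prediction = {0.12349f, 0.67890f};
    const std::vector<float> keras_reference = {0.12345f, 0.67890f};
    return outputs_match(cpp_prediction, keras_reference) ? 0 : 1; // within the default epsilon
}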
So I'm not worried about floating-point precision. Instead, I guess ready-made convolution libraries will not adhere to the same idiosyncratic padding rules (see here and here) that Keras uses. It was quite a pain to reverse-engineer and emulate those. They can even differ between TensorFlow versions and between running the Python scripts on CPU and GPU. To capture those machine differences, these wild checks are done during model conversion.
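For illustration, here is a small sketch of the TensorFlow-style "same" padding rule as I understand it (the helper name and the 1D simplification are assumptions; the point is that an odd total padding puts the extra pixel at the end, i.e. bottom/right):
// same_padding_sketch.cpp (illustrative only)
#include <cstddef>
#include <iostream>
struct Padding { std::size_t before; std::size_t after; };
// "same" padding along one dimension: enough padding to get ceil(in / stride)
// output positions, with the odd extra pixel placed at the end.
Padding same_padding_1d(std::size_t in, std::size_t kernel, std::size_t stride)
{
    const std::size_t out = (in + stride - 1) / stride;     // ceil(in / stride)
    const std::size_t needed = (out - 1) * stride + kernel; // extent covered by all windows
    const std::size_t total = needed > in ? needed - in : 0;
    return {total / 2, total - total / 2};
}
int main()
{
    const auto p = same_padding_1d(56, 3, 1);  // 1/1: symmetric, VGG-like case
    const auto q = same_padding_1d(224, 3, 2); // 0/1: asymmetric, strided case
    std::cout << p.before << "/" << p.after << " and "
              << q.before << "/" << q.after << std::endl;
}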
Thus, when we find a GEMM lib which is faster than Eigen, we can just replace that one call. However, I expect replacing the whole convolution (or even chains of convolutions) would break the compatibility with Keras. And I'm not willing to sacrifice this. The issue section here would be flooded with "fdeep gives the wrong results" posts. :grimacing:
Thus it might make sense to have a closer look at Eigen Tensors. :slightly_smiling_face:
Playing around with Eigen::Tensor::convolve, it seems to be slower than what TensorFlow is doing:
# conv2d_performance_test.py
import datetime
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Conv2D
from tensorflow.keras.models import Model
inputs = Input(shape=(1024, 1024, 128))
x = Conv2D(1, (3, 3))(inputs)
model = Model(inputs=inputs, outputs=x)
model.compile(loss='categorical_crossentropy', optimizer='nadam')
data_in = np.random.normal(size=(1, 1024, 1024, 128))
model.predict(data_in)
print(f'tensorflow=={tf.__version__}')
for _ in range(5):
    start_time = datetime.datetime.now()
    model.predict(data_in)
    duration = datetime.datetime.now() - start_time
    print('Forward pass took {} s.'.format(duration.total_seconds()))
CUDA_VISIBLE_DEVICES='' taskset --cpu-list 1 python3 conv2d_performance_test.py
Forward pass took 2.512061 s.
Forward pass took 2.435108 s.
Forward pass took 2.427209 s.
Forward pass took 2.502358 s.
Forward pass took 2.577842 s.
vs.
// eigen_tensor_convolve_multiple_filters.cpp
#include <chrono>
#include <iostream>
#include <eigen3/unsupported/Eigen/CXX11/Tensor>
int main() {
    Eigen::Tensor<float, 3> input(128, 1024, 1024);
    Eigen::Tensor<float, 3> filter(128, 3, 3);
    using namespace std::chrono;
    for (std::size_t run = 0; run < 5; ++run) {
        const auto start_time_ns = high_resolution_clock::now().time_since_epoch().count();
        Eigen::array<ptrdiff_t, 3> dims({0, 1, 2});
        Eigen::Tensor<float, 3> output = input.convolve(filter, dims);
        const auto end_time_ns = high_resolution_clock::now().time_since_epoch().count();
        const auto elapsed_s = ((end_time_ns - start_time_ns) / 1000000) / 1000.0;
        std::cout << "Convolution took " << elapsed_s << " s." << std::endl;
    }
}
g++ -std=c++14 -O3 -mavx eigen_tensor_convolve_multiple_filters.cpp -o eigen_tensor_convolve_multiple_filters
./eigen_tensor_convolve_multiple_filters
Convolution took 3.736 s.
Convolution took 3.633 s.
Convolution took 3.751 s.
Convolution took 3.629 s.
Convolution took 3.756 s.
TensorFlow is using Eigen::SpatialConvolution instead of Eigen::Tensor::convolve. Sadly this Eigen::SpatialConvolution seems to not be part of the official Eigen lib, but it's something custom in the TensorFlow codebase. They are just using the Eigen namespace for that.
Quick status update:
- Eigen::Tensor::convolve seems to not support multiple kernels in one call. So multiple runs, with a possible additional performance hit, would be needed. But even with just one call (one filter) it was already slower compared to TensorFlow.
- I tried a custom convolution implementation (branch name customconvolution), trying to minimize cache misses, making things vectorizable (even manually writing SIMD (AVX) code), making common filter sizes/depths known at compile-time, etc. However, I did not reach better performance.
- Eigen::SpatialConvolution from TensorFlow is not (yet?) merged upstream (into the Eigen library), but I'll try to reuse it anyway (branch name spatialconvolution). It's a bit ugly (architecture-wise), but if this allows doubling the convolution performance, it should be worth it.

OK, so I just compared the performance of single convolutions (typical VGG19-layer config) in frugally-deep with TensorFlow (Eigen::SpatialConvolution):
//spatial_convolution_test.cpp
#include <chrono>
#include <iostream>
#include <eigen3/Eigen/Core>
#include <eigen3/unsupported/Eigen/CXX11/Tensor>
// https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/eigen_spatial_convolutions-inl.h
// https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/eigen_convolution_helpers.h
#include "tensorflow/core/kernels/eigen_spatial_convolutions-inl.h"
#include <fdeep/fdeep.hpp>
// Like in a typical VGG19 layer
const std::size_t k = 512;
const std::size_t x_width = 56;
const std::size_t x_height = 56;
const std::size_t x_depth = 256;
const std::size_t filter_height = 3;
const std::size_t filter_width = 3;
const std::size_t filter_depth = x_depth;
fdeep::internal::conv_2d_layer fdeep_conv_layer(
    "test_conv_layer",
    fdeep::shape5(1, 1, filter_height, filter_width, x_depth),
    k,
    fdeep::internal::shape2(1, 1),
    fdeep::internal::padding::same,
    fdeep::internal::shape2(1, 1),
    fdeep::float_vec(filter_height * filter_width * x_depth * k, 0),
    fdeep::float_vec(k, 0));
const fdeep::tensor5 x_fdeep(fdeep::shape5(1, 1, x_height, x_width, x_depth), 0);
Eigen::Tensor<float, 3> x_spatial_conv(x_depth, x_height, x_width);
Eigen::Tensor<float, 4> filters_spatial_conv(k, filter_depth, filter_height, filter_width);
float fdeep_im2col_conv()
{
    const auto result = fdeep_conv_layer.apply({x_fdeep});
    return result.front().get(0, 0, 0, 0, 0);
}
float eigen_spatial_conv()
{
    const Eigen::Tensor<float, 3> dest = SpatialConvolution(
        x_spatial_conv, filters_spatial_conv);
    return dest(0, 0, 0);
}
template <typename Func>
void measure(const std::string& name, const Func f)
{
    using namespace std::chrono;
    float checksum = 0.0f; // to prevent compiler from optimizing everything away
    const auto start_time_ns = high_resolution_clock::now().time_since_epoch().count();
    const std::size_t runs = 10;
    for (size_t i = 0; i < runs; ++i)
    {
        checksum += f();
    }
    const auto end_time_ns = high_resolution_clock::now().time_since_epoch().count();
    const auto elapsed_ms = (end_time_ns - start_time_ns) / (runs * 1000000);
    std::cout << name << " (checksum: " << checksum << ") elapsed_ms: " << elapsed_ms << std::endl;
}
int main()
{
    measure("frugally-deep convolution (im2col + GEMM) ", fdeep_im2col_conv);
    measure("TensorFlow Eigen::SpatialConvolution ", eigen_spatial_conv);
}
Output:
frugally-deep convolution (im2col + GEMM) (checksum: 0) elapsed_ms: 143
TensorFlow Eigen::SpatialConvolution (checksum: 0) elapsed_ms: 136
So the difference is marginal. It does not explain the huge difference of 0.93 s vs. 0.48 s for a forward pass on a VGG19 model. And it's not that frugally-deep spends all this time outside of the convolution code. At least according to my profiler, it's spending the vast majority of the time there:
(profiling of forward passes on a VGG19 model)
So my conclusion up to now is:
- TensorFlow's single convolution is not significantly faster than the one in frugally-deep.
- TensorFlow nevertheless is roughly twice as fast on a convolution-heavy model.
- Thus they do something else, e.g., fusing consecutive convolutions or something. I don't know yet. I'm trying to understand their code, but it's not easy for me.
OK, something fishy is going on here. :fish:
The following minimal Python benchmark (just one convolution layer)
# conv2d_performance_vgg19_layer.py
import datetime
import numpy as np
from tensorflow.keras.layers import Input, Conv2D
from tensorflow.keras.models import Model
# Like in a typical VGG19 layer
k = 512
x_width = 56
x_height = 56
x_depth = 256
filter_height = 3
filter_width = 3
filter_depth = x_depth
inputs = Input(shape=(x_height, x_width, x_depth))
x = Conv2D(k, (filter_height, filter_width))(inputs)
model = Model(inputs=inputs, outputs=x)
model.compile(loss='categorical_crossentropy', optimizer='nadam')
data_in = np.random.normal(size=(1, x_height, x_width, x_depth))
model.predict(data_in)
duration_s = 0.0
runs = 10
for _ in range(runs):
    start_time = datetime.datetime.now()
    model.predict(data_in)
    duration_s += (datetime.datetime.now() - start_time).total_seconds()
print('Average forward-pass time in seconds: {}'.format(duration_s / runs))
ran like that:
CUDA_VISIBLE_DEVICES='' taskset --cpu-list 1 python3 conv2d_performance_vgg19_layer.py
prints (using the default wheel of TensorFlow 2.0.0)
Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
and results in:
Average forward-pass time in seconds: 0.0983517
According to htop it's only using one CPU core.
When allowed to use multiple cores like that
CUDA_VISIBLE_DEVICES='' python3 conv2d_performance_vgg19_layer.py
it becomes even faster.
Average forward-pass time in seconds: 0.05571
So it really is only using one CPU core in the test that results in 0.0983517 seconds.
The same convolution in C++, using the original SpatialConvolution code from the TensorFlow repository:
//conv2d_performance_vgg19_layer.cpp
#include <chrono>
#include <iostream>
#include <eigen3/Eigen/Core>
#include <eigen3/unsupported/Eigen/CXX11/Tensor>
// https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/eigen_spatial_convolutions-inl.h
// https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/eigen_convolution_helpers.h
#include "tensorflow/core/kernels/eigen_spatial_convolutions-inl.h"
// Like in a typical VGG19 layer
const std::size_t k = 512;
const std::size_t x_width = 56;
const std::size_t x_height = 56;
const std::size_t x_depth = 256;
const std::size_t filter_height = 3;
const std::size_t filter_width = 3;
const std::size_t filter_depth = x_depth;
Eigen::Tensor<float, 3> x_spatial_conv(x_depth, x_height, x_width);
Eigen::Tensor<float, 4> filters_spatial_conv(k, filter_depth, filter_height, filter_width);
int main()
{
    using namespace std::chrono;
    volatile float checksum = 0.0f; // to prevent compiler from optimizing everything away
    const auto start_time_ns = high_resolution_clock::now().time_since_epoch().count();
    const std::size_t runs = 10;
    for (size_t i = 0; i < runs; ++i)
    {
        const Eigen::Tensor<float, 3> dest = SpatialConvolution(
            x_spatial_conv, filters_spatial_conv);
        checksum = checksum + dest(0, 0, 0); // actually use the result, so it cannot be elided
    }
    const auto end_time_ns = high_resolution_clock::now().time_since_epoch().count();
    const auto mean_elapsed_s = static_cast<double>(end_time_ns - start_time_ns) / (runs * 1000000000);
    std::cout << "Average convolution time in seconds: " << mean_elapsed_s << std::endl;
}
compiled and ran like that:
g++ -std=c++14 -w -mavx -O3 conv2d_performance_vgg19_layer.cpp -o conv2d_performance_vgg19_layer_avx
./conv2d_performance_vgg19_layer_avx
results in:
Average convolution time in seconds: 0.134459
So just the convolution alone in C++ takes more time than the whole forward pass in Python.
But I can make it faster by allowing more advanced SIMD instructions:
g++ -std=c++14 -w -march=native -O3 conv2d_performance_vgg19_layer.cpp -o conv2d_performance_vgg19_layer_native
./conv2d_performance_vgg19_layer_native
(I'm using a Intel Core i5-6600 CPU @ 3.30GHz for all these tests.)
The result then is:
Average convolution time in seconds: 0.0744446
So my suspicion is that the TensorFlow wheel somehow uses AVX2 (or whatever) despite claiming that it cannot. Maybe some internal CPU detection and function-pointer bending is happening.
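Such runtime dispatch could in principle look like the following sketch. It uses the GCC/Clang builtin __builtin_cpu_supports on x86; this is a generic illustration, not TensorFlow's actual mechanism, and the kernel names are made up:
// cpu_dispatch_sketch.cpp (illustrative only)
#include <iostream>
// The binary ships several code paths and picks one at runtime via a function pointer.
static float dot_generic(const float* a, const float* b, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += a[i] * b[i];
    return s;
}
// In a real library this variant would be built with -mavx2/-mfma or use intrinsics;
// here it is just a placeholder with the same signature.
static float dot_avx2(const float* a, const float* b, int n)
{
    return dot_generic(a, b, n);
}
int main()
{
    __builtin_cpu_init(); // GCC/Clang builtin for x86 CPU feature detection
    float (*dot)(const float*, const float*, int) =
        __builtin_cpu_supports("avx2") ? &dot_avx2 : &dot_generic;
    const float a[4] = {1, 2, 3, 4};
    const float b[4] = {5, 6, 7, 8};
    std::cout << dot(a, b, 4) << std::endl; // prints 70
}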
So, I've built TensorFlow from source (with g++ 7.1, using -march=native, which is the default in the bazel-based build, according to the documentation) and installed the resulting wheel.
Now
CUDA_VISIBLE_DEVICES='' taskset --cpu-list 1 python3 conv2d_performance_vgg19_layer.py
results in
Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
[...]
Average forward-pass time in seconds: 0.096195
It has not become significantly faster. :point_up: (The remark about SSE, etc., again, seems to be incorrect.)
Thus, I, for now, "accuse" the default TensorFlow wheel, which states to not use AVX2/FMA, of "cheating". :grin:
Based on that, it seems to me the only fair comparison between TensorFlow and frugally-deep is to use -march=native for both. And then it looks much better for frugally-deep. The performance on VGG19 actually is quite similar! (0.48 s vs. 0.54 s)
Conclusion (up to now): Actually TensorFlow likely did not improve its convolution-on-CPU performance significantly between versions 1.13.2 and 1.15.0 like I initially thought. The devs probably just allowed more advanced SIMD instructions in their default binary (by whatever means). Frugally-deep is similarly fast.
For completeness: The spatial_convolution_test.cpp from above, compiled with -march=native, gives the following output:
frugally-deep convolution (im2col + GEMM) (checksum: 0) elapsed_ms: 80
TensorFlow Eigen::SpatialConvolution (checksum: 0) elapsed_ms: 74
So it confirms the recent observations.
We can use objdump against the TensorFlow library and search for AVX2 instructions
Also have you tried compiling your code with different compilers? I find I get varying performance with gcc and clang. Even between versions of the same compiler actually.
We can use objdump against the TensorFlow library and search for AVX2 instructions
Sounds interesting. In case you already know how to do it: could you check the default pip wheel of TensorFlow 2.0.0?
Also have you tried compiling your code with different compilers? I find I get varying performance with gcc and clang. Even between versions of the same compiler actually.
Using spatial_convolution_test.cpp from above, it's quite consistent:
g++-5 -std=c++14 -w -mavx -O3 spatial_convolution_test.cpp -o spatial_convolution_test && ./spatial_convolution_test
frugally-deep convolution (im2col + GEMM) (checksum: 0) elapsed_ms: 145
TensorFlow Eigen::SpatialConvolution (checksum: 0) elapsed_ms: 137
g++-5 -std=c++14 -w -march=native -O3 spatial_convolution_test.cpp -o spatial_convolution_test && ./spatial_convolution_test
frugally-deep convolution (im2col + GEMM) (checksum: 0) elapsed_ms: 83
TensorFlow Eigen::SpatialConvolution (checksum: 0) elapsed_ms: 76
g++-7 -std=c++14 -w -mavx -O3 spatial_convolution_test.cpp -o spatial_convolution_test && ./spatial_convolution_test
frugally-deep convolution (im2col + GEMM) (checksum: 0) elapsed_ms: 144
TensorFlow Eigen::SpatialConvolution (checksum: 0) elapsed_ms: 137
g++-7 -std=c++14 -w -march=native -O3 spatial_convolution_test.cpp -o spatial_convolution_test && ./spatial_convolution_test
frugally-deep convolution (im2col + GEMM) (checksum: 0) elapsed_ms: 82
TensorFlow Eigen::SpatialConvolution (checksum: 0) elapsed_ms: 75
clang++-3.8 -std=c++14 -w -mavx -O3 spatial_convolution_test.cpp -o spatial_convolution_test && ./spatial_convolution_test
frugally-deep convolution (im2col + GEMM) (checksum: 0) elapsed_ms: 151
TensorFlow Eigen::SpatialConvolution (checksum: 0) elapsed_ms: 143
clang++-3.9 -std=c++14 -w -march=native -O3 spatial_convolution_test.cpp -o spatial_convolution_test && ./spatial_convolution_test
frugally-deep convolution (im2col + GEMM) (checksum: 0) elapsed_ms: 85
TensorFlow Eigen::SpatialConvolution (checksum: 0) elapsed_ms: 75
clang++-6.0 -std=c++14 -w -mavx -O3 spatial_convolution_test.cpp -o spatial_convolution_test && ./spatial_convolution_test
frugally-deep convolution (im2col + GEMM) (checksum: 0) elapsed_ms: 151
TensorFlow Eigen::SpatialConvolution (checksum: 0) elapsed_ms: 143
clang++-6.0 -std=c++14 -w -march=native -O3 spatial_convolution_test.cpp -o spatial_convolution_test && ./spatial_convolution_test
frugally-deep convolution (im2col + GEMM) (checksum: 0) elapsed_ms: 84
TensorFlow Eigen::SpatialConvolution (checksum: 0) elapsed_ms: 75
I used
objdump -d binary > binary.asm
awk '/ \t[ \t]/' binary.asm
from https://superuser.com/questions/726395/how-to-check-if-a-binary-requires-sse4-or-avx-on-linux
against /usr/local/lib/python3.6/dist-packages/tensorflow/libtensorflow_framework.so.1 and got quite a few hits ...
Sorry the command is:
awk '/[ \t](vmovapd|vmulpd|vaddpd|vsubpd|vfmadd213pd|vfmadd231pd|vfmadd132pd|vmulsd|vaddsd|vmosd|vsubsd|vbroadcastss|vbroadcastsd|vblendpd|vshufpd|vroundpd|vroundsd|vxorpd|vfnmadd231pd|vfnmadd213pd|vfnmadd132pd|vandpd|vmaxpd|vmovmskpd|vcmppd|vpaddd|vbroadcastf128|vinsertf128|vextractf128|vfmsub231pd|vfmsub132pd|vfmsub213pd|vmaskmovps|vmaskmovpd|vpermilps|vpermilpd|vperm2f128|vzeroall|vzeroupper|vpbroadcastb|vpbroadcastw|vpbroadcastd|vpbroadcastq|vbroadcasti128|vinserti128|vextracti128|vpminud|vpmuludq|vgatherdpd|vgatherqpd|vgatherdps|vgatherqps|vpgatherdd|vpgatherdq|vpgatherqd|vpgatherqq|vpmaskmovd|vpmaskmovq|vpermps|vpermd|vpermpd|vpermq|vperm2i128|vpblendd|vpsllvd|vpsllvq|vpsrlvd|vpsrlvq|vpsravd|vblendmpd|vblendmps|vpblendmd|vpblendmq|vpblendmb|vpblendmw|vpcmpd|vpcmpud|vpcmpq|vpcmpuq|vpcmpb|vpcmpub|vpcmpw|vpcmpuw|vptestmd|vptestmq|vptestnmd|vptestnmq|vptestmb|vptestmw|vptestnmb|vptestnmw|vcompresspd|vcompressps|vpcompressd|vpcompressq|vexpandpd|vexpandps|vpexpandd|vpexpandq|vpermb|vpermw|vpermt2b|vpermt2w|vpermi2pd|vpermi2ps|vpermi2d|vpermi2q|vpermi2b|vpermi2w|vpermt2ps|vpermt2pd|vpermt2d|vpermt2q|vshuff32x4|vshuff64x2|vshuffi32x4|vshuffi64x2|vpmultishiftqb|vpternlogd|vpternlogq|vpmovqd|vpmovsqd|vpmovusqd|vpmovqw|vpmovsqw|vpmovusqw|vpmovqb|vpmovsqb|vpmovusqb|vpmovdw|vpmovsdw|vpmovusdw|vpmovdb|vpmovsdb|vpmovusdb|vpmovwb|vpmovswb|vpmovuswb|vcvtps2udq|vcvtpd2udq|vcvttps2udq|vcvttpd2udq|vcvtss2usi|vcvtsd2usi|vcvttss2usi|vcvttsd2usi|vcvtps2qq|vcvtpd2qq|vcvtps2uqq|vcvtpd2uqq|vcvttps2qq|vcvttpd2qq|vcvttps2uqq|vcvttpd2uqq|vcvtudq2ps|vcvtudq2pd|vcvtusi2ps|vcvtusi2pd|vcvtusi2sd|vcvtusi2ss|vcvtuqq2ps|vcvtuqq2pd|vcvtqq2pd|vcvtqq2ps|vgetexppd|vgetexpps|vgetexpsd|vgetexpss|vgetmantpd|vgetmantps|vgetmantsd|vgetmantss|vfixupimmpd|vfixupimmps|vfixupimmsd|vfixupimmss|vrcp14pd|vrcp14ps|vrcp14sd|vrcp14ss|vrndscaleps|vrndscalepd|vrndscaless|vrndscalesd|vrsqrt14pd|vrsqrt14ps|vrsqrt14sd|vrsqrt14ss|vscalefps|vscalefpd|vscalefss|vscalefsd|valignd|valignq|vdbpsadbw|vpabsq|vpmaxsq|vpmaxuq|vpminsq|vpminuq|vprold|vprolvd|vprolq|vprolvq|vprord|vprorvd|vprorq|vprorvq|vpscatterdd|vpscatterdq|vpscatterqd|vpscatterqq|vscatterdps|vscatterdpd|vscatterqps|vscatterqpd|vpconflictd|vpconflictq|vplzcntd|vplzcntq|vpbroadcastmb2q|vpbroadcastmw2d|vexp2pd|vexp2ps|vrcp28pd|vrcp28ps|vrcp28sd|vrcp28ss|vrsqrt28pd|vrsqrt28ps|vrsqrt28sd|vrsqrt28ss|vgatherpf0dps|vgatherpf0qps|vgatherpf0dpd|vgatherpf0qpd|vgatherpf1dps|vgatherpf1qps|vgatherpf1dpd|vgatherpf1qpd|vscatterpf0dps|vscatterpf0qps|vscatterpf0dpd|vscatterpf0qpd|vscatterpf1dps|vscatterpf1qps|vscatterpf1dpd|vscatterpf1qpd|vfpclassps|vfpclasspd|vfpclassss|vfpclasssd|vrangeps|vrangepd|vrangess|vrangesd|vreduceps|vreducepd|vreducess|vreducesd|vpmovm2d|vpmovm2q|vpmovm2b|vpmovm2w|vpmovd2m|vpmovq2m|vpmovb2m|vpmovw2m|vpmullq|vpmadd52luq|vpmadd52huq|v4fmaddps|v4fmaddss|v4fnmaddps|v4fnmaddss|vp4dpwssd|vp4dpwssds|vpdpbusd|vpdpbusds|vpdpwssd|vpdpwssds|vpcompressb|vpcompressw|vpexpandb|vpexpandw|vpshld|vpshldv|vpshrd|vpshrdv|vpopcntd|vpopcntq|vpopcntb|vpopcntw|vpshufbitqmb|gf2p8affineinvqb|gf2p8affineqb|gf2p8mulb|vpclmulqdq|vaesdec|vaesdeclast|vaesenc|vaesenclast)[ \t]/' binary.asm
Thanks. Looking explicitly for AVX2 instructions only, I can confirm: TensorFlow has AVX2 instructions in its non-AVX2 binary.
sudo pip3 uninstall -y tensorflow==2.0.0
sudo pip3 install tensorflow==2.0.0
python3 -c "import tensorflow as tf; print(tf.__version__); a = tf.constant([1, 2]);"
# Look for AVX2 instructions
objdump -d /usr/local/lib/python3.7/dist-packages/tensorflow_core/libtensorflow_framework.so.2 | egrep -i "VBROADCASTSS|VBROADCASTSD|VPBROADCASTB|VPBROADCASTW|VPBROADCASTD|VPBROADCASTQ|VBROADCASTI128|VINSERTI128|VEXTRACTI128|VGATHERDPD|VGATHERQPD|VGATHERDPS|VGATHERQPS|VPGATHERDD|VPGATHERDQ|VPGATHERQD|VPGATHERQQ|VPMASKMOVD|VPMASKMOVQ|VPERMPS|VPERMD|VPERMPD|VPERMQ|VPERM2I128|VPBLENDD|VPSLLVD|VPSLLVQ|VPSRLVD|VPSRLVQ|VPSRAV" | wc -l
Output:
2.0.0
2019-11-18 11:16:29.205307: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-11-18 11:16:29.230405: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3312000000 Hz
2019-11-18 11:16:29.230657: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x52b77b0 executing computations on platform Host. Devices:
2019-11-18 11:16:29.230671: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Host, Default Version
424
So, I guess from this we can conclude that there is not much to gain from switching to Eigen::Tensor for us right now.
And regarding the original post:
I ran a test by inferring the UNet model from pix2pix in frugally-deep. It took 18 s, compared to a model converted from ONNX and inferred in OpenCV, which took 3 s. I think this shows that convolutions in frugally-deep could be improved.
This difference was probably caused by OpenCV using multiple CPU cores, right? If so, @pfeatherstone, would you consider this issue as solved?
Yeah, I think you've given this an incredibly thorough look. I would say this is resolved. Regarding OpenCV, yes, it was using all CPU cores. Thank you very much for your time!
Thank you for the nice words and also the original input. I still have the gut feeling that a convolution implementation which is significantly faster than im2col + GEMM must be possible somehow, mainly because of the duplication of values in memory from the overlapping receptive fields. It just seems that nobody has found it yet.
But this is an endeavor I can silently continue to explore in the customconvolution branch when I feel particularly smart. :grin:
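For reference, the "no duplication" alternative to im2col is a direct convolution that reads each input value in place through the overlapping windows instead of copying it kernel-size times into a buffer. A naive single-channel sketch (illustrative only, nothing like the cache-aware/SIMD attempts in that branch):
// direct_convolution_sketch.cpp (illustrative only)
#include <cstddef>
#include <vector>
// Naive direct 2D convolution (valid padding, stride 1, single channel).
std::vector<float> direct_conv2d(const std::vector<float>& in, std::size_t in_h, std::size_t in_w,
                                 const std::vector<float>& k, std::size_t k_h, std::size_t k_w)
{
    const std::size_t out_h = in_h - k_h + 1;
    const std::size_t out_w = in_w - k_w + 1;
    std::vector<float> out(out_h * out_w, 0.0f);
    for (std::size_t y = 0; y < out_h; ++y)
        for (std::size_t x = 0; x < out_w; ++x)
        {
            float acc = 0.0f;
            for (std::size_t ky = 0; ky < k_h; ++ky)
                for (std::size_t kx = 0; kx < k_w; ++kx)
                    acc += in[(y + ky) * in_w + (x + kx)] * k[ky * k_w + kx];
            out[y * out_w + x] = acc;
        }
    return out;
}
int main()
{
    const std::vector<float> in = {1, 2, 3,
                                   4, 5, 6,
                                   7, 8, 9};           // 3x3 input
    const std::vector<float> k = {1, 0,
                                  0, 1};               // 2x2 kernel
    const auto out = direct_conv2d(in, 3, 3, k, 2, 2); // 2x2 output
    return out[0] == 6.0f ? 0 : 1;                     // 1*1 + 5*1 == 6
}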
For our information: According to this answer on Stack Overflow, Eigen::SpatialConvolution, as implemented in the TensorFlow code, is, at its core, just a normal im2col convolution that uses Eigen's GEMM implementation. So it's no big surprise that it shows a similar performance profile to fdeep's convolution (also im2col). :smile:
At least according to my profiler, it's spending the vast majority of the time there.
Hi @Dobiasd! Which profiler did you use?
sysprof
cheers
@Dobiasd sorry to ask this again, this is totally unrelated to the issue. Can you get sysprof to work with a single application? If I hook to 1 or 2 processes, the number of samples never exceeds 0...
I usually just run sudo sysprof and let it profile my system globally. The results are mostly good enough ™️. 🙂
Hmm. OKidok. Don't really understand why it doesn't work with a single application. It seems there aren't that many good C++ profilers out there, which is surprising. I found google/orbit, but can't get it to build.
FYI, using perf with https://github.com/KDAB/hotspot works really well
I noticed that Eigen 3.3 has unsupported modules, including modules for Tensors and GEMM operations.
https://bitbucket.org/eigen/eigen/src/9b065de03d016d802a25366ff5f0055df6318121/unsupported/Eigen/CXX11/src/Tensor/README.md?at=default#markdown-header-convolutions
I noticed you implement your own GEMM operation in fdeep/convolution.hpp in function convolve_im2col. This could be improved by using GEMM functions from the Eigen unsupported modules.
I ran a test by inferring the UNet model from pix2pix in frugally-deep. It took 18 s, compared to a model converted from ONNX and inferred in OpenCV, which took 3 s. I think this shows that convolutions in frugally-deep could be improved.
Thanks