Maratyszcza / caffe-nnpack

Caffe with NNPACK integration

Implementation details and scope for performance improvements #6

Closed: anijain2305 closed this issue 8 years ago

anijain2305 commented 8 years ago

Hi,

First of all, this is an amazing effort. I am using this library for my research, where I am investigating the micro-architectural bottlenecks of DNN applications on CPUs. This will lead to ideas for redesigning CPUs to get better performance on DNN workloads. Your library does deliver good numbers in multiple scenarios. I would like to congratulate you on your efforts.

Over the last month, I have performed a thorough evaluation of NNPACK across 4-5 large networks with varying batch sizes, and also analyzed multi-threading. With the current implementation, I observe that there is no clear winner between the GEMM and Winograd/FFT implementations across all scenarios. For small batch sizes, GEMM is better; for large batch sizes, NNPACK provides better performance. GEMM also seems to be friendlier to multi-threading.

It would be really helpful if you could provide the implementation details, as the Nervana designers did in their arXiv paper. My next steps require me to understand those details and reason about tile sizes, memory access patterns, and throughput, which is difficult to do clearly from the code alone.

Finally, I have a few questions about the implementation. I might be asking the wrong questions here, as my understanding of transform-based algorithms is very recent (a few hours old :) ).

1) It seems that the code uses cxgemm (complex GEMM) even for Winograd transforms. If I understand correctly, Winograd does not have to go through any complex multiplications. Am I misunderstanding something here?

2) Can you also tell me how the input image is stored in memory? Is it NCHW, as is generally used in GEMM implementations, where N = batch size, C = channels, H = height, and W = width? Or is it CHWN, as presented in the arXiv paper from Nervana (https://arxiv.org/abs/1509.09308)? (A small indexing sketch contrasting the two layouts follows these questions.)

3) Finally, what is the scope for improvement here? Do you think the Winograd/FFT implementations are close to the best possible on a CPU? I went through this interesting discussion (https://www.reddit.com/r/MachineLearning/comments/4bswi6/nnpack_acceleration_package_for_neural_networks/#bottom-comments), and it sounds like smaller Winograd tiles should beat everything, even for small batch sizes. If that is the case, it changes how we should think about improving CPU micro-architecture. Do you plan to work on it?
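
For reference, here is the index arithmetic for the two layouts I am asking about (a minimal sketch of my own, not NNPACK code):

```c
#include <stddef.h>

/* NCHW (Caffe-style): width is the innermost, contiguous dimension. */
static size_t nchw_index(size_t n, size_t c, size_t h, size_t w,
                         size_t C, size_t H, size_t W) {
    return ((n * C + c) * H + h) * W + w;
}

/* CHWN (Neon-style, per the Nervana paper): the batch dimension is innermost,
 * so corresponding pixels of different images in a batch sit next to each other. */
static size_t chwn_index(size_t n, size_t c, size_t h, size_t w,
                         size_t H, size_t W, size_t N) {
    return ((c * H + h) * W + w) * N + n;
}
```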

Maratyszcza commented 8 years ago
  1. cxgemm is a tuple-based complex matrix-matrix multiplication micro-kernel. Please see the slides from my BLIS Retreat 2016 talk for details. cxgemm is only used with the FFT-based algorithm; the Winograd-based algorithm uses sxgemm instead (a minimal illustration follows this list).
  2. NNPACK uses NCHW format - the same as Caffe, Torch & Theano (but different from TensorFlow and Neon).
  3. nnp_convolution_inference is close to optimum, at least in single-threaded execution. Other convolution functions are behind in optimizations. No plans to implement smaller Winograd tiles.
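
To illustrate the distinction in point 1 (a sketch for illustration only, not NNPACK's actual micro-kernel code): Winograd-domain coefficients are real, so each accumulation is a plain single-precision multiply-add, whereas FFT-domain coefficients are complex:

```c
#include <complex.h>
#include <stddef.h>

/* Winograd path (sxgemm-style): per-element accumulation over real coefficients. */
static void real_dot_acc(float *acc, const float *a, const float *b, size_t k) {
    for (size_t i = 0; i < k; i++) {
        *acc += a[i] * b[i];          /* single-precision multiply-add */
    }
}

/* FFT path (cxgemm-style): per-element accumulation over complex coefficients. */
static void complex_dot_acc(float complex *acc, const float complex *a,
                            const float complex *b, size_t k) {
    for (size_t i = 0; i < k; i++) {
        *acc += a[i] * b[i];          /* complex multiply-add */
    }
}
```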
anijain2305 commented 8 years ago

Thanks for the comments and the slides.

Also, my characterization was based on the convolution-output implementation, since caffe-nnpack links directly to the output functions. So I think I went down the wrong path.

Can you tell me how to link to convolution-inference? Is my current understanding, that caffe-nnpack links to convolution-output, incorrect?
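
To be concrete, this is the kind of dispatch I have in mind (a hypothetical sketch, not actual caffe-nnpack code; run_nnpack_output() and run_nnpack_inference() are stand-ins for calls to nnp_convolution_output() and nnp_convolution_inference() with the layer parameters filled in):

```c
typedef struct {
    int batch_size;
    /* ... remaining layer parameters (channels, sizes, data pointers) ... */
} conv_layer_t;

/* Stand-ins for the real NNPACK calls. */
static void run_nnpack_output(const conv_layer_t *layer)    { (void) layer; /* nnp_convolution_output(...)    */ }
static void run_nnpack_inference(const conv_layer_t *layer) { (void) layer; /* nnp_convolution_inference(...) */ }

/* Dispatch on batch size: the inference entry point targets single-image
 * forward passes, while the output entry point handles full batches. */
static void forward_conv(const conv_layer_t *layer) {
    if (layer->batch_size == 1) {
        run_nnpack_inference(layer);
    } else {
        run_nnpack_output(layer);
    }
}
```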

Maratyszcza commented 8 years ago

@anijain2305 Please see the nnpack-pr branch in ajtulloch/caffe, which is a newer version of the Caffe bindings. caffe-nnpack hasn't been updated in a long time.