Closed niczky12 closed 4 years ago
Knet7 had a CPU conv implementation written by Onur Kuru in https://github.com/denizyuret/Knet.jl/blob/master/deprecated/src7/util/conv_pool_cpu.jl
This has not been ported to or tested on Knet8 yet; it is on the todo list.
On Wed, Nov 2, 2016 at 7:03 PM niczky12 notifications@github.com wrote:
Just wondering, but is there a way to use `conv` and `pool` without a GPU? I'm running a Windows machine, and even though I have an NVIDIA card installed, I failed to install CUDA. If any of you have tips on how to get this working, that would be appreciated.
Thanks!
Some experimental code is in the cpuconv branch. Not all padding/stride options are supported. Slow and not fully tested.
Onur's latest CPU conv code: https://github.com/kuruonur1/CNN.jl
This is incorporated in the latest master. We can try to make it more efficient. We should also find open source kernels to try, from ArrayFire, Nervana, etc., both to replace cuDNN and to inform more efficient CPU implementations. I am keeping this issue open for ongoing work.
Mocha.jl has CPU implementations; we should check out their speed.
Working on integrating Mocha CPU conv/pool under mochaconv branch.
Mocha CPU conv/pool kernels have been integrated. They utilize multiple cores using OpenMP. I don't think the CPU conv/pool speed is going to get much better; they are about 10x slower than the GPU. It may be possible to have a single im2col operation instead of one for each image.
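The im2col + gemm idea can be sketched in plain Julia: im2col unrolls every k×k patch of the image into a column, so one matrix multiplication covers all patches at once. This is a minimal single-channel, stride-1, "valid"-padding sketch; `im2col` and `conv_gemm` are illustrative names, not Knet's actual kernels.

```julia
# Unroll every k×k patch of a 2-D image into a column of `cols`,
# so convolution reduces to a single matrix multiplication (gemm).
function im2col(x::AbstractMatrix, k::Int)
    h, w = size(x)
    oh, ow = h - k + 1, w - k + 1               # "valid" output size, stride 1
    cols = Matrix{eltype(x)}(undef, k * k, oh * ow)
    for j in 1:ow, i in 1:oh
        cols[:, (j - 1) * oh + i] = vec(x[i:i+k-1, j:j+k-1])
    end
    return cols
end

# Cross-correlation (what DL frameworks call "conv") as one gemm:
# the flattened filter row times the patch-column matrix.
function conv_gemm(x::AbstractMatrix, w::AbstractMatrix)
    k = size(w, 1)
    y = reshape(w, 1, :) * im2col(x, k)         # one gemm does all patches
    return reshape(y, size(x, 1) - k + 1, size(x, 2) - k + 1)
end
```

Batching the im2col over all images before the gemm, rather than calling it once per image, is the saving speculated about above.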
I am leaving this issue open for now to see if (1) we can find better cpu kernels, (2) we can find better open source gpu kernels to replace cudnn.
For CPU, you can look at what we did for n-dimensional convolutions (we used the name conv2 when we should have used convnd) in Seep.jl. We are currently looking into using CudaNative.jl and LLVM for julia-0.6 to produce efficient GPU kernels.
That's great news! I would love to try some open source gpu kernels when you guys have something ready to test. I haven't looked at CudaNative yet, but if I can help with benchmarking etc. let me know.
For CPU, Onur's implementation also used conv2, but it was too slow. In the latest release I adapted the C++ kernels from Mocha.jl, which use OpenMP and are pretty fast. See Knet.jl/prof/conv.jl for some benchmarking results; we should compare with the Seep.jl implementation.
Thanks for the CPU references. I had meant that we extended the name conv2 when in fact it is an N-dimensional implementation. We avoided doing an im2col operation because it uses too much memory when building the graph. We haven't done much benchmarking, and we are also very limited in our ability to release code updates.
You should also look at ImageFiltering.jl. Tim Holy has made a lot of optimizations for doing efficient convolutions on images with imfilter. No gradients, though.
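The buffer-free alternative mentioned above can be sketched as a direct N-dimensional cross-correlation in plain Julia, with no im2col scratch memory; `convnd` is an illustrative name, not the Seep.jl implementation.

```julia
# Direct "valid", stride-1, N-dimensional cross-correlation.
# Works for any dimensionality via CartesianIndices and allocates
# nothing beyond the output array (no im2col buffer).
function convnd(x::AbstractArray{T,N}, w::AbstractArray{T,N}) where {T,N}
    y = zeros(T, size(x) .- size(w) .+ 1)
    for o in CartesianIndices(y), k in CartesianIndices(w)
        y[o] += x[o + k - oneunit(k)] * w[k]
    end
    return y
end
```

The same function handles 1-D signals, 2-D images, and 3-D volumes, which is the point of naming it convnd rather than conv2.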
The latest benchmarks from @ilkarman (https://github.com/ilkarman/DeepLearningFrameworks) show our CPU implementation to be quite inefficient. A new thread (https://discourse.julialang.org/t/on-machine-learning-and-programming-languages/7574/30) suggests alternatives. We need volunteers to reimplement the CPU convolution operations using Intel MKL.
The dynet-benchmarks by @ilkerkesen also show a similar trend for our CPU counterparts of the cuDNN RNN kernels. Knet compares very well to Chainer and DyNet on the GPU, but the CPU performance is lacking. A similar volunteer effort is needed there.
Also see fb.me/83w6aHEJO with Onur's summary from 3/28/16: They use Fourier or Winograd transforms for convolution. For several network configurations, they show it runs 2 to 4 times faster than im2col + gemm (what we use). The repo is here: https://github.com/Maratyszcza/NNPACK It is written in C and can be compiled and called from Julia. However, it currently has two limitations:
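The Winograd idea behind those speedups can be illustrated with its smallest instance, F(2,3): two outputs of a 3-tap 1-D filter computed with 4 multiplications instead of the 6 a direct dot product needs. This is a textbook sketch, not NNPACK's code.

```julia
# Winograd F(2,3): given 4 input samples d and a 3-tap filter g,
# produce 2 outputs of the sliding dot product using 4 multiplies.
# (Direct computation would need 2 * 3 = 6 multiplies.)
function winograd_f23(d::AbstractVector, g::AbstractVector)
    @assert length(d) == 4 && length(g) == 3
    m1 = (d[1] - d[3]) * g[1]
    m2 = (d[2] + d[3]) * (g[1] + g[2] + g[3]) / 2
    m3 = (d[3] - d[2]) * (g[1] - g[2] + g[3]) / 2
    m4 = (d[2] - d[4]) * g[3]
    return [m1 + m2 + m3,      # = d[1]g[1] + d[2]g[2] + d[3]g[3]
            m2 - m3 - m4]      # = d[2]g[1] + d[3]g[2] + d[4]g[3]
end
```

The filter-dependent factors can be precomputed once per filter, so in a conv layer the extra additions are amortized over the whole feature map.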
Hey, jumping into the thread here. Any current plans for addressing the speed on CPU when using Knet? I really like Knet as it's really native to julia and nice to work with. However, I'm stuck with CPU for a while and would like to get MXNet.jl performance if possible.
Nobody is actively working on this right now, we are looking for volunteers...
Maybe the code from https://github.com/CNugteren/CLBlast could be helpful as an alternative to BLAS and clBLAS. This code supports FP16 compute. For convolutions using matrix multiplies, see https://arxiv.org/abs/1704.04428.
https://github.com/intel/mkl-dnn may be a good solution?
https://discourse.julialang.org/t/knet-vs-flux-etc/17057/10?u=denizyuret shows that Flux is faster in CPU convolutions. Mike Innes says: "(Flux uses) NNlib’s pure-Julia convolutions vs Knet’s threaded C++ ones, although NNlib is soon to move to NNPACK".
There is a Julia wrapper for NNPACK intended to be used in NNlib.jl for Flux: https://github.com/avik-pal/NNPACK.jl
The problem with NNPACK is that for small batch sizes it is slower than NNlib.jl's native Julia conv: https://github.com/FluxML/NNlib.jl/pull/67#issuecomment-442813706
Similarly, NNPACK is also slower than PyTorch's conv at small batch sizes.
https://github.com/pytorch/pytorch/pull/2826#issuecomment-333221184
Apparently they don't utilize NNPACK for now. But if they do, it seems they will resort to a heuristic-based approach that switches between the default conv and the NNPACK implementation depending on input parameters such as batch size and number of channels.
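Such a heuristic might look like the following sketch. The function name, threshold values, and the `:nnpack`/`:default` labels are all illustrative placeholders, not PyTorch's or NNlib's actual logic.

```julia
# Hypothetical backend selector: NNPACK's transform costs only amortize
# over enough work, so fall back to the default conv for small inputs.
# Thresholds here are made up for illustration, not measured values.
function choose_conv_backend(x::AbstractArray{T,4};
                             batch_threshold = 16,
                             channel_threshold = 8) where {T}
    nbatch = size(x, 4)    # W×H×C×N layout: batch is the last dimension
    nchan  = size(x, 3)
    return (nbatch >= batch_threshold && nchan >= channel_threshold) ?
        :nnpack : :default
end
```

A real implementation would calibrate the thresholds by benchmarking both backends across representative shapes.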
There are other problems with NNPACK:
- It does not support 3D conv and pooling: https://github.com/Maratyszcza/NNPACK/issues/138#issue-316336389
- It does not support strided convolution for training: https://github.com/Maratyszcza/NNPACK/issues/139#issue-319679698
@cemilcengiz, we are trying to pass CI tests on Windows, ARM, etc. with @ianshmean, and the CPU conv kernels are causing trouble. (1) Is NNlib's pure-Julia implementation comparable in speed to our CPU kernels? (2) Does NNPACK require any compiling or library installations? (3) Has there been any progress or improvement in any of the solutions mentioned above (mkl-dnn, Seep.jl, ImageFiltering.jl, CLBlast)?
My current concern is for ease of installation rather than speed. So if it is not too much slower, I'd like to go with a pure Julia solution.
https://github.com/denizyuret/Knet.jl/pull/494 switches to NNlib for CPU conv/pool.
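For reference, NNlib's CPU conv takes arrays in W×H×C×N layout with the filter as W×H×Cin×Cout; a minimal usage sketch (assuming the NNlib package is installed):

```julia
using NNlib  # pure-Julia CPU conv/pool used by Flux

# 4×4 single-channel image, batch of 1, in W×H×C×N layout.
x = reshape(Float64.(1:16), 4, 4, 1, 1)
# 3×3 all-ones filter; with a symmetric kernel, convolution and
# cross-correlation coincide, so kernel flipping does not matter here.
w = ones(Float64, 3, 3, 1, 1)

y = NNlib.conv(x, w)   # defaults: pad=0, stride=1 → 2×2×1×1 "valid" output
```

Each output element is the sum of one 3×3 patch of `x`, the same result the C++ Mocha kernels produced before the switch.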