denizyuret / Knet.jl

Koç University deep learning framework.
https://denizyuret.github.io/Knet.jl/latest

Using conv and pool without CUDA #33

Closed: niczky12 closed this issue 4 years ago

niczky12 commented 7 years ago

Just wondering, is there a way to use conv and pool without a GPU? I'm running a Windows machine, and even though I have an NVIDIA card installed, I failed to install CUDA. If anyone has tips on how to get this working, that would be appreciated.

Thanks!

denizyuret commented 7 years ago

Knet7 had a CPU conv implementation written by Onur Kuru in https://github.com/denizyuret/Knet.jl/blob/master/deprecated/src7/util/conv_pool_cpu.jl

This has not yet been ported to or tested on Knet8; it is on the todo list.


denizyuret commented 7 years ago

There is some experimental code in the cpuconv branch. Not all padding/stride options are supported, and it is slow and not fully tested.

denizyuret commented 7 years ago

Onur's latest cpu conv code: https://github.com/kuruonur1/CNN.jl

denizyuret commented 7 years ago

This is now incorporated in the latest master. We can try to make it more efficient. We should also look for open-source kernels to try (from ArrayFire, Nervana, etc.), both to replace cuDNN and to inform more efficient CPU implementations. I am keeping this issue open for ongoing work.

denizyuret commented 7 years ago

Mocha.jl has CPU implementations; we should check out their speed.

denizyuret commented 7 years ago

Working on integrating Mocha CPU conv/pool under mochaconv branch.

denizyuret commented 7 years ago

The Mocha CPU conv/pool kernels have been integrated. They utilize multiple cores using OpenMP. I don't think the CPU conv/pool speed is going to get much better; they are about 10x slower than the GPU. It may be possible to use a single im2col operation instead of one per image.
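For reference, here is a minimal, unoptimized Julia sketch of the im2col + GEMM idea for a single-channel 2-D image (stride 1, no padding). The function names are illustrative only; this is not Knet's actual kernel:

```julia
# Minimal im2col + GEMM convolution sketch (single image, single channel,
# stride 1, no padding). Illustrative only; Knet's CPU kernel is written in C++.
function im2col(x::Matrix{T}, kh::Int, kw::Int) where {T}
    H, W = size(x)
    oh, ow = H - kh + 1, W - kw + 1
    cols = Matrix{T}(undef, kh * kw, oh * ow)
    for j in 1:ow, i in 1:oh                    # each output pixel becomes one column
        cols[:, (j - 1) * oh + i] = vec(x[i:i+kh-1, j:j+kw-1])
    end
    return cols, oh, ow
end

function conv_im2col(x::Matrix{T}, w::Matrix{T}) where {T}
    kh, kw = size(w)
    cols, oh, ow = im2col(x, kh, kw)
    y = vec(w)' * cols                          # the whole convolution becomes one matrix product
    return reshape(collect(y), oh, ow)
end

x = rand(Float32, 28, 28)
w = rand(Float32, 3, 3)
y = conv_im2col(x, w)                           # 26x26 output (cross-correlation form)
```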

I am leaving this issue open for now to see if (1) we can find better CPU kernels, and (2) we can find better open-source GPU kernels to replace cuDNN.

jgbos commented 7 years ago

For CPU, you can look at what we did for N-dimensional convolutions (we kept the name conv2 even though it should really have been convnd) in Seep.jl here. We are currently looking into using CUDAnative.jl and LLVM for Julia 0.6 to produce efficient GPU kernels.

denizyuret commented 7 years ago

That's great news! I would love to try some open-source GPU kernels when you guys have something ready to test. I haven't looked at CUDAnative yet, but if I can help with benchmarking etc., let me know.

For the CPU, Onur's implementation also used conv2, but it was too slow. In the latest release I adapted the C++ kernels from Mocha.jl, which use OpenMP and work pretty fast. See Knet.jl/prof/conv.jl for some benchmarking results; we should compare with the Seep.jl implementation.
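For anyone who wants to reproduce a rough version of that comparison locally, something along these lines should work; the array sizes below are arbitrary examples, not the ones used in prof/conv.jl:

```julia
# Rough CPU conv/pool timing sketch with Knet's conv4/pool on plain Arrays.
using Knet, BenchmarkTools

x = rand(Float32, 28, 28, 3, 64)   # W x H x C x N input batch
w = rand(Float32, 5, 5, 3, 16)     # 5x5 filters, 3 input channels, 16 output channels

@btime conv4($w, $x)               # CPU convolution
@btime pool($x; window=2)          # CPU 2x2 max pooling
```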

jgbos commented 7 years ago

Thanks for the CPU references. I had meant that we extended the name conv2 when in fact it is an N-dimensional implementation. We avoided doing an im2col operation because it uses too much memory when building the graph. We haven't done much benchmarking, and we are also very limited in our ability to release code updates.
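For contrast, a direct loop-based convolution avoids materializing the im2col buffer at the cost of less GEMM-friendly memory access; a minimal 2-D sketch (not Seep.jl's actual code):

```julia
# Direct loop-based 2-D cross-correlation (stride 1, no padding).
# No im2col buffer is materialized, trading memory use for looser memory locality.
function conv_direct(x::Matrix{T}, w::Matrix{T}) where {T}
    H, W = size(x)
    kh, kw = size(w)
    y = zeros(T, H - kh + 1, W - kw + 1)
    @inbounds for j in axes(y, 2), i in axes(y, 1)
        s = zero(T)
        for q in 1:kw, p in 1:kh
            s += x[i + p - 1, j + q - 1] * w[p, q]
        end
        y[i, j] = s
    end
    return y
end
```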

You should also look at ImageFiltering.jl. Tim Holy has made a lot of optimizations for doing efficient convolutions on images with imfilter. No gradients though.
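A minimal usage sketch of that approach, assuming a recent ImageFiltering.jl (forward pass only, since there are no gradients):

```julia
# CPU filtering of a single 2-D image with ImageFiltering.jl (no gradients).
using ImageFiltering

img  = rand(Float32, 256, 256)
kern = centered(rand(Float32, 3, 3))  # center the kernel's index range around zero

out = imfilter(img, kern)             # correlation; use imfilter(img, reflect(kern)) for true convolution
```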

denizyuret commented 6 years ago

The latest benchmarks from @ilkarman (https://github.com/ilkarman/DeepLearningFrameworks) show our CPU implementation to be quite inefficient. There is a new thread at https://discourse.julialang.org/t/on-machine-learning-and-programming-languages/7574/30 suggesting alternatives. We need volunteers to reimplement the CPU convolution operations using Intel MKL.

The DyNet benchmarks by @ilkerkesen also show a similar trend for our CPU implementation of the cuDNN RNN kernels. Knet compares very well to Chainer and DyNet on the GPU, but the CPU performance is lacking. A similar volunteer effort is needed there.

denizyuret commented 6 years ago

Also see fb.me/83w6aHEJO with Onur's summary from 3/28/16: They use Fourier or Winograd transforms for convolution. For several network configurations they show speedups of 2x to 4x over im2col + gemm (which is what we use). The repo is here: https://github.com/Maratyszcza/NNPACK. It is written in C and can be compiled and called from Julia. However, it currently has two limitations:

DoktorMike commented 6 years ago

Hey, jumping into the thread here. Are there any current plans for addressing the CPU speed when using Knet? I really like Knet as it's native to Julia and nice to work with. However, I'm stuck with the CPU for a while and would like to get MXNet.jl-level performance if possible.

denizyuret commented 6 years ago

Nobody is actively working on this right now, we are looking for volunteers...


davidbp commented 6 years ago

Maybe the code from https://github.com/CNugteren/CLBlast could be helpful as an alternative to BLAS/clBLAS. It supports FP16 compute. For convolutions implemented as matrix multiplies, see https://arxiv.org/abs/1704.04428.

denizyuret commented 5 years ago

https://github.com/intel/mkl-dnn may be a good solution?

denizyuret commented 5 years ago

https://discourse.julialang.org/t/knet-vs-flux-etc/17057/10?u=denizyuret shows that Flux is faster in CPU convolutions. Mike Innes says: "(Flux uses) NNlib’s pure-Julia convolutions vs Knet’s threaded C++ ones, although NNlib is soon to move to NNPACK".
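For reference, a minimal sketch of what the NNlib CPU path looks like (array sizes are arbitrary; note that the argument order differs from Knet's conv4):

```julia
# Pure-Julia CPU convolution and pooling via NNlib (the backend Flux uses).
using NNlib

x = rand(Float32, 28, 28, 3, 64)   # W x H x C x N
w = rand(Float32, 5, 5, 3, 16)     # kW x kH x Cin x Cout

y = conv(x, w)                     # NNlib takes (input, weights); Knet's conv4 takes (weights, input)
p = maxpool(y, (2, 2))             # 2x2 max pooling
```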

cemilcengiz commented 5 years ago

There is a Julia wrapper for NNPACK intended to be used in NNlib.jl for Flux: https://github.com/avik-pal/NNPACK.jl

The problem with NNPACK is that for small batch sizes it is slower than NNlib.jl's native Julia conv: https://github.com/FluxML/NNlib.jl/pull/67#issuecomment-442813706

Similarly, NNPACK is also slower than PyTorch's conv at small batch sizes: https://github.com/pytorch/pytorch/pull/2826#issuecomment-333221184

Apparently they don't use NNPACK for now. But if they do, it seems they will resort to a heuristic-based approach that switches between the default conv and the NNPACK implementation depending on input parameters such as batch size and number of channels.
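A sketch of what such a heuristic dispatch could look like; the thresholds and the conv_nnpack/conv_default names below are purely hypothetical placeholders, not an existing API:

```julia
# Hypothetical heuristic for picking a conv backend from batch size and channels.
# conv_nnpack and conv_default are placeholder names, not real APIs.
function conv_auto(x, w; batch_threshold = 16, channel_threshold = 8)
    batchsize = size(x, ndims(x))      # last input dimension is the batch
    channels  = size(w, ndims(w))      # last filter dimension is output channels
    if batchsize >= batch_threshold && channels >= channel_threshold
        return conv_nnpack(x, w)       # large problems: transform-based NNPACK kernels tend to win
    else
        return conv_default(x, w)      # small batches: the native implementation is faster
    end
end
```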

There are other problems with NNPACK:

- It does not support 3D conv and pooling: https://github.com/Maratyszcza/NNPACK/issues/138#issue-316336389
- It does not support strided convolution for training: https://github.com/Maratyszcza/NNPACK/issues/139#issue-319679698

denizyuret commented 5 years ago

@cemilcengiz, we are trying to pass CI tests on Windows, ARM, etc. with @ianshmean, and the CPU conv kernels are causing trouble. (1) Is NNlib's pure-Julia implementation comparable in speed to our CPU kernels? (2) Does NNPACK require any compilation or library installation? (3) Has there been any progress or improvement in any of the solutions mentioned above (mkl-dnn, Seep.jl, ImageFiltering.jl, CLBlast)?

My current concern is ease of installation rather than speed, so if it is not too much slower, I'd like to go with a pure-Julia solution.

denizyuret commented 5 years ago

https://github.com/denizyuret/Knet.jl/pull/494 switches to NNlib for CPU conv/pool.
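With that change, the same Knet calls should run on plain CPU Arrays; a minimal sketch assuming the behavior described in that PR:

```julia
# Same Knet API on the CPU: conv4/pool on plain Arrays, no CUDA required.
using Knet

x = rand(Float32, 28, 28, 1, 8)    # W x H x C x N batch on the CPU
w = rand(Float32, 3, 3, 1, 4)      # 3x3 filters, 1 input channel, 4 output channels

y = pool(conv4(w, x; padding=1))   # convolution followed by 2x2 max pooling
```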