jessebett opened this issue 5 years ago
As expected, GPU is a completely different story. FastConv can be extended to GPU arrays, which I've done in this PR to FastConv.jl. However, FastConv requires scalar getindex operations, which slow things down considerably on GPU. NNlib on GPU is clearly fine, so this issue is about the implementation of convolutions on CPU.
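The scalar-indexing problem can be surfaced explicitly; a minimal sketch using CuArrays' scalar-indexing guard:

```julia
using CuArrays

# Make any scalar getindex on a GPU array throw instead of silently
# falling back to a slow one-element-at-a-time transfer.
CuArrays.allowscalar(false)

x = cu(randn(10, 10))
# x[1, 1]  # would now error, exposing the scalar-indexing path FastConv hits
```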
```julia
using CuArrays
using FastConv
using NNlib
using BenchmarkTools

x = randn(500, 500, 1, 1) |> cu
spatial_dims = (5, 5)
k = randn(spatial_dims..., 1, 1) |> cu
cdims = DenseConvDims(x, k; padding=spatial_dims .- 1)

# NNlib on GPU
@btime CuArrays.@sync conv(x, k, cdims);
# 224.942 μs (87 allocations: 4.05 KiB)

# FastConv on CPU
@btime convn($(collect(x)), $(collect(k)));
# 8.431 ms (8 allocations: 992.86 KiB)
```
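Since the issue is about the CPU path, NNlib's im2col-based `conv` can be timed on the same arrays for a CPU-to-CPU comparison (a sketch; timings are machine-dependent, so none are shown):

```julia
using FastConv
using NNlib
using BenchmarkTools

x = randn(500, 500, 1, 1)
spatial_dims = (5, 5)
k = randn(spatial_dims..., 1, 1)
cdims = DenseConvDims(x, k; padding=spatial_dims .- 1)

# NNlib's CPU convolution with the same "full" padding
@btime conv($x, $k, $cdims);

# FastConv's direct CPU convolution
@btime convn($x, $k);
```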
cc @staticfloat
When I looked at ImageFiltering.jl, the main issue was that it wasn't designed to support / scale well across large channel dimensions. Not sure if FastConv has the same issue; given the speedup here it could be useful to just dispatch to it for the cases where it makes sense.
Should be quite easy to hook up for anyone interested.
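One hypothetical sketch of such a dispatch (`maybe_fast_conv` is an illustrative name, not an NNlib API, and it assumes `cdims` encodes the full padding that matches FastConv's output shape):

```julia
using NNlib
using FastConv

# Hypothetical helper (not part of NNlib): route single-channel, single-batch
# CPU calls to FastConv's direct convolution, fall back to NNlib otherwise.
# Assumes cdims was built with padding = kernel size .- 1 ("full" convolution),
# so the two paths produce the same output shape.
function maybe_fast_conv(x::Array{<:Real,4}, k::Array{<:Real,4}, cdims::DenseConvDims)
    if size(x, 3) == 1 && size(x, 4) == 1 && size(k, 3) == 1 && size(k, 4) == 1
        return convn(x, k)        # FastConv's direct method
    end
    return conv(x, k, cdims)      # NNlib's im2col path
end
```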
@staticfloat I was gonna try comparing with multiple channels and batches, but it looks like you're right and FastConv doesn't have this (or possibly I'm using it incorrectly):
```julia
using FastConv
using NNlib
using BenchmarkTools

# 3 channels, 1 batch
x = randn(500, 500, 3, 1);
spatial_dims = (5, 5);
k = randn(spatial_dims..., 3, 1);
cdims = DenseConvDims(x, k; padding=spatial_dims .- 1);

fast_y = convn(x, k);
nnlib_y = conv(x, k, cdims);

fast_y |> size  # (504, 504, 5, 1)
nnlib_y |> size # (504, 504, 1, 1)
```
Yeah, FastConv doesn't support multiple channels; what you're doing there is a 3-d convolution instead, so it ends up increasing the 3rd dimension to 3 + 3 - 1 = 5. Difficult to compare apples to apples here.
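The shapes above follow from the full-convolution size rule, n + k - 1 per convolved dimension (a quick check):

```julia
# Output length of a "full" (zero-padded) convolution along one dimension
full_conv_size(n, k) = n + k - 1

# FastConv treats every leading dimension as spatial, so the channel
# dimension grows too: (500, 500, 3) convolved with (5, 5, 3)
@assert full_conv_size.((500, 500, 3), (5, 5, 3)) == (504, 504, 5)

# NNlib contracts over channels, so only the spatial dims follow the rule
@assert (full_conv_size(500, 5), full_conv_size(500, 5)) == (504, 504)
```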
You may be interested in https://github.com/FluxML/NNlib.jl/pull/142
@staticfloat nice!
Convolutions provided by the FastConv package, described in their paper, considerably outperform the backends for 1-D and 2-D convolutions, at least on CPU.