From a bit of testing, it looks like `unfold_input` is fast enough (152.2s without vs 171.8s with) but `unfold_output` is slow (65.9s without vs 171.5s with), both measured over 1000 tests. I restructured it so each element of the image is iterated over only once, with the kernel window handled in an inner loop; this reduces it to 86.5s with the new backward. A sketch of the idea is below. I'll do a pull request once I test it.
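For reference, a minimal sketch of that restructuring in CUDA: one thread per input element, with the kernel window walked in an inner loop so each image element is visited exactly once. All names and parameters are illustrative (this is not the actual dfdx kernel), and it assumes a single-image layout for brevity.

```cuda
#include <cuda_runtime.h>

// Hypothetical sketch, not dfdx's kernel: one thread per input element,
// accumulating every patch gradient that touched it.
__global__ void unfold_backward_sketch(
    const float *grad_patches, // (C, K, K, OH, OW), contiguous
    float *grad_input,         // (C, H, W), contiguous
    int C, int H, int W, int K, int OH, int OW,
    int stride, int padding)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= C * H * W) return;

    int x = i % W;
    int y = (i / W) % H;
    int c = i / (W * H);

    float sum = 0.0f;
    // Walk every kernel offset that could have read pixel (y, x).
    for (int k1 = 0; k1 < K; k1++) {
        int oy = y + padding - k1;
        if (oy % stride != 0) continue;
        oy /= stride;
        if (oy < 0 || oy >= OH) continue;
        for (int k2 = 0; k2 < K; k2++) {
            int ox = x + padding - k2;
            if (ox % stride != 0) continue;
            ox /= stride;
            if (ox < 0 || ox >= OW) continue;
            sum += grad_patches[(((c * K + k1) * K + k2) * OH + oy) * OW + ox];
        }
    }
    grad_input[i] += sum;
}

// launch: unfold_backward_sketch<<<(C*H*W + 255) / 256, 256>>>(...);
```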
Awesome! Will check it out today
Also wondering if setting batch & chan as grid dimensions could help
Another thought: for unfolding, most of the time is probably spent doing the loop over k1/k2 and skipping invalid k1/k2 values. The same k1/k2 checks are probably happening in multiple places (specifically `batch * chan_out` times per pixel). Instead we should make sure the k1/k2 checks only happen once per pixel.
I.e., each kernel thread should do the loop over batch/chan itself.
Edit: this seems to have backfired; doing the above takes way longer. Edit 2: moving `chan_out` to a grid dimension fixes it (sketch below).
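For concreteness, here is roughly what that launch geometry looks like; a toy sketch assuming a contiguous (B, O, H, W) layout, with the real unfold body elided to a copy so the indexing is visible. Names are made up for illustration.

```cuda
#include <cuda_runtime.h>

// Hypothetical sketch: pixel index on grid.x, output channel on grid.y,
// and each thread looping over the batch. Any k1/k2 bounds logic for a
// pixel would run once per (pixel, channel), then be reused for every
// batch element.
__global__ void unfold_sketch(const float *image, float *out,
                              int batch, int chan, int h, int w)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x; // pixel within a channel
    int o = blockIdx.y;                            // channel, from grid.y
    if (i >= h * w) return;
    // (k1/k2 checks for pixel i would go here, outside the batch loop)
    for (int b = 0; b < batch; b++) {
        int idx = (b * chan + o) * h * w + i;
        out[idx] = image[idx];
    }
}

int main() {
    int batch = 8, chan = 16, h = 32, w = 32;
    float *img, *out;
    cudaMalloc(&img, sizeof(float) * batch * chan * h * w);
    cudaMalloc(&out, sizeof(float) * batch * chan * h * w);
    dim3 grid((h * w + 255) / 256, chan); // grid.y = chan_out
    unfold_sketch<<<grid, 256>>>(img, out, batch, chan, h, w);
    cudaDeviceSynchronize();
    cudaFree(img);
    cudaFree(out);
}
```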
I was able to parallelize the kernels as in the diff below, with each thread looping over the batch. However, `unfold_patches` is taking a lot longer now:
Diff is at https://github.com/coreylowman/dfdx/compare/par-conv2d-kernels
Okay, after reading through the CUDA C++ Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/), I have some ideas:
Currently the patches elements are ordered like:
`(B, C, K, K, OH, OW)`
`(B, O, K, K, H, W)`
Notably, this means two sequential threads work on pixel-space locations with completely different memory access patterns, as far as I can tell.
I'm wondering if instead we should transpose the patches and put channels at the end instead of before K * K, so the shapes would be:
`(B, OH, OW, K, K, C)`
`(B, H, W, K, K, O)`
Notably, this would mean the stride would be along the channel dimension of patches, meaning every thread in a warp (except on channel boundaries) would be processing the same element of the input & output image.
Then we can just use transposed gemms with cuBLAS.
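As a sketch of the cuBLAS side (dimensions and buffers here are hypothetical, not dfdx's actual call): passing `CUBLAS_OP_T` lets the gemm consume the transposed layout directly, without ever materializing a transposed copy of patches.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical: compute C (m x n) = A^T * B where A is stored k x m
// (i.e. "transposed"), by asking cuBLAS to transpose it on the fly.
int main() {
    int m = 64, n = 128, k = 256;
    float *a, *b, *c;
    cudaMalloc(&a, sizeof(float) * k * m); // A stored k x m, column-major
    cudaMalloc(&b, sizeof(float) * k * n); // B stored k x n, column-major
    cudaMalloc(&c, sizeof(float) * m * n); // C stored m x n, column-major

    cublasHandle_t handle;
    cublasCreate(&handle);
    float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                m, n, k, &alpha, a, k, b, k, &beta, c, m);

    cublasDestroy(handle);
    cudaFree(a); cudaFree(b); cudaFree(c);
}
```

(Compile with `nvcc -lcublas`.)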
Okay, I tried the above and it didn't make that much difference. Of the variants I tried, the second one made `sum_transposed_filters` much slower.
I'm thinking the biggest win here will be coalescing global memory access, so I'll need to study up on that section of the guide (9.2.1).
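For context, the pattern that section describes, in toy form (illustrative only): consecutive threads in a warp reading consecutive addresses are serviced by a few wide memory transactions, while a strided pattern scatters the warp across many cache lines.

```cuda
// Coalesced: thread t reads address t, so a warp covers one contiguous span.
__global__ void coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: adjacent threads read addresses `stride` floats apart, so a warp
// touches up to 32 separate cache lines.
__global__ void strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];
}
```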
I'm going to close this for now - I think we are as close as we are going to get here.
I think the next big optimization is going to be using cuDNN for this. I did some testing with PyTorch, and if you disable cuDNN we are actually pretty close to its performance!
Currently in convolutional networks, the `unfold_output` and `unfold_input` kernels take a significant portion of the time. This issue is about optimizing them to reduce this time.
My guess is there are some things we can do with multi-dimensional block/grid dims, as well as reducing the number of branches in the kernels; a rough sketch of the branch-reduction idea follows.
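To illustrate, a hedged sketch (assuming stride 1, a zero-initialized patches buffer, and illustrative names; this is not dfdx's kernel): clamping the k1/k2 loop bounds once per pixel removes the per-iteration bounds branches entirely.

```cuda
#include <cuda_runtime.h>

// Hypothetical: unfold with loop bounds clamped up front instead of a
// bounds check on every (k1, k2) iteration. Out-of-bounds patch entries
// are simply never written, so `patches` must start zeroed.
__global__ void unfold_clamped(const float *img, float *patches,
                               int H, int W, int OH, int OW,
                               int K, int padding)
{
    int ox = blockIdx.x * blockDim.x + threadIdx.x;
    int oy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ox >= OW || oy >= OH) return;

    // Valid offsets satisfy 0 <= oy + k1 - padding < H, so the branches
    // collapse into loop-limit arithmetic done once per pixel.
    int k1_lo = max(0, padding - oy), k1_hi = min(K, H + padding - oy);
    int k2_lo = max(0, padding - ox), k2_hi = min(K, W + padding - ox);
    for (int k1 = k1_lo; k1 < k1_hi; k1++)
        for (int k2 = k2_lo; k2 < k2_hi; k2++)
            patches[((k1 * K + k2) * OH + oy) * OW + ox] =
                img[(oy + k1 - padding) * W + (ox + k2 - padding)];
}

// launch: dim3 block(16, 16), grid((OW + 15) / 16, (OH + 15) / 16);
```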