From a bit of testing, it looks like `unfold_input` is fast enough (152.2s without vs 171.8s with) but `unfold_output` is slow (65.9s without vs 171.5s with), both measured over 1000 tests. I restructured it so each element of the image is iterated over only once, with the kernel window handled in an inner loop; this reduces it to 86.5s with the new backward. A sketch of the idea is below. I'll do a pull request once I test it.
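For reference, a minimal sketch of that restructuring in CUDA: one thread per input element, with the kernel window walked in an inner loop so each image element is visited exactly once. All names and parameters are illustrative (this is not the actual dfdx kernel), and it assumes a single-image layout for brevity.

```cuda
#include <cuda_runtime.h>

// Hypothetical sketch, not dfdx's kernel: one thread per input element,
// accumulating every patch gradient that touched it.
__global__ void unfold_backward_sketch(
    const float *grad_patches, // (C, K, K, OH, OW), contiguous
    float *grad_input,         // (C, H, W), contiguous
    int C, int H, int W, int K, int OH, int OW,
    int stride, int padding)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= C * H * W) return;

    int x = i % W;
    int y = (i / W) % H;
    int c = i / (W * H);

    float sum = 0.0f;
    // Walk every kernel offset that could have read pixel (y, x).
    for (int k1 = 0; k1 < K; k1++) {
        int oy = y + padding - k1;
        if (oy % stride != 0) continue;
        oy /= stride;
        if (oy < 0 || oy >= OH) continue;
        for (int k2 = 0; k2 < K; k2++) {
            int ox = x + padding - k2;
            if (ox % stride != 0) continue;
            ox /= stride;
            if (ox < 0 || ox >= OW) continue;
            sum += grad_patches[(((c * K + k1) * K + k2) * OH + oy) * OW + ox];
        }
    }
    grad_input[i] += sum;
}

// launch: unfold_backward_sketch<<<(C*H*W + 255) / 256, 256>>>(...);
```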
Awesome! Will check it out today
Also wondering if setting batch & chan as grid dimensions could help
Another thought: for unfolding, most of the time is probably spent doing the loop over k1/k2 and skipping invalid k1/k2 values. The same k1/k2 checks are probably happening in multiple places (specifically `batch * chan_out` times per pixel). Instead we should make sure the k1/k2 checks only happen once per pixel.
I.e., each kernel thread should do the loop over batch/chan itself.
Edit: this seems to have backfired; doing the above takes way longer. Edit 2: moving `chan_out` to a grid dimension fixes it (sketch below).
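For concreteness, here is roughly what that launch geometry looks like; a toy sketch assuming a contiguous (B, O, H, W) layout, with the real unfold body elided to a copy so the indexing is visible. Names are made up for illustration.

```cuda
#include <cuda_runtime.h>

// Hypothetical sketch: pixel index on grid.x, output channel on grid.y,
// and each thread looping over the batch. Any k1/k2 bounds logic for a
// pixel would run once per (pixel, channel), then be reused for every
// batch element.
__global__ void unfold_sketch(const float *image, float *out,
                              int batch, int chan, int h, int w)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x; // pixel within a channel
    int o = blockIdx.y;                            // channel, from grid.y
    if (i >= h * w) return;
    // (k1/k2 checks for pixel i would go here, outside the batch loop)
    for (int b = 0; b < batch; b++) {
        int idx = (b * chan + o) * h * w + i;
        out[idx] = image[idx];
    }
}

int main() {
    int batch = 8, chan = 16, h = 32, w = 32;
    float *img, *out;
    cudaMalloc(&img, sizeof(float) * batch * chan * h * w);
    cudaMalloc(&out, sizeof(float) * batch * chan * h * w);
    dim3 grid((h * w + 255) / 256, chan); // grid.y = chan_out
    unfold_sketch<<<grid, 256>>>(img, out, batch, chan, h, w);
    cudaDeviceSynchronize();
    cudaFree(img);
    cudaFree(out);
}
```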
I was able to parallelize the kernels as in the diff below, with each thread looping over the batch. However, `unfold_patches` is taking a lot longer now:
Diff is at https://github.com/coreylowman/dfdx/compare/par-conv2d-kernels
Okay, after reading through the CUDA C++ Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/), I have some ideas:
Currently the patches elements are ordered like:
`(B, C, K, K, OH, OW)`
`(B, O, K, K, H, W)`
Notably, this means two sequential threads work on pixel-space locations with completely different memory access patterns, as far as I can tell.
I'm wondering if instead we should transpose the patches and put channels at the end instead of before K * K, so the shapes would be:
`(B, OH, OW, K, K, C)`
`(B, H, W, K, K, O)`
Notably, this would mean the stride would be along the channel dimension of patches, meaning every thread in a warp (except on channel boundaries) would be processing the same element of the input & output image.
Then we can just use transposed gemms with cuBLAS.
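As a sketch of the cuBLAS side (dimensions and buffers here are hypothetical, not dfdx's actual call): passing `CUBLAS_OP_T` lets the gemm consume the transposed layout directly, without ever materializing a transposed copy of patches.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical: compute C (m x n) = A^T * B where A is stored k x m
// (i.e. "transposed"), by asking cuBLAS to transpose it on the fly.
int main() {
    int m = 64, n = 128, k = 256;
    float *a, *b, *c;
    cudaMalloc(&a, sizeof(float) * k * m); // A stored k x m, column-major
    cudaMalloc(&b, sizeof(float) * k * n); // B stored k x n, column-major
    cudaMalloc(&c, sizeof(float) * m * n); // C stored m x n, column-major

    cublasHandle_t handle;
    cublasCreate(&handle);
    float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                m, n, k, &alpha, a, k, b, k, &beta, c, m);

    cublasDestroy(handle);
    cudaFree(a); cudaFree(b); cudaFree(c);
}
```

(Compile with `nvcc -lcublas`.)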
Okay, I tried the above and it didn't make that much difference. Of the variants I tried, the second one made `sum_transposed_filters` much slower.
I'm thinking the biggest win here will be coalescing global memory access, so I'll need to study up on that section of the guide (9.2.1).
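For context, the pattern that section describes, in toy form (illustrative only): consecutive threads in a warp reading consecutive addresses are serviced by a few wide memory transactions, while a strided pattern scatters the warp across many cache lines.

```cuda
// Coalesced: thread t reads address t, so a warp covers one contiguous span.
__global__ void coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: adjacent threads read addresses `stride` floats apart, so a warp
// touches up to 32 separate cache lines.
__global__ void strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];
}
```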
I'm going to close this for now - I think we are as close as we are going to get here.
I think the next big optimization is going to be using cuDNN for this. I did some testing with PyTorch, and if you disable cuDNN we are actually pretty close to its performance!
Currently in convolutional networks, the `unfold_output` and `unfold_input` kernels take a significant portion of the time. This issue is about optimizing them to reduce this time.
My guess is there are some things we can do with multi-dimensional block/grid dims, as well as reducing the number of branches in the kernels; a rough sketch of the branch-reduction idea follows.
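To illustrate, a hedged sketch (assuming stride 1, a zero-initialized patches buffer, and illustrative names; this is not dfdx's kernel): clamping the k1/k2 loop bounds once per pixel removes the per-iteration bounds branches entirely.

```cuda
#include <cuda_runtime.h>

// Hypothetical: unfold with loop bounds clamped up front instead of a
// bounds check on every (k1, k2) iteration. Out-of-bounds patch entries
// are simply never written, so `patches` must start zeroed.
__global__ void unfold_clamped(const float *img, float *patches,
                               int H, int W, int OH, int OW,
                               int K, int padding)
{
    int ox = blockIdx.x * blockDim.x + threadIdx.x;
    int oy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ox >= OW || oy >= OH) return;

    // Valid offsets satisfy 0 <= oy + k1 - padding < H, so the branches
    // collapse into loop-limit arithmetic done once per pixel.
    int k1_lo = max(0, padding - oy), k1_hi = min(K, H + padding - oy);
    int k2_lo = max(0, padding - ox), k2_hi = min(K, W + padding - ox);
    for (int k1 = k1_lo; k1 < k1_hi; k1++)
        for (int k2 = k2_lo; k2 < k2_hi; k2++)
            patches[((k1 * K + k2) * OH + oy) * OW + ox] =
                img[(oy + k1 - padding) * W + (ox + k2 - padding)];
}

// launch: dim3 block(16, 16), grid((OW + 15) / 16, (OH + 15) / 16);
```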