Rewrote the module structure: it is now consistent with the PyTorch dispatcher.
Code refactoring
Python API enhancements: new weight initialization schemes and depthwise convolution emulation.
Small performance improvements, but on GPU the shift operation is still 100x slower than I expected on the forward pass and 10x slower on the backward pass (compared with a 3x3 depthwise convolution).
Torchshifts 3.0: