Incorrect correlation result if kernel_size > 1

The correlation result is incorrect for any kernel_size > 1. A minimal reproduction: correlating two 3x3 images of all ones with a 3x3 kernel should yield a scalar output of 1 (9 / 9), but the layer returns 0.5556.

>>> import torch, numpy as np
>>> from networks.correlation_package.correlation import Correlation

>>> N, C, H, W = (1, 1, 3, 3)
>>> x1 = torch.tensor(np.ones((N, C, H, W)))
>>> x2 = torch.tensor(np.ones((N, C, H, W)))
>>> correlation = Correlation(pad_size=0, kernel_size=3, max_displacement=0, stride1=1, stride2=1)
>>> correlation(x1.cuda(), x2.cuda())
tensor([[[[0.5556]]]], device='cuda:0')
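
For comparison, the expected value can be checked on the CPU without the CUDA extension. This is only a sketch: `reference_correlation` is a hypothetical helper, and it assumes the layer normalizes the summed element-wise product by the number of window elements (times one channel here), which is what the 9 / 9 figure above implies.

```python
import numpy as np

def reference_correlation(a, b):
    """Mean of the element-wise product over one kernel-sized window.

    Assumption: single channel, zero displacement, and normalization by the
    number of elements in the window (kernel_size * kernel_size).
    """
    return float((a * b).sum() / a.size)

patch = np.ones((3, 3))
expected = reference_correlation(patch, patch)
print(expected)  # 1.0
```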

The erroneous result of 5 / 9 (0.5556) comes from an incorrect index calculation for the kernel offset, which produces negative offsets and therefore reads outside the allocated memory. https://github.com/NVIDIA/flownet2-pytorch/blob/71034046166735a79a5b82df78de72d806e82842/networks/correlation_package/correlation_cuda_kernel.cu#L114-L127
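
The 5 / 9 figure can be reproduced with plain index arithmetic. The sketch below is not the kernel code. It assumes that the window is anchored at position (0, 0) when pad_size and max_displacement are both 0, that offsets run from -kernel_rad to +kernel_rad, that the flat index is row-major, and that out-of-bounds reads happen to contribute 0 (in practice such reads are undefined).

```python
H = W = 3
kernel_size = 3
kernel_rad = (kernel_size - 1) // 2
x1 = y1 = 0  # assumed window anchor for pad_size=0, max_displacement=0

in_bounds = 0
total = 0
for j in range(-kernel_rad, kernel_rad + 1):
    for i in range(-kernel_rad, kernel_rad + 1):
        # Assumed row-major flat index into a single-channel H x W image.
        flat = (y1 + j) * W + (x1 + i)
        total += 1
        if 0 <= flat < H * W:
            in_bounds += 1  # only these reads land inside the tensor

print(in_bounds, total)  # 5 9
```

Under these assumptions, only flat indices 0 through 4 fall inside the image, and the remaining four are negative, which matches the observed 5 / 9.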

Incorrect correlation result if kernel_size > 1 #194