ZettaAI / zetta_utils

2D downsampling of uint8 data inefficient #737

Open nkemnitz opened 4 months ago

nkemnitz commented 4 months ago

All our tensors are passed to torch as NCXYZ and converted to float32. That's not just an extra copy, it's also 4x the memory.

Another thing to consider: CloudVolume data is already in Fortran order, which is what tinybrain expects.
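
For scale: a 4096 x 4096 uint8 patch is 16 MiB, and the float32 round trip quadruples that to 64 MiB. A quick sanity check in plain numpy (nothing zetta_utils-specific):

import numpy as np

patch = np.zeros((1, 1, 4096, 4096, 1), dtype=np.uint8, order='F')
print(patch.nbytes // 2**20)                     # 16 (MiB as uint8)
print(patch.astype(np.float32).nbytes // 2**20)  # 64 (MiB as float32, 4x)
print(patch.flags['F_CONTIGUOUS'])               # True: the layout tinybrain wants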

import numpy as np
import tinybrain
import torch

data = np.asfortranarray(np.random.randint(0, 255, size=(1, 1, 4096, 4096, 1), dtype=np.uint8))

# Torch CPU, uint8 -> float32 -> uint8
%timeit torch.nn.functional.interpolate(torch.from_numpy(data).float(), scale_factor=[0.5, 0.5, 1.0], mode='trilinear').byte()
84 ms ± 774 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Torch GPU, uint8 -> float32 -> uint8
%timeit torch.nn.functional.interpolate(torch.from_numpy(data).cuda().float(), scale_factor=[0.5, 0.5, 1.0], mode='trilinear').byte().cpu()
6.34 ms ± 32.5 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# Torch CPU, uint8
%timeit torch.nn.functional.interpolate(torch.from_numpy(data).squeeze(-1), scale_factor=[0.5, 0.5], mode='bilinear')
162 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Torch GPU, uint8
%timeit torch.nn.functional.interpolate(torch.from_numpy(data).cuda().squeeze(-1), scale_factor=[0.5, 0.5], mode='bilinear').cpu()
RuntimeError: "upsample_bilinear2d_out_frame" not implemented for 'Byte'

# Tinybrain, uint8 -> float32 -> uint8
%timeit tinybrain.downsample_with_averaging(data.astype(np.float32).squeeze((0, 1)), factor=[2, 2])[0].astype(np.uint8)
32.1 ms ± 254 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Tinybrain, uint8
%timeit tinybrain.downsample_with_averaging(data.squeeze((0, 1)), factor=[2, 2])[0]
1.45 ms ± 12.9 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
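
For reference, a minimal sketch of what the swap could look like. This is not the zetta_utils API; the wrapper name and the NCXYZ/singleton-Z assumptions are mine. It routes uint8 arrays through tinybrain and keeps the float32 torch path for everything else:

import numpy as np
import tinybrain
import torch

def downsample_xy(data: np.ndarray, factor: int = 2) -> np.ndarray:
    # Hypothetical helper: `data` is NCXYZ with N = C = Z = 1,
    # as in the benchmark above.
    if data.dtype == np.uint8:
        # tinybrain averages natively in uint8 and expects Fortran order,
        # which CloudVolume chunks already satisfy.
        img = np.asfortranarray(data.squeeze((0, 1)))
        out = tinybrain.downsample_with_averaging(img, factor=[factor, factor])[0]
        return out[np.newaxis, np.newaxis, ...]
    # Fallback: current behavior via torch trilinear interpolation.
    t = torch.from_numpy(np.ascontiguousarray(data)).float()
    t = torch.nn.functional.interpolate(
        t, scale_factor=[1 / factor, 1 / factor, 1.0], mode='trilinear'
    )
    return t.numpy().astype(data.dtype)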

supersergiy commented 4 months ago

It's unexpected to me that performance matters here. I thought interpolation would be mostly bound by bandwidth.

nkemnitz commented 4 months ago

Just checked: downloading a 4k x 4k uint8 JPG patch takes 100-150 ms, which is similar to the current downsampling time.

supersergiy commented 4 months ago

Wow, that's a crazy fast download! But doesn't that also mean there's basically no inefficiency if we use pipelining? Then again, it may not matter either way: we could just swap in tinybrain in place of the default torch behavior. It's not a hard fix.
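
For what it's worth, the pipelining argument in code form. This is a toy sketch; fetch and downsample are hypothetical stand-ins for the real download and downsample steps. As long as downsampling a chunk (~84 ms on CPU) is faster than downloading the next one (100-150 ms), its cost hides behind the download:

from concurrent.futures import ThreadPoolExecutor

def process(chunk_ids, fetch, downsample):
    # Overlap network and compute: while chunk N is being downsampled,
    # chunk N+1 is already downloading.
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch, chunk_ids[0])
        for cid in chunk_ids[1:]:
            data = pending.result()
            pending = pool.submit(fetch, cid)  # prefetch the next chunk
            results.append(downsample(data))   # overlaps with the download
        results.append(downsample(pending.result()))
    return results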