You should rescale the input images and ground-truth masks to 1024x1024 pixels to make them divisible by the token size.
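For example, a minimal resizing sketch, assuming standard PyTorch (N, C, H, W) tensors; the interpolation modes are my own choice, not necessarily the repo's actual preprocessing:

```python
import torch
import torch.nn.functional as F

image = torch.randn(1, 3, 1000, 1000)           # (N, C, H, W) input image
mask = torch.randint(0, 2, (1, 1, 1000, 1000))  # ground-truth mask

# Bilinear for the image; nearest for the mask so label values stay discrete.
image = F.interpolate(image, size=(1024, 1024), mode='bilinear', align_corners=False)
mask = F.interpolate(mask.float(), size=(1024, 1024), mode='nearest').long()

print(image.shape, mask.shape)  # [1, 3, 1024, 1024] and [1, 1, 1024, 1024]
```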
Ok, thanks!
Given an image of size (3, 1000, 1000), the returned tensor has shape (3, 4, 4, 256, 256). However, with an overlap of 64 pixels the shape should be (3, 5, 5, 256, 256) so that the windows cover the entire image.
Illustration:
Input:

```python
import torch

a = torch.arange(1000)
b = a.unfold(0, 256, 256 - 64)  # window size 256, stride 192
print(b)
```
Output:

```
tensor([[  0,   1,   2,  ..., 253, 254, 255],
        [192, 193, 194,  ..., 445, 446, 447],
        [384, 385, 386,  ..., 637, 638, 639],
        [576, 577, 578,  ..., 829, 830, 831]])
```
The remaining region from pixel 832 to 999 is ignored during inference.
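One possible workaround is to pad the image on the right and bottom so the window/stride tiling covers every pixel. A minimal sketch, assuming the tiling uses `Tensor.unfold` as in the illustration above (`pad_to_cover` is a hypothetical helper, not part of this repo):

```python
import torch
import torch.nn.functional as F

window, overlap = 256, 64
stride = window - overlap  # 192

x = torch.randn(3, 1000, 1000)

# Extra pixels needed so (size - window) is divisible by stride;
# for size 1000 this is 24, padding the image to 1024.
def pad_to_cover(size):
    if size <= window:
        return window - size
    rem = (size - window) % stride
    return 0 if rem == 0 else stride - rem

ph, pw = pad_to_cover(x.shape[1]), pad_to_cover(x.shape[2])
x = F.pad(x, (0, pw, 0, ph))  # (left, right, top, bottom) on the last two dims

tiles = x.unfold(1, window, stride).unfold(2, window, stride)
print(tiles.shape)  # torch.Size([3, 5, 5, 256, 256])
```

With the padded 1024x1024 input, the last window starts at pixel 768 and ends at 1023, so nothing is dropped.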