Instead of downsampling the original image, split it into multiple patches that can be passed through the model in parallel as a batch.
E.g. for an image of size 1024x1024:
Train a model on image size 128x128
1.1 On random crops
1.2 On grid-based crops
1.3 Try to add a positional embedding (similar to a timestep embedding, but encoding the patch location in the image)
(likely unnecessary for uniform textures, as opposed to objects, e.g. leather vs. hazelnut)
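The patch-location embedding from 1.3 could be built the same way as a sinusoidal timestep embedding, just applied to the (row, col) grid position of the patch. A minimal sketch (function name, dimensions, and the 8x8 grid are illustrative assumptions, not a fixed design):

```python
import numpy as np

def patch_pos_embedding(row, col, dim=64):
    """Sinusoidal embedding of a patch's (row, col) grid position,
    analogous to a diffusion timestep embedding. Half the channels
    encode the row index, half the column index."""
    half = dim // 2  # channels per coordinate
    # geometric frequency schedule, as in standard sinusoidal embeddings
    freqs = np.exp(-np.log(10000.0) * np.arange(half // 2) / (half // 2))

    def encode(x):
        angles = x * freqs
        return np.concatenate([np.sin(angles), np.cos(angles)])

    return np.concatenate([encode(row), encode(col)])  # shape: (dim,)
```

The embedding would then be added to (or concatenated with) the model's conditioning input for each patch, so the model can learn location-dependent appearance.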
During anomaly detection, split the 1024x1024 image into 64 patches of size 128x128
Run the image patches as a batch through the model
Rejoin the patches to recover the high-resolution input and its reconstruction, then continue the anomaly-scoring pipeline as usual
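The split/batch/rejoin steps above can be sketched as follows (NumPy, non-overlapping patches; the `model` call is a hypothetical placeholder for the trained reconstruction model):

```python
import numpy as np

def split_into_patches(image, patch=128):
    """Split an HxW image into non-overlapping patches.
    Returns a batch of shape (n, patch, patch) plus the grid layout.
    For a 1024x1024 input this yields 64 patches on an 8x8 grid."""
    h, w = image.shape[:2]
    rows, cols = h // patch, w // patch
    batch = np.array([image[r * patch:(r + 1) * patch,
                            c * patch:(c + 1) * patch]
                      for r in range(rows) for c in range(cols)])
    return batch, (rows, cols)

def join_patches(patches, grid):
    """Inverse of split_into_patches: reassemble the full image
    from the row-major batch of patches."""
    rows, cols = grid
    return np.concatenate(
        [np.concatenate(list(patches[r * cols:(r + 1) * cols]), axis=1)
         for r in range(rows)],
        axis=0)

# Usage sketch:
#   batch, grid = split_into_patches(img)   # (64, 128, 128)
#   recon = model(batch)                     # hypothetical: patches run in parallel
#   full_recon = join_patches(recon, grid)   # back to 1024x1024
```

Since the patches are non-overlapping, the round trip is exact; if visible seams appear in the reconstruction, overlapping patches with blending would be the natural extension.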