I noticed that the DataLoader applies only one transform: ToTensor().
Why don't you normalize the images (mean, std) before the first ViT layers?
Hi @GhostLate, we follow OSX in training the transformer backbones, and we didn't run extensive experiments on the training details. Some tuning here and there may still be useful.
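
For anyone who wants to try the normalization discussed above, here is a minimal sketch of what it computes, written in plain NumPy so it runs without the repository. The ImageNet mean/std values and the 224x224 random input are assumptions for illustration, not settings taken from this codebase or from OSX. The helper names `to_tensor` and `normalize` are hypothetical.

```python
import numpy as np

# ImageNet channel statistics (RGB). Assumption: these are the usual defaults
# for pretrained ViT backbones, not values confirmed by this repository.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def to_tensor(image_hwc_uint8):
    """Mimic ToTensor(): HWC uint8 in [0, 255] -> CHW float32 in [0, 1]."""
    scaled = image_hwc_uint8.astype(np.float32) / 255.0
    return np.transpose(scaled, (2, 0, 1))


def normalize(image_chw, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Apply per-channel (x - mean) / std to a CHW image."""
    return (image_chw - mean[:, None, None]) / std[:, None, None]


# Hypothetical input: a random 224x224 RGB image standing in for a real crop.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)

x = normalize(to_tensor(image))
print(x.shape, x.dtype)
```

In a torchvision pipeline the equivalent change would be to compose `transforms.Normalize(mean, std)` after `transforms.ToTensor()`. Whether it helps here is untested; pretrained backbone weights usually expect the same statistics they were trained with.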