Hi,
I observed a 'nan' loss when using an 'RTX A6000 Ada' GPU and training ByteFormer with the config file 'examples/byteformer/imagenet_file_encodings/encoding_type=TIFF.yaml'.
The nan loss still appeared after switching the GPU to an 'RTX 4090'.
Did you also see the 'nan' loss when training ByteFormer on ImageNet, or did training run cleanly on your side?
The module versions used for training:
cvnets 0.3
torch 1.13.1+cu117
torchaudio 0.13.1+cu117
torchtext 0.14.1
torchvision 0.14.1+cu117
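In case it helps narrow things down, below is a minimal, generic PyTorch sketch of how one can catch the first step at which the loss becomes nan (this is an illustrative toy loop, not the actual cvnets/ByteFormer training entry point; the model, data, and loss here are placeholders):

```python
import torch

# Report the backward op that first produces nan/inf (slows training; debug only).
torch.autograd.set_detect_anomaly(True)

# Placeholder model and data, standing in for the real ByteFormer training loop.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(3):
    x = torch.randn(8, 4)
    y = torch.randn(8, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    # Fail fast on the first nan loss so the offending step/batch is known.
    if torch.isnan(loss):
        raise RuntimeError(f"nan loss first observed at step {step}")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"step {step}: loss is finite = {torch.isfinite(loss).item()}")
```

With anomaly detection enabled, the traceback points at the forward op whose backward produced the nan, which can help distinguish a data/encoding issue from a numerical one (e.g. mixed-precision overflow).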