Model outputs NaNs with PyTorch >= 2.0.1

aipixel / AEMatter

Another matter.

GNU General Public License v2.0


Model outputs NaNs with PyTorch >= 2.0.1 #2

Closed (jacobbieker closed 1 year ago)

jacobbieker commented 1 year ago

Hi,

I've been trying out this model, and it looks really great and is simple to use. However, with PyTorch >= 2.0.1 on CUDA the model outputs NaNs, while it works fine on PyTorch 1.13.1.

Windaway commented 1 year ago

Thank you. I inspected the intermediate features under PyTorch 2.0 and found that some of them differ in the last digit. These small differences accumulate through the network and eventually cause the model to crash. I will try training the model with PyTorch 2.0 to make it work.
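For anyone who wants to reproduce this kind of inspection, the following is a minimal sketch of one way to find the first layer whose output stops being finite, using forward hooks. It is not the author's debugging code: `find_first_nonfinite`, `Explode`, and the toy network are hypothetical names made up for illustration, and the toy network stands in for the real model.

```python
import torch
import torch.nn as nn


def find_first_nonfinite(model, inputs):
    """Run one forward pass and return the name of the first leaf module
    whose output contains NaN or Inf, or None if every output is finite."""
    first_bad = []
    handles = []

    def make_hook(name):
        def hook(module, args, output):
            # Record only the first offending module.
            if isinstance(output, torch.Tensor) and not first_bad:
                if not torch.isfinite(output).all():
                    first_bad.append(name)
        return hook

    for name, module in model.named_modules():
        if len(list(module.children())) == 0:  # leaf modules only
            handles.append(module.register_forward_hook(make_hook(name)))

    try:
        with torch.no_grad():
            model(inputs)
    finally:
        # Always detach the hooks so the model is left unchanged.
        for handle in handles:
            handle.remove()

    return first_bad[0] if first_bad else None


class Explode(nn.Module):
    """Hypothetical layer that deliberately produces NaN (inf * 0)."""

    def forward(self, x):
        return x * float("inf") * 0.0


# Toy stand-in for the real network (assumption, not AEMatter itself).
toy = nn.Sequential(nn.Linear(4, 4), Explode(), nn.Linear(4, 2))
bad_layer = find_first_nonfinite(toy, torch.randn(1, 4))

healthy = nn.Sequential(nn.Linear(4, 4), nn.ReLU())
clean_layer = find_first_nonfinite(healthy, torch.randn(1, 4))
```

Running the same helper on the real model under PyTorch 1.13.1 and 2.x with identical inputs should narrow down where the two versions start to diverge.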

Windaway commented 1 year ago

In the meantime, you can run the model on the CPU.
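A rough sketch of the CPU workaround, assuming a standard `torch.nn.Module`: the small convolutional network and the 4-channel input below are placeholders, not the actual AEMatter model or its input format, so swap in the real model and checkpoint.

```python
import torch
import torch.nn as nn

# Placeholder network (assumption); replace with the real model
# loaded from its checkpoint.
model = nn.Sequential(
    nn.Conv2d(4, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 1, kernel_size=3, padding=1),
)

# Force CPU execution in full precision and switch to inference mode.
device = torch.device("cpu")
model = model.to(device).float().eval()

# Dummy input batch: 1 image, 4 channels, 64x64 pixels (assumption).
image = torch.rand(1, 4, 64, 64, device=device)

with torch.no_grad():
    alpha = model(image)

# Sanity check: True when the output contains no NaN or Inf values.
all_finite = bool(torch.isfinite(alpha).all())
```

Checking `all_finite` after each run is a cheap way to confirm that the CPU path really avoids the NaNs seen on CUDA.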

jacobbieker commented 1 year ago

Okay, thank you! Looking forward to the PyTorch 2 version, and will use the CPU for now.

Windaway commented 1 year ago

You can temporarily use the PT20 branch with this checkpoint, https://mega.nz/file/mVIUATIC#kBQhbHKq9op5KmCbQ5NB-klS7bpl8H_ba4PycsBlkiQ, to test real-world images. I have moved the LayerNorm and Attention positions to keep the model from crashing, especially in FP16 mode. We are still training models that perform well on synthetic datasets.
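As general background on why FP16 is fragile (not confirmed in this thread as the cause here), the sketch below shows that a value that fits comfortably in FP32 overflows to infinity once it is stored as FP16, and that subtracting two infinities, as a mean-centering step would, yields NaN. The numbers are arbitrary illustrations.

```python
import torch

# Largest finite value representable in float16.
fp16_max = torch.finfo(torch.float16).max  # 65504.0

# An activation of 300 squared gives 90000, which is fine in float32 ...
x = torch.tensor([300.0, 300.0])
sq32 = x * x

# ... but it exceeds the float16 range and becomes inf when stored as FP16.
sq16 = sq32.to(torch.float16)

# Any later subtraction of two infinities (for example, removing the mean
# during normalization) turns into NaN. Computed in float32 for clarity.
centered = sq16.float() - sq16.float().mean()

overflowed = bool(torch.isinf(sq16.float()).all())
became_nan = bool(torch.isnan(centered).all())
```

This is one reason the placement of normalization relative to attention can matter in half precision: normalizing before values grow large keeps them inside the representable range.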

Windaway commented 1 year ago

You can now check out the PT20 branch at https://github.com/QLYoo/AEMatter/tree/PT20.