ControlNet / LAV-DF

[CVIU] Glitch in the Matrix: A Large Scale Benchmark for Content Driven Audio-Visual Forgery Detection and Localization
https://www.sciencedirect.com/science/article/pii/S1077314223001984

BA-TFD+ training goes to NaN #24

Closed aliceinland closed 2 months ago

aliceinland commented 2 months ago

Dear author,

I was trying to replicate your results. I downloaded your dataset and followed the instructions on the GitHub page. However, before reaching the end of epoch 0, at around step 39K, the loss goes to NaN. I did not modify any of the pre-defined parameters in the code. Do you have any suggestions about what could be causing the issue? The same issue is not present when I train BA-TFD.

ControlNet commented 2 months ago

Hi,

I think one possible reason is that you're using float16 for training. Sometimes the values exceed the representable range of float16 and become NaN. Using float32 should solve it.
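For reference, here is a minimal sketch of forcing full float32 training with a PyTorch Lightning Trainer (assuming the training script is Lightning-based, as the --precision 32 flag mentioned later in the thread suggests; the exact arguments shown are illustrative):

import pytorch_lightning as pl

# precision=32 keeps parameters and activations in float32, avoiding the
# narrow dynamic range of float16 that can overflow to inf/NaN.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    precision=32,           # full float32 instead of 16-bit mixed precision
    gradient_clip_val=1.0,  # optional: clipping also guards against exploding gradients
)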

aliceinland commented 2 months ago

Hello, thank you for the reply! I am using float32 (with high matmul precision) and I still get NaN values. Here is the line of code that I added, in addition to passing --precision 32 to the parser:

torch.set_float32_matmul_precision('high')
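As a side note, 'high' still permits TensorFloat-32 (TF32) matmuls on Ampere-class GPUs, which trades some mantissa bits for speed; if reduced matmul precision were a suspect here, a stricter setting (shown only as an illustrative alternative) would be:

import torch

# 'highest' disables TF32 for float32 matmuls and keeps full float32 accuracy.
torch.set_float32_matmul_precision('highest')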

ControlNet commented 2 months ago

What is the batch size you're using?

aliceinland commented 2 months ago

The batch size was 8. By changing the GPU model, the issue was solved! Thank you :)
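For anyone hitting a similar NaN on different hardware, a minimal sketch of catching the first non-finite loss early (the helper name is illustrative, not part of the LAV-DF code):

import torch

def check_finite(loss: torch.Tensor, step: int) -> None:
    # Fail fast when the loss leaves the finite range, so the offending
    # batch/step can be inspected instead of training continuing with NaN gradients.
    if not torch.isfinite(loss):
        raise RuntimeError(f"Non-finite loss {loss.item()} at step {step}")

PyTorch Lightning's Trainer(detect_anomaly=True) can also help pinpoint the operation that first produces a NaN, at some cost in training speed.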