alexandrosstergiou / SoftPool

[ICCV 2021] Code for approximated exponential maximum pooling
MIT License
288 stars 52 forks

find out nan in tensor #12

Closed PJJie closed 3 years ago

PJJie commented 3 years ago

When I replace multiple max pooling layers with SoftPool layers, I find NaN values in the tensor.

alexandrosstergiou commented 3 years ago

Returned NaN values are quite common when using CUDA, since it is a low-level language and does not perform any internal checks for numerical overflow or underflow. PyTorch itself provides a range of functions (e.g. torch.nan_to_num()) to deal with such cases. Simply wrapping your output with one of these functions should alleviate the issue.
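A minimal sketch of the suggested workaround (the tensor here is a hypothetical pooled output, used purely for illustration):

```python
import torch

# Hypothetical pooled output containing a NaN (for illustration only).
pooled = torch.tensor([1.0, float("nan"), 3.0])

# Replace NaN entries (and optionally +/-inf via posinf/neginf) with finite values.
cleaned = torch.nan_to_num(pooled, nan=0.0)

print(cleaned)  # tensor([1., 0., 3.])
```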

I am also planning on including this in the coming repo commits.

Best, Alex

MaxChanger commented 2 years ago

Hi @alexandrosstergiou, I would like to know whether this bug has been fixed, or if there has been any progress. I'm also using SoftPool in a project and I don't have this problem myself, but other people have reported it in my project: https://github.com/haomo-ai/MotionSeg3D/issues/6

alexandrosstergiou commented 2 years ago

Hi @MaxChanger. Most NaN-value problems in fwd/bwd calls have been fixed since torch 1.6, where torch.amp was integrated along with its decorators for custom functions. After commit f49fd84, I had stable runs in both full and mixed precision settings over different GPUs, environments, and configurations. Since then I have not noticed any NaN values occurring while training in other projects.
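For reference, the torch.cuda.amp decorators mentioned here attach to a custom autograd Function roughly as below. This is a hedged sketch, not the actual SoftPool CUDA kernel: the exponentially weighted pooling is re-expressed with native torch ops, and subtracting the per-window maximum inside the exponential is a standard trick (assumed here, not taken from the repo) to avoid the fp16/fp32 overflows that produce NaNs.

```python
import torch
from torch.cuda.amp import custom_fwd, custom_bwd

class SoftPoolLike(torch.autograd.Function):
    """Illustrative stand-in for a custom CUDA op: exponentially
    weighted pooling over the last dimension."""

    @staticmethod
    @custom_fwd(cast_inputs=torch.float32)  # under autocast, run forward in fp32
    def forward(ctx, x):
        ctx.save_for_backward(x)
        # Subtract the max before exponentiating so exp() cannot overflow;
        # the weights are rescaled identically, so the output is unchanged.
        w = torch.exp(x - x.max(dim=-1, keepdim=True).values)
        return (x * w).sum(dim=-1) / w.sum(dim=-1)

    @staticmethod
    @custom_bwd  # gradients arrive in the dtype the forward produced
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        x = x.detach().requires_grad_(True)
        # Recompute the forward under enable_grad and differentiate it,
        # rather than hand-deriving the gradient.
        with torch.enable_grad():
            w = torch.exp(x - x.max(dim=-1, keepdim=True).values)
            out = (x * w).sum(dim=-1) / w.sum(dim=-1)
        (grad_x,) = torch.autograd.grad(out, x, grad_out)
        return grad_x
```

Without the max-subtraction, inputs with large magnitudes would drive exp() to inf and the division to NaN, which matches the symptom reported in this issue.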

Perhaps it would be worth suggesting to anyone opening an issue in your project that they re-install the latest version of SoftPool and ensure they are using torch >= 1.7 (preferably the latest), just to be sure?
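For users who do still hit the problem, a small forward hook can help localize which layer first emits NaNs (a hypothetical debugging aid, not part of SoftPool itself; the model below is a placeholder):

```python
import torch

def nan_check_hook(module, inputs, output):
    # Raise as soon as any module emits NaNs, naming the culprit.
    if isinstance(output, torch.Tensor) and torch.isnan(output).any():
        raise RuntimeError(f"NaN in output of {module.__class__.__name__}")

# Hypothetical usage: attach the hook to every submodule of a model.
model = torch.nn.Sequential(torch.nn.Linear(3, 3), torch.nn.ReLU())
for m in model.modules():
    m.register_forward_hook(nan_check_hook)
```

Running a forward pass then fails loudly at the first offending layer instead of silently propagating NaNs into the loss.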

MaxChanger commented 2 years ago

Hi @alexandrosstergiou. Thank you for your kind reply. I have conducted nearly a hundred experiments on 4~5 different GPU servers and have not encountered this issue (NaN) either, so I thought your project was robust enough. After your confirmation, I am more at ease, and I will also work with the others to confirm this issue.