NaN loss with baseline model

JunMa11 / NeurIPS-CellSeg

Naive baseline for microscopy image segmentation challenge in NeurIPS 2022

Apache License 2.0


NaN loss with baseline model #6

Closed (hasukmin12 closed this issue 1 year ago)

hasukmin12 commented 1 year ago

Hi,

I keep getting NaN loss at around epoch 20. I haven't changed anything in the code yet; I just ran it. It keeps happening even if I use other networks such as SwinUNETR or UNETR.

JunMa11 commented 1 year ago

Please make sure your MONAI version is 0.9.
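
A quick way to confirm which MONAI release is actually installed in the training environment is to query the package metadata. The sketch below uses only the Python standard library. The helper names `installed_version` and `matches_release` are made up for illustration, and treating 0.9.x patch releases as matching "0.9" is an assumption, not something confirmed in this thread.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_release(version, wanted="0.9"):
    """True if `version` equals `wanted` or is a patch release of it (e.g. 0.9.1).

    Assumption: patch releases count as a match; adjust if an exact pin is needed.
    """
    if version is None:
        return False
    return version == wanted or version.startswith(wanted + ".")


if __name__ == "__main__":
    found = installed_version("monai")
    print("monai:", found, "| matches 0.9.x:", matches_release(found))
```

The same call with `"torch"` reports the installed PyTorch version, which is useful when comparing environments across machines.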

JunMa11 commented 1 year ago

If you are using an older MONAI version, please try removing the data augmentation:

https://github.com/JunMa11/NeurIPS-CellSeg/blob/2df1ee5dc26b4ff10202da73ef22d72651e8e5bd/baseline/model_training_3class.py#L128-L148

hasukmin12 commented 1 year ago

Now I'm using MONAI 0.9.1, but it still happens.

hasukmin12 commented 1 year ago

When I check for this problem with the code below,

this error happens:

hasukmin12 commented 1 year ago

I really don't know why, but the inputs become NaN at around epoch 20.
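
To narrow down where NaN values first enter the pipeline, it can help to scan each batch before it reaches the network. The following is a minimal sketch of that idea, not the code originally posted in this thread. It works on flat sequences of Python floats so that it runs without any dependencies; `first_non_finite` and `check_batches` are hypothetical helper names. With PyTorch tensors, the equivalent test on a batch is `torch.isfinite(inputs).all()`.

```python
import math


def first_non_finite(values):
    """Return (index, value) of the first NaN or inf in a flat iterable, else None."""
    for i, v in enumerate(values):
        if not math.isfinite(v):
            return i, v
    return None


def check_batches(batches):
    """Scan batches in order.

    Returns (batch_index, element_index, value) for the first non-finite
    value found, or None if every value in every batch is finite.
    """
    for b, batch in enumerate(batches):
        hit = first_non_finite(batch)
        if hit is not None:
            return (b, *hit)
    return None


if __name__ == "__main__":
    # Toy data standing in for flattened input batches.
    batches = [[0.1, 0.2, 0.3], [0.4, float("nan"), 0.6]]
    print(check_batches(batches))
```

Running the check on the inputs right after data loading and augmentation, and again on the loss, shows whether the NaN originates in the data or in the model.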

JunMa11 commented 1 year ago

We cannot reproduce your error.

Please delete the data and re-run the preprocessing.
Have you tried removing the data augmentation?

JintuZheng commented 1 year ago

I got the same problem as @hasukmin12. My environments: torch 1.10/1.11/1.12+cu113, monai 1.0/0.9.1/0.9. If your environment matches one of these, you may get the NaN loss during training.

I have tried some solutions, and the following can work:

[1] Remove some of the data augmentations; that may make it run.

[2] Change the torch version to 1.8 and make sure your monai version is 0.9.

[3] However, I think the best idea is to use Docker.