SysCV / sam-hq

Segment Anything in High Quality [NeurIPS 2023]
https://arxiv.org/abs/2306.01567
Apache License 2.0

Training loss is Nan #87

Open sweetdream33 opened 11 months ago

sweetdream33 commented 11 months ago

Thank you very much for sharing the code.

When I run the training script command below, the training loss is reported as NaN, even after running multiple epochs.

While debugging, I checked where NaN values first appear: they show up after the for blk in self.blocks: x = blk(x) loop on line 119 of segment_anything_training/modeling/image_encoder.py.
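
For reference, a minimal sketch of the kind of per-block check that localizes this; the helper name and its usage are hypothetical and not part of the repository:

import torch
import torch.nn as nn

def find_first_nonfinite_block(blocks: nn.ModuleList, x: torch.Tensor) -> int:
    # Run x through the transformer blocks one by one (mirroring the
    # "for blk in self.blocks: x = blk(x)" loop in image_encoder.py) and
    # return the index of the first block whose output contains NaN/Inf,
    # or -1 if every output stays finite.
    for i, blk in enumerate(blocks):
        x = blk(x)
        if not torch.isfinite(x).all():
            return i
    return -1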

My environment settings are as follows. Thank you for your help.

torch 1.12.0a0+8a1a93a
torch-tensorrt 1.1.0a0
torchtext 0.13.0a0
torchvision 0.13.0a0
Python 3.8.13

!python -m torch.distributed.launch --nproc_per_node=1 ./train.py --checkpoint ./pretrained_checkpoint/sam_vit_h_4b8939.pth --model-type vit_h --output work_dirs/hq_sam_h

Local rank: 0
--- create training dataloader ---
------------------------------ train --------------------------------
--->>> train dataset 0 / 1 ECSSD <<<---
-im- ECSSD ./data/cascade_psp/ecssd : 1000
-gt- ECSSD ./data/cascade_psp/ecssd : 1000
250 train dataloaders created
--- create valid dataloader ---
------------------------------ valid --------------------------------
--->>> valid dataset 0 / 1 MSRA10K <<<---
-im- MSRA10K ./data/cascade_psp/MSRA_10K : 10000
-gt- MSRA10K ./data/cascade_psp/MSRA_10K : 10000
1 valid dataloaders created
--- define optimizer ---

return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
[  0/250]  eta: 0:20:18  training_loss: nan (nan)  loss_mask: nan (nan)  loss_dice: nan (nan)  time: 4.8754  data: 0.8692  max mem: 18409
[249/250]  eta: 0:00:02  training_loss: nan (nan)  loss_mask: nan (nan)  loss_dice: nan (nan)  time: 2.3650  data: 0.0020  max mem: 18825
Total time: 0:09:54 (2.3782 s / it)
Finished epoch: 0
Averaged stats: training_loss: nan (nan)  loss_mask: nan (nan)  loss_dice: nan (nan)
Validating...
valid_dataloader len: 10000
[    0/10000]  eta: 2:20:10  val_iou_0: 0.0000 (0.0000)  val_boundary_iou_0: 0.0000 (0.0000)  time: 0.8410  data: 0.2587  max mem: 18825
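
One detail worth noting about this log: once a non-finite loss has been backpropagated, the model weights themselves become NaN, so every subsequent iteration will also report nan regardless of the input. A minimal sketch of a guard against that, assuming a generic PyTorch training step rather than the actual loop in train.py:

import torch

def training_step(model, batch, loss_fn, optimizer):
    # Hypothetical generic training step, not code from train.py.
    optimizer.zero_grad()
    pred = model(batch["image"])
    loss = loss_fn(pred, batch["label"])
    # Skip the update if the loss is NaN/Inf so the weights stay finite.
    if not torch.isfinite(loss):
        print("non-finite loss, skipping this batch")
        return None
    loss.backward()
    optimizer.step()
    return loss.item()

With such a guard in place, a single bad batch shows up as a skipped step instead of poisoning the rest of the run.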

ukaprch commented 9 months ago

From my limited experience, a 'nan' returned by a loss function may be caused by a CUDA out-of-memory condition and/or by a variable that did not have requires_grad=True set beforehand.
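
A rough way to check both hypotheses, assuming a generic PyTorch model rather than anything specific to this repository (the helper below is hypothetical):

import torch

def report_trainable_and_memory(model: torch.nn.Module) -> None:
    # Hypothetical helper, not repo code: list parameters that are frozen
    # (requires_grad=False) and report the peak CUDA memory seen so far.
    frozen = [name for name, p in model.named_parameters() if not p.requires_grad]
    print(f"frozen parameters: {len(frozen)}")
    for name in frozen[:10]:
        print("  ", name)
    if torch.cuda.is_available():
        peak_gib = torch.cuda.max_memory_allocated() / 1024**3
        print(f"peak CUDA memory allocated: {peak_gib:.2f} GiB")

(A hard CUDA out-of-memory error in PyTorch normally raises a RuntimeError rather than silently producing NaN, so the memory readout mainly helps confirm whether the run is close to the limit.)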