eyalbetzalel commented 1 year ago

Hi,

When I train the network from the beginning it works fine, but when I resume training from the checkpoint file (ViT for COCO from epoch 6 that is posted here) I get this issue:

The first batch works OK and the model outputs an accurate segmentation map.
After the first batch, the segmentation map becomes NaN (and the loss too).

I tried decreasing the LR and it didn't help.

Any ideas?
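
One way to narrow this kind of problem down is to check which tensor first becomes non-finite after a training step. The following is a minimal sketch in plain PyTorch, not the MAL training code: the toy `Linear` model and the helper name `first_nonfinite` are assumptions made for illustration.

```python
import torch

def first_nonfinite(named_tensors):
    """Return the name of the first tensor containing NaN or Inf, or None."""
    for name, tensor in named_tensors:
        if tensor is not None and not torch.isfinite(tensor).all():
            return name
    return None

# Toy stand-in model (an assumption, not the MAL network).
model = torch.nn.Linear(4, 2)
x = torch.randn(8, 4)
loss = model(x).pow(2).mean()
loss.backward()

# Check parameters and gradients after the backward pass.
bad_param = first_nonfinite(model.named_parameters())
bad_grad = first_nonfinite((n, p.grad) for n, p in model.named_parameters())
print("loss finite:", bool(torch.isfinite(loss)))
print("bad param:", bad_param, "| bad grad:", bad_grad)

# A deliberately corrupted weight is reported by name.
with torch.no_grad():
    model.weight[0, 0] = float("nan")
corrupted = first_nonfinite(model.named_parameters())
print("after corruption:", corrupted)
```

Running the same check after the optimizer step, rather than only after the backward pass, would show whether the parameters themselves are corrupted by the update even when the gradients are finite.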

voidrank commented 1 year ago

Hi @eyalbetzalel

How many times have you trained and what model did you use?

Best,

Shiyi

liusurufeng commented 1 year ago

@voidrank Hi, when I train MAL on my own dataset, the result is:

val/mIoU_small: 0.4333444 val/mIoU_medium: 0.523455 val/mIoU_large: nan

But when I try to generate the pseudo labels, all of the results are wrong. The details are as follows:

Validating: 100%| 11390/11390 [12:00<00:00, 12.46it/s]
val/mIoU: nan val/mIoU_small: nan val/mIoU_medium: nan val/mIoU_large: nan

I don't know what is causing this issue.

eyalbetzalel commented 1 year ago

> @voidrank Hi, when I train MAL on my own dataset, the result is: val/mIoU_small: 0.4333444 val/mIoU_medium: 0.523455 val/mIoU_large: nan
>
> But when I try to generate the pseudo labels, all of the results are wrong: val/mIoU: nan val/mIoU_small: nan val/mIoU_medium: nan val/mIoU_large: nan
>
> I don't know what is causing this issue.

I found a bug in their optimizer implementation. After switching it to SGD with momentum, the problem was solved.

```python
# Inside the model class (requires `import torch`).
def configure_optimizers(self):
    # Original optimizer, commented out:
    # optimizer = AdamWwStep(self.parameters(), eps=self.args.optim_eps,
    #                        betas=self.args.optim_betas,
    #                        lr=self._lr, weight_decay=self._wd)
    # Replacement: plain SGD with momentum.
    optimizer = torch.optim.SGD(self.parameters(), lr=self._lr, momentum=0.9)
    return optimizer
```

eyalbetzalel commented 1 year ago

> I found a bug in their optimizer implementation. after switching it to SGD with momentum the problem had been solved.

Hi, sorry I haven't responded. I missed your message. The problem, as mentioned in the comment above, was in the optimizer.

voidrank commented 1 year ago

Hi @eyalbetzalel, you will get NaN scores if you don't provide a label for specific categories.
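
To illustrate how a missing group can produce a NaN score, here is a minimal sketch in plain Python. It is not how MAL computes its metrics: the helper `mean_iou` and the toy IoU values are hypothetical. It only shows that averaging over a group with no labelled instances has no defined value and is typically reported as NaN.

```python
import math

def mean_iou(ious):
    """Mean of a list of IoU values; NaN when the list is empty."""
    return sum(ious) / len(ious) if ious else math.nan

# Hypothetical per-instance IoUs grouped by object size.
ious_by_size = {
    "small": [0.41, 0.45],
    "medium": [0.50, 0.55],
    "large": [],  # no labelled large instances in this toy dataset
}

scores = {size: mean_iou(values) for size, values in ious_by_size.items()}
for size, score in scores.items():
    print(f"val/mIoU_{size}: {score}")
```

If only one group is NaN, as with `val/mIoU_large` above, a first thing to check would be whether the validation annotations contain any instances for that group at all.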

NVlabs / mask-auto-labeler

NaN segmentation map when using Phase 1 cp #14
