dbolya / yolact

A simple, fully convolutional model for real-time instance segmentation.
MIT License

Moving average ignored a value of inf/nan when training on 200x200 images #528

Open zacurr opened 3 years ago

zacurr commented 3 years ago

I successfully trained YOLACT with a ResNet-18 backbone on 200x200 images using an older version of the yolact code (in that version, the latest merged commit is https://github.com/dbolya/yolact/commit/f46dc4385a41ed1f2df6716ecf6084081afcbec6 ), with this config:

yolact_resnet18_im200_config = yolact_base_config.copy({
    'name': 'yolact_resnet18_im200',

    'max_size': 200,
    'backbone': resnet18_backbone.copy({
        'selected_layers': list(range(1, 4)),
        'pred_scales': [[int(x[0] / yolact_base_config.max_size * 200)] for x in
                        yolact_base_config.backbone.pred_scales],
        'pred_aspect_ratios': yolact_base_config.backbone.pred_aspect_ratios,

        'use_pixel_scales': True,
        'preapply_sqrt': False,
        'use_square_anchors': True,
    }),
})
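To make the `pred_scales` expression in the config concrete, here is a minimal standalone sketch of the same arithmetic. The base values (`max_size` of 550 and scales of 24, 48, 96, 192, 384) are assumptions about `yolact_base_config` and are not stated in this issue, and `rescale_pred_scales` is a hypothetical helper name used only for illustration.

```python
# Assumed base values (not taken from this issue): yolact_base_config is
# taken to use max_size = 550 and these per-layer anchor scales.
BASE_MAX_SIZE = 550
BASE_PRED_SCALES = [[24], [48], [96], [192], [384]]


def rescale_pred_scales(base_scales, base_size, new_size):
    """Scale each anchor size by new_size / base_size, truncating to int.

    Mirrors the list comprehension in the config above; only the first
    scale of each layer is kept, as in that comprehension.
    """
    return [[int(s[0] / base_size * new_size)] for s in base_scales]


scaled = rescale_pred_scales(BASE_PRED_SCALES, BASE_MAX_SIZE, 200)
print(scaled)  # [[8], [17], [34], [69], [139]]
```

Under these assumptions the smallest anchor is only 8 pixels wide at 200x200, so very small boxes and anchors may be worth a look when checking for numerical problems.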

However, after I cloned the latest code and tried to train the same model again, this message was shown and the loss exploded: "Moving average ignored a value of inf/nan"
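For context, a warning like this usually means a loss term became non-finite and was left out of the running average that gets logged. The sketch below illustrates that idea with the standard library only. `FiniteMovingAverage` is a hypothetical class, not the repository's actual implementation, and it can help pinpoint the first iteration at which a loss turns into inf or nan.

```python
import math
from collections import deque


class FiniteMovingAverage:
    """Hypothetical helper: windowed average that skips inf/nan values."""

    def __init__(self, max_window_size=1000):
        self.window = deque(maxlen=max_window_size)
        self.ignored = 0  # number of non-finite values skipped so far

    def append(self, value):
        """Add a value; return False if it was non-finite and ignored."""
        value = float(value)
        if not math.isfinite(value):
            self.ignored += 1
            print(f"Warning: moving average ignored a value of {value}")
            return False
        self.window.append(value)
        return True

    def get_avg(self):
        """Average of the finite values currently in the window."""
        return sum(self.window) / len(self.window) if self.window else 0.0


avg = FiniteMovingAverage(max_window_size=3)
bad_steps = []
for step, loss in enumerate([2.0, 1.5, float("inf"), float("nan"), 1.0]):
    if not avg.append(loss):
        bad_steps.append(step)  # record where the loss first went bad

print(bad_steps, avg.get_avg())
```

Logging the first bad step, together with the individual loss terms at that step, narrows down which term diverges first.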

This happens either right after the warmup phase or in the middle of the second epoch, and it happens whether I use multiple GPUs or a single GPU.

It still happens even if I set these parameters to 0 (they do not exist in the old version): 'discard_box_width': 4 / 550, 'discard_box_height': 4 / 550.

The commits made after this one might be the cause of the instability: https://github.com/dbolya/yolact/commit/f46dc4385a41ed1f2df6716ecf6084081afcbec6

When I train the same model again with the old version, it does not happen.

cyrilzakka commented 3 years ago

@zacurr did you ever find a solution? And does the f46dc43 commit work?