Peterande / D-FINE

D-FINE: Redefine Regression Task of DETRs as Fine-grained Distribution Refinement 💥💥💥

Predictions become NaN after a certain number of epochs #43

Open · DmitriiSavin opened this issue 3 hours ago

DmitriiSavin commented 3 hours ago

Hi! Thank you for your work on this project!

I'm training the S model on a custom dataset, and I've encountered an issue after several successful epochs. Up to epoch 12, training progresses well: the loss decreases steadily and validation metrics improve. From epoch 12 onward, however, the model starts producing NaN predictions and the following check is triggered:

    if torch.isnan(outputs['pred_boxes']).any() or torch.isinf(outputs['pred_boxes']).any():
        print(outputs['pred_boxes'])
        # Strip the DDP 'module.' prefix from each key so the checkpoint can be
        # loaded into a plain (non-DDP) model later.
        state = {key.replace('module.', ''): value
                 for key, value in model.state_dict().items()}
        # Wrap it in the {'model': state_dict} layout used by the repo's checkpoints.
        dist_utils.save_on_master({'model': state}, "./NaN.pth")
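
For completeness, the saved checkpoint can later be reloaded for offline inspection; a minimal sketch, assuming the `{'model': state_dict}` layout saved above:

    import torch

    # Minimal sketch: reload the checkpoint saved above into a plain (non-DDP)
    # model for offline debugging. Assumes the {'model': state_dict} layout.
    ckpt = torch.load("./NaN.pth", map_location="cpu")
    model.load_state_dict(ckpt["model"])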

Here is what the CLI output looks like:

[screenshot of the CLI output]

Here is the config:

Not init distributed mode.
cfg:  {'task': 'detection', '_model': None, '_postprocessor': None, '_criterion': None, '_optimizer': None, '_lr_scheduler': None, '_lr_warmup_scheduler': None, '_train_dataloader': None, '_val_dataloader': None, '_ema': None, '_scaler': None, '_train_dataset': None, '_val_dataset': None, '_collate_fn': None, '_evaluator': None, '_writer': None, 'num_workers': 0, 'batch_size': None, '_train_batch_size': None, '_val_batch_size': None, '_train_shuffle': None, '_val_shuffle': None, 'resume': None, 'tuning': 'weight/hgnetv2/dfine_s_obj365.pth', 'epoches': 64, 'last_epoch': -1, 'use_amp': True, 'use_ema': True, 'ema_decay': 0.9999, 'ema_warmups': 2000, 'sync_bn': True, 'clip_max_norm': 0.1, 'find_unused_parameters': False, 'seed': 0, 'print_freq': 100, 'checkpoint_freq': 12, 'output_dir': './output/dfine_hgnetv2_s_obj2custom', 'summary_dir': None, 'device': '', 'yaml_cfg': {'task': 'detection', 'evaluator': {'type': 'CocoEvaluator', 'iou_types': ['bbox']}, 'num_classes': 1, 'remap_mscoco_category': False, 'train_dataloader': {'type': 'DataLoader', 'dataset': {'type': 'CocoDetection', 'img_folder': 'data/ce_marks/images/train', 'ann_file': 'data/ce_marks/annotations/instances_train.json', 'return_masks': False, 'transforms': {'type': 'Compose', 'ops': [{'type': 'RandomPhotometricDistort', 'p': 0.5}, {'type': 'RandomZoomOut', 'fill': 0}, {'type': 'RandomIoUCrop', 'p': 0.8}, {'type': 'SanitizeBoundingBoxes', 'min_size': 1}, {'type': 'RandomHorizontalFlip'}, {'type': 'Resize', 'size': [640, 640]}, {'type': 'ConvertPILImage', 'dtype': 'float32', 'scale': True}, {'type': 'ConvertBoxes', 'fmt': 'cxcywh', 'normalize': True}], 'policy': {'name': 'stop_epoch', 'epoch': 56, 'ops': ['RandomPhotometricDistort', 'RandomZoomOut', 'RandomIoUCrop']}}}, 'shuffle': True, 'num_workers': 0, 'drop_last': True, 'collate_fn': {'type': 'BatchImageCollateFunction', 'base_size': 640, 'base_size_repeat': 10, 'stop_epoch': 56, 'ema_restart_decay': 0.9999}, 'total_batch_size': 16}, 'val_dataloader': {'type': 'DataLoader', 'dataset': {'type': 'CocoDetection', 'img_folder': 'data/ce_marks/images/val', 'ann_file': 'data/ce_marks/annotations/instances_val.json', 'return_masks': False, 'transforms': {'type': 'Compose', 'ops': [{'type': 'Resize', 'size': [640, 640]}, {'type': 'ConvertPILImage', 'dtype': 'float32', 'scale': True}]}}, 'shuffle': False, 'num_workers': 0, 'drop_last': False, 'collate_fn': {'type': 'BatchImageCollateFunction'}, 'total_batch_size': 16}, 'print_freq': 100, 'output_dir': './output/dfine_hgnetv2_s_obj2custom', 'checkpoint_freq': 12, 'sync_bn': True, 'find_unused_parameters': False, 'use_amp': True, 'scaler': {'type': 'GradScaler', 'enabled': True}, 'use_ema': True, 'ema': {'type': 'ModelEMA', 'decay': 0.9999, 'warmups': 0, 'start': 0}, 'epoches': 64, 'clip_max_norm': 0.1, 'optimizer': {'type': 'AdamW', 'params': [{'params': '^(?=.*backbone)(?!.*norm|bn).*$', 'lr': 0.00025}, {'params': '^(?=.*backbone)(?=.*norm|bn).*$', 'lr': 0.00025, 'weight_decay': 0.0}, {'params': '^(?=.*(?:encoder|decoder))(?=.*(?:norm|bn|bias)).*$', 'weight_decay': 0.0}], 'lr': 0.005, 'betas': [0.9, 0.999], 'weight_decay': 0.000125}, 'lr_scheduler': {'type': 'CosineAnnealingLR', 'T_max': 64}, 'lr_warmup_scheduler': {'type': 'LinearWarmup', 'warmup_duration': 0}, 'model': 'DFINE', 'criterion': 'DFINECriterion', 'postprocessor': 'DFINEPostProcessor', 'use_focal_loss': True, 'eval_spatial_size': [640, 640], 'DFINE': {'backbone': 'HGNetv2', 'encoder': 'HybridEncoder', 'decoder': 'DFINETransformer'}, 'HGNetv2': {'pretrained': False, 
'local_model_dir': 'weight/hgnetv2/', 'name': 'B0', 'return_idx': [1, 2, 3], 'freeze_at': -1, 'freeze_norm': False, 'use_lab': True}, 'HybridEncoder': {'in_channels': [256, 512, 1024], 'feat_strides': [8, 16, 32], 'hidden_dim': 256, 'use_encoder_idx': [2], 'num_encoder_layers': 1, 'nhead': 8, 'dim_feedforward': 1024, 'dropout': 0.0, 'enc_act': 'gelu', 'expansion': 0.5, 'depth_mult': 0.34, 'act': 'silu'}, 'DFINETransformer': {'feat_channels': [256, 256, 256], 'feat_strides': [8, 16, 32], 'hidden_dim': 256, 'num_levels': 3, 'num_layers': 3, 'eval_idx': -1, 'num_queries': 300, 'num_denoising': 100, 'label_noise_ratio': 0.5, 'box_noise_scale': 1.0, 'reg_max': 32, 'reg_scale': 4, 'layer_scale': 1, 'num_points': [3, 6, 3], 'cross_attn_method': 'default', 'query_select_method': 'default'}, 'DFINEPostProcessor': {'num_top_queries': 300}, 'DFINECriterion': {'weight_dict': {'loss_vfl': 1, 'loss_bbox': 5, 'loss_giou': 2, 'loss_fgl': 0.15, 'loss_ddf': 1.5}, 'losses': ['vfl', 'boxes', 'local'], 'alpha': 0.75, 'gamma': 2.0, 'reg_max': 32, 'matcher': {'type': 'HungarianMatcher', 'weight_dict': {'cost_class': 2, 'cost_bbox': 5, 'cost_giou': 2}, 'alpha': 0.25, 'gamma': 2.0}}, '__include__': ['../../../dataset/custom_detection.yml', '../../../runtime.yml', '../.././include/dataloader.yml', '../.././include/optimizer.yml', '../.././include/dfine_hgnetv2.yml'], 'config': 'configs/dfine/custom/objects365/dfine_hgnetv2_s_obj2custom.yml', 'tuning': 'weight/hgnetv2/dfine_s_obj365.pth', 'seed': 0, 'test_only': False, 'print_method': 'builtin', 'print_rank': 0}}
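
Side note: `'use_amp': True` and the `GradScaler` are enabled above. A quick way to test whether AMP is involved would be to rerun a forward pass with autocast disabled; a minimal sketch, where `model` and `samples` are illustrative names:

    import torch

    # Minimal sketch: run a forward pass with mixed precision disabled to see
    # whether the NaNs are AMP-specific. `model` and `samples` are illustrative.
    with torch.autocast(device_type="cuda", enabled=False):
        outputs = model(samples)
        print(torch.isnan(outputs['pred_boxes']).any())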

Do you have any insights on why this might be happening?

DmitriiSavin commented 2 hours ago

It looks pretty similar to the following issue: https://github.com/Peterande/D-FINE/issues/3

Peterande commented 2 hours ago

> It looks pretty similar to the following issue: #3

You can try resuming from NaN.pth in debugging mode. Random numerical overflows under AMP are often unavoidable and can occur anywhere; you need to locate the operation where the overflow occurs and then clamp its output to the allowed range.
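
A minimal sketch of how the overflow could be located after resuming from NaN.pth, assuming a standard `nn.Module`; `register_nan_hooks` and the clamp bounds below are illustrative, not part of D-FINE:

    import torch

    # Register a forward hook on every submodule so the first module that
    # produces a NaN/Inf output raises immediately with its name.
    def register_nan_hooks(model):
        def make_hook(name):
            def hook(module, inputs, output):
                outs = output if isinstance(output, (tuple, list)) else (output,)
                for t in outs:
                    if torch.is_tensor(t) and (torch.isnan(t).any() or torch.isinf(t).any()):
                        raise RuntimeError(f"NaN/Inf first produced by module: {name}")
            return hook
        for name, module in model.named_modules():
            module.register_forward_hook(make_hook(name))

    # torch.autograd.set_detect_anomaly(True) does the same job for the backward
    # pass. Once the offending op is found, clamp its output to the
    # fp16-representable range used by AMP, e.g.:
    #   x = x.clamp(min=-65504.0, max=65504.0)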