DIUx-xView / xview3-reference

Reference data processing code and model for the xView3 prize challenge.

Loss is NaN, training stopped. #18

Open A-n-o-r-a-k opened 10 months ago

A-n-o-r-a-k commented 10 months ago

Working my way through the code, I ran into a breaking error when running the following:

# Train the model for three epochs
for epoch in range(num_epochs):
    # train for one epoch, printing every iteration
    train_one_epoch(model, optimizer, data_loader_train, device, epoch, print_freq=10)
    # update the learning rate
    lr_scheduler.step()
    # evaluate on the validation dataset
    evaluate(model, data_loader_val, device=device)

    checkpoint_path = f'trained_model_{epoch+1}_epochs.pth'
    torch.save(model.state_dict(), checkpoint_path)

It produces the following error:

Epoch: [0] [ 0/65] eta: 0:01:55 lr: 0.000125 loss: 2.9749 (2.9749) loss_classifier: 1.1639 (1.1639) loss_box_reg: 0.0148 (0.0148) loss_objectness: 1.5950 (1.5950) loss_rpn_box_reg: 0.2011 (0.2011) time: 1.7780 data: 0.4853 max mem: 5038
Loss is nan, stopping training
{'loss_classifier': tensor(1.3285, device='cuda:0', grad_fn=), 'loss_box_reg': tensor(0.0082, device='cuda:0', grad_fn=), 'loss_objectness': tensor(nan, device='cuda:0', grad_fn=), 'loss_rpn_box_reg': tensor(0.1605, device='cuda:0', grad_fn=)}

An exception has occurred, use %tb to see the full traceback.

SystemExit: 1

/home/q/anaconda3/envs/xview3/lib/python3.9/site-packages/IPython/core/interactiveshell.py:3556: UserWarning: To exit: use 'exit', 'quit', or Ctrl-D.
  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)
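
A possible way to narrow this down, offered as a sketch and not a confirmed fix: the log shows that only loss_objectness is nan, and the SystemExit appears to come from the finite-loss check that prints "Loss is nan, stopping training". NaN losses in detection training are often traced to non-finite pixel values or degenerate boxes in the inputs, though nothing in this thread confirms that is the cause here. The helpers below check for both. The names find_bad_boxes and count_non_finite are hypothetical and not part of xview3-reference, and the code assumes boxes are in [x1, y1, x2, y2] format.

```python
import math


def find_bad_boxes(boxes):
    """Return indices of boxes that are non-finite or have non-positive width/height.

    `boxes` is a sequence of [x1, y1, x2, y2] rows. For a torch tensor,
    pass something like target["boxes"].tolist() (assumed target layout).
    """
    bad = []
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        finite = all(math.isfinite(v) for v in (x1, y1, x2, y2))
        if not finite or x2 <= x1 or y2 <= y1:
            bad.append(i)
    return bad


def count_non_finite(values):
    """Count NaN/Inf entries in a flat sequence of pixel values.

    For a torch image tensor, pass something like image.flatten().tolist().
    """
    return sum(1 for v in values if not math.isfinite(v))


# Small self-contained demonstration with made-up values.
boxes = [
    [10.0, 10.0, 20.0, 20.0],      # valid box
    [5.0, 5.0, 5.0, 9.0],          # zero width
    [0.0, 0.0, float("nan"), 4.0], # NaN coordinate
]
print(find_bad_boxes(boxes))                                      # [1, 2]
print(count_non_finite([0.1, float("nan"), float("inf"), -3.0]))  # 2
```

Running these checks over a few batches from data_loader_train before calling train_one_epoch would show whether any sample carries non-finite pixels or invalid boxes. If everything comes back clean, the input data is probably not the source of the nan.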

goodgoodstudy2322 commented 3 months ago

I had the same problem. Did you solve it?