feifeiobama / OrthogonalDet

[CVPR 2024] Exploring Orthogonality in Open World Object Detection

Loss is NaN during training #9

Closed pgh2874 closed 1 month ago

pgh2874 commented 2 months ago

Thank you for your great work! However, I got stuck on a problem while reproducing the method: the loss became NaN during the Task 3 fine-tuning session.

A part of the log containing the error message is appended below:

FloatingPointError: Loss became infinite or NaN at iteration=58098!
loss_dict = {
'loss_ce': 0.13315928354859352, 'loss_bbox': 0.08427525404840708, 'loss_giou': 0.2164861522614956, 'loss_nc_ce': 0.11209869757294655, 'loss_decorr': 0.0036641834885813296,
'loss_ce_0': nan, 'loss_bbox_0': 0.2516929730772972, 'loss_giou_0': 0.534403569996357, 'loss_nc_ce_0': nan, 'loss_decorr_0': nan,
'loss_ce_1': 0.27496881783008575, 'loss_bbox_1': 0.1268935166299343, 'loss_giou_1': 0.31266121566295624, 'loss_nc_ce_1': 0.08298327401280403, 'loss_decorr_1': 0.008526564692147076,
'loss_ce_2': 0.22203907370567322, 'loss_bbox_2': 0.10792248137295246, 'loss_giou_2': 0.269552618265152, 'loss_nc_ce_2': 0.09135140851140022, 'loss_decorr_2': 0.0066769670229405165,
'loss_ce_3': 0.1459853444248438, 'loss_bbox_3': 0.10060846898704767, 'loss_giou_3': 0.2483709417283535, 'loss_nc_ce_3': 0.09946945682168007, 'loss_decorr_3': 0.010072911740280688,
'loss_ce_4': 0.13418008014559746, 'loss_bbox_4': 0.08722878061234951, 'loss_giou_4': 0.22286804765462875, 'loss_nc_ce_4': 0.09788228385150433, 'loss_decorr_4': 0.0072590073104947805}

How can I handle this?

Thanks in advance

feifeiobama commented 2 months ago

Please try restarting from the checkpoint saved at the end of Task 2 (and remember to modify the last_checkpoint file in the output directory so that it points to this checkpoint).
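
In case it helps, here is a minimal sketch of how one could rewrite that file; the directory and checkpoint filenames below are hypothetical and should be replaced with the ones in your own output folder. The assumption is the usual Detectron2 behavior, where the checkpointer resumes from whichever filename is written in the plain-text last_checkpoint file.

```python
# Minimal sketch (paths and filenames are hypothetical -- adapt them to your run).
# Detectron2-style checkpointers resume from the checkpoint whose filename is
# stored in the plain-text file <output_dir>/last_checkpoint.
from pathlib import Path

output_dir = Path("output/t3_ft")        # hypothetical Task 3 fine-tuning output dir
task2_ckpt = "model_final_task2.pth"     # hypothetical checkpoint saved at the end of Task 2

# Overwrite last_checkpoint so the next resume picks up the Task 2 weights.
(output_dir / "last_checkpoint").write_text(task2_ckpt)
print((output_dir / "last_checkpoint").read_text())
```

The referenced checkpoint file must exist in (or be reachable from) that output directory; otherwise the trainer will fall back to starting from scratch.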

pgh2874 commented 1 month ago

Sorry for the delayed reply. As you suggested, I restarted from the Task 2 checkpoint and it works now.

Thank you!!