fudan-zvg / meta-prompts


NaN loss halfway through training the model. #5

Open LT1st opened 6 months ago

LT1st commented 6 months ago

I am using your framework on an image translation task.

The loss was fine at the very beginning but turned into NaN in epoch 006. Have you run into this problem before?

The log information:

Epoch: 005 - 025
Epoch: [5][0/1250]      Loss: 0.5269870162010193, LR: 0.00020489077162409578
Epoch: [5][800/1250]    Loss: 0.44976134161080017, LR: 0.00022987952270145035
Epoch: [5][900/1250]    Loss: 0.44960858015453115, LR: 0.00023297791072928454
Epoch: [5][1000/1250]   Loss: 0.4489031859657743, LR: 0.00023607105232487043
Epoch: [5][1100/1250]   Loss: 0.45127786015186605, LR: 0.00023915904358565203
Epoch: [5][1200/1250]   Loss: 0.45215647128549447, LR: 0.00024224197730360856

Epoch: 005 - 025
====================================================================================================
        d1         d2         d3    abs_rel     sq_rel       rmse   rmse_log      log10      silog 
    0.2776     0.6083     0.7993     0.8979    77.8871    60.1445     0.6755     0.2110     0.6634 
====================================================================================================

Epoch: 006 - 025
Epoch: [6][0/1250]      Loss: 0.4362526535987854, LR: 0.00024378157571854745
Epoch: [6][100/1250]    Loss: nan, LR: 0.0002468570902802666
Epoch: [6][200/1250]    Loss: nan, LR: 0.0002499277658752283
Epoch: [6][900/1250]    Loss: nan, LR: 0.00027129361225600624
Epoch: [6][1000/1250]   Loss: nan, LR: 0.0002743283372183186
Epoch: [6][1100/1250]   Loss: nan, LR: 0.00027735887967590467
Epoch: [6][1200/1250]   Loss: nan, LR: 0.0002803853021768897

The output shows:

NaN or Inf found in input tensor.

====================================================================================================
        d1         d2         d3    abs_rel     sq_rel       rmse   rmse_log      log10      silog 
    0.0000     0.0000     0.0000     0.9861   131.8554   146.8084    11.4800     4.9767     8.1322 
====================================================================================================

Epoch: 009 - 025
Epoch: [9][0/1250]      Loss: nan, LR: 0.0003563283532129068
Epoch: [9][1200/1250]   Loss: nan, LR: 0.00039136562872899835
NaN or Inf found in input tensor.
nan
nan
nan
nan
nan

Something seems to be wrong, but the inputs are fine at the beginning.
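
Before touching hyperparameters, it can help to find where the first NaN appears. Below is a minimal sketch, assuming placeholder names `model`, `criterion`, and `optimizer` for whatever the framework builds (they are not from this repo):

    import torch

    # Make autograd raise an error at the op that first produces NaN/Inf
    # in the backward pass (slow; enable only while debugging).
    torch.autograd.set_detect_anomaly(True)

    for i, x in enumerate(dataloader):
        pred = model(x['image'])            # placeholder forward pass
        loss = criterion(pred, x['depth'])  # placeholder loss
        if not torch.isfinite(loss):
            raise RuntimeError(f'Non-finite loss at batch {i}: {loss.item()}')
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()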

LT1st commented 6 months ago

I've just double-checked the dataset, and it's fine.

    from torch.utils.data import DataLoader

    dataloader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=2)
    for i, x in enumerate(dataloader):
        # Print the shape of each image/depth batch to confirm the data loads
        print(f'Batch {i}:')
        print(x['image'].shape, x['depth'].shape)
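
Printing shapes alone won't catch bad values, though. A quick follow-up sketch (same assumptions: `dataset` yields dicts with 'image' and 'depth' tensors) that scans every batch for NaN/Inf:

    import torch
    from torch.utils.data import DataLoader

    loader = DataLoader(dataset, batch_size=4, num_workers=2)
    for i, x in enumerate(loader):
        for key in ('image', 'depth'):
            # Flag any batch containing NaN or Inf values
            if not torch.isfinite(x[key]).all():
                print(f'Non-finite values in batch {i}, key {key!r}')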
wwqq commented 6 months ago

Reducing the learning rate or increasing the weight decay might solve this problem, but it could also impact performance.
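
A minimal sketch of that suggestion, assuming an AdamW optimizer (the repo's actual optimizer, schedule, and values may differ; the numbers below are illustrative). Gradient clipping is added as a common companion safeguard, not something mentioned above:

    import torch

    # Halve the peak LR relative to the failing run (~4e-4) and raise weight decay.
    # Illustrative values, not the repo's defaults.
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.05)

    # Inside the training step, clip gradients between backward() and step():
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()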