erikwijmans / Pointnet2_PyTorch

PyTorch implementation of Pointnet2/Pointnet++
The Unlicense

Training stops with no warning nor error. #142

Open priceee opened 3 years ago

priceee commented 3 years ago

I'm running Python 3.7 and torch 1.6. Training stops with no warning or error; the only message is "Process finished with exit code 0". I've tried running it with "nohup", but it still stops after 20~30 epochs.

Part of the console output:

Epoch 00028: val_acc reached 0.90505 (best 0.90505), saving model to cls-ssg/epoch=28-val_loss=0.30-val_acc=0.905.ckpt as top 2
Epoch 30: 91%|▉| 350/385 [01:25<00:09, 3.68it/s, loss=0.198, train_acc=0.906,
Validating: 0%| | 0/78 [00:00<?, ?it/s]
Epoch 30: : 400it [01:27, 5.05it/s, loss=0.198, train_acc=0.906, v_num=42, val_acc=0.905, val_loss=0.302]
Epoch 30: : 450it [01:35, 5.95it/s, loss=0.193, train_acc=0.938, v_num=42, val_acc=0.903, val_loss=0.304]
[2021-04-23 14:40:17,513][root][INFO] - Epoch 00029: val_acc was not in top 2
Epoch 31: 91%|▉| 350/385 [01:25<00:09, 3.68it/s, loss=0.209, train_acc=0.875,
Validating: 0%| | 0/78 [00:00<?, ?it/s]
Epoch 31: : 400it [01:27, 5.04it/s, loss=0.209, train_acc=0.875, v_num=42, val_acc=0.903, val_loss=0.304]
Epoch 31: : 450it [01:35, 5.94it/s, loss=0.210, train_acc=0.969, v_num=42, val_acc=0.903, val_loss=0.303]
[2021-04-23 14:41:52,866][root][INFO] - Epoch 00030: val_acc was not in top 2
Epoch 31: : 450it [01:35, 4.72it/s, loss=0.210, train_acc=0.969, v_num=42, val_acc=0.903, val_loss=0.303]

Process finished with exit code 0

isabellahuang commented 3 years ago

Running into a similar issue with Python 3.6 and torch 1.4. Have you found a fix? I'm guessing it might be by design if your val_acc stops increasing (#135).

kdh2769 commented 2 years ago

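This code uses early stopping in 'train.py', line 31. Training ends cleanly (exit code 0) once the monitored validation metric has not improved for the configured number of epochs. To train longer, set the patience value to something greater than 5.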
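
To illustrate why the run can end silently, here is a minimal pure-Python sketch of patience-based early stopping. It is not the repository's code or the training framework's implementation: the function name `epochs_until_early_stop`, the `min_delta` parameter, and the accuracy values in `history` are all made up for illustration, and the exact stopping rule in 'train.py' may differ.

```python
def epochs_until_early_stop(val_accs, patience, min_delta=0.0):
    """Return the index of the epoch at which training would stop,
    or None if patience is never exhausted.

    Hypothetical helper: a simplified model of patience-based early
    stopping on a metric that should increase (e.g. val_acc).
    """
    best = float("-inf")
    wait = 0  # consecutive epochs without improvement
    for epoch, acc in enumerate(val_accs):
        if acc > best + min_delta:
            best = acc
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch  # training ends here, with no error raised
    return None


# Made-up validation accuracies that plateau around 0.905 for a while.
history = [0.880, 0.895, 0.905, 0.903, 0.903, 0.904, 0.902, 0.903, 0.906, 0.907]

print(epochs_until_early_stop(history, patience=5))   # 7
print(epochs_until_early_stop(history, patience=10))  # None
```

With a patience of 5 the run stops during the plateau and never reaches the later improvement; with a larger patience it trains through all epochs. Stopping this way is a normal exit, which would explain the absence of any warning or error.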