The training process was stopped

wheat-yjy commented 4 years ago

Dear XiaLiPKU, I cloned your code and followed the steps to train on my server, but the log shows only three lines:

2020-03-13 21:29:17,166 - INFO - set log dir as ./logdir
2020-03-13 21:29:17,166 - INFO - set model dir as ./models
2020-03-13 21:29:19,127 - ERROR - No checkpoint ./models/latest.pth!

I know that error does not affect training, but no models were saved in ./models, and when I ran "sh tensorboard.sh" nothing showed up. It seems that the training process has stopped. The only change I made was replacing obj.cuda(async=True) with obj.cuda(non_blocking=True); I did not touch any other code. Could you help me?

Thanks!
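
A note on the cuda(async=True) replacement mentioned above: since Python 3.7, async is a reserved keyword, so the old call no longer parses at all, and PyTorch renamed the argument to non_blocking. The sketch below only checks the syntax of the two spellings with the standard library; the helper name parses is made up for illustration, and it assumes a Python 3.7+ interpreter. It says nothing about why training stalls, only that the rename itself is a required syntax fix.

```python
import ast


def parses(source: str) -> bool:
    """Return True if `source` is syntactically valid for this interpreter."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


# The two spellings from the post. Nothing is executed here, only parsed,
# so `obj` does not need to exist and PyTorch does not need to be installed.
OLD_CALL = "obj.cuda(async=True)"
NEW_CALL = "obj.cuda(non_blocking=True)"

if __name__ == "__main__":
    print("old spelling parses:", parses(OLD_CALL))
    print("new spelling parses:", parses(NEW_CALL))
```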

wheat-yjy commented 4 years ago

I added two print lines, but when I run the code I only see "start"; the loss is never printed.

print("start")
loss = sess.train_batch(image, label)
print(loss)
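
The print statements above show that the call to sess.train_batch never returns, but not where inside it the process is stuck. One possible way to find out, sketched here with Python's standard faulthandler module, is to ask the interpreter to dump every thread's stack if a step runs too long. The helper name hang_watchdog and the 60-second timeout are assumptions for illustration and are not part of the EMANet code.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def hang_watchdog(timeout_s=60.0, file=None):
    """Dump the stack of every thread if the wrapped block exceeds timeout_s.

    `file` must be backed by a real file descriptor; it defaults to stderr.
    The process is not killed, the traceback is only written out.
    """
    target = sys.stderr if file is None else file
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=target)
    try:
        yield
    finally:
        # Disarm the timer once the block finishes (or raises).
        faulthandler.cancel_dump_traceback_later()


# Hypothetical usage around the line from the post:
#
#     print("start")
#     with hang_watchdog(60):
#         loss = sess.train_batch(image, label)
#     print(loss)
```

If the step hangs, the dumped traceback shows the innermost Python frame that is still running, which narrows down whether the stall is in data loading, the forward pass, or somewhere else.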

XiaLiPKU commented 4 years ago

> I added two print lines, but when I run the code I only see "start"; the loss is never printed.
>
> print("start")
> loss = sess.train_batch(image, label)
> print(loss)

Maybe this issue can help you. https://github.com/XiaLiPKU/EMANet/issues/12
