Thinklab-SJTU / ThinkMatch

A research protocol for deep graph matching.
https://thinkmatch.readthedocs.io/

Loss suddenly increased during training and the training process crashed at epoch 8 to 10 #10

Closed · zhiyuanyou closed this issue 4 years ago

zhiyuanyou commented 4 years ago

Hello! I am an SJTU student who has been following your work. I have read your two papers and tested your code on my PC. Thanks for your code and the great idea.

I ran "train_eval.py" with:

python train_eval.py --cfg experiments/vgg16_pca_voc.yaml

The configuration file 'vgg16_pca_voc.yaml' was not modified.

However, I have run into two problems. First, at epoch 5 to 8 the loss suddenly increases and never comes back down (screenshot of the training log omitted). I have tried 4 times, and the loss always blew up at epoch 5 to 8. The highest average eval accuracy was always about 0.6+.

Second, after the loss blew up, the training process itself crashed at epoch 8 to 10 (after the loss increase, not at the same time), with the following traceback:

Traceback (most recent call last):
  File "train_eval.py", line 228, in <module>
    start_epoch=cfg.TRAIN.START_EPOCH)
  File "train_eval.py", line 104, in train_eval_model
    loss = criterion(s_pred, perm_mat, n1_gt, n2_gt)
  File "/home/youzhiyuan/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/youzhiyuan/Desktop/GraphMatch/PCA-GM-master/utils/permutation_loss.py", line 18, in forward
    assert torch.all((pred_perm >= 0) * (pred_perm <= 1))
AssertionError

I have run into this problem 3 times. The highest average eval accuracy was also about 0.6+.
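For context, the assertion at line 18 of utils/permutation_loss.py checks that every entry of the predicted matching matrix lies in [0, 1] before the cross-entropy loss is computed, so NaN or out-of-range values produced after the loss blows up will trip it. Below is a minimal sketch of that kind of loss, assuming a binary cross-entropy over the doubly-stochastic prediction; the class name, tensor names, and masking are assumptions, not the repository's exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PermutationCrossEntropy(nn.Module):
    """Sketch of a cross-entropy loss over a predicted doubly-stochastic matrix.

    Hypothetical names; not the exact implementation in utils/permutation_loss.py.
    """
    def forward(self, pred_perm, gt_perm, src_ns, tgt_ns):
        # NaN/Inf or values outside [0, 1] (e.g. after the loss blows up)
        # make this assertion fail and crash training, as in the traceback above.
        assert torch.all((pred_perm >= 0) * (pred_perm <= 1))

        loss = pred_perm.new_zeros(())
        n_sum = 0
        for b in range(pred_perm.shape[0]):
            # Only the valid (non-padded) rows/columns of each graph pair contribute.
            loss = loss + F.binary_cross_entropy(
                pred_perm[b, :src_ns[b], :tgt_ns[b]],
                gt_perm[b, :src_ns[b], :tgt_ns[b]],
                reduction='sum')
            n_sum += int(src_ns[b])
        return loss / n_sum
```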

My PC configuration:

- Ubuntu 16.04
- torch 1.2.0+cu92
- torchvision 0.4.0+cu92

Thanks for your time and help in advance.

rogerwwww commented 4 years ago

Hi, thank you for your interest! Your problem seems similar to https://github.com/Thinklab-SJTU/PCA-GM/issues/4

Since the accuracy before the loss increase is moderately good, I think it is fine.
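If you want to keep the accuracy reached before the loss blows up, one simple option (not part of the original training script; the helper and file name below are hypothetical) is to save a checkpoint whenever the evaluation accuracy improves and fall back to it afterwards:

```python
import torch

def save_if_best(model, eval_acc, best_acc, path="best_model.pt"):
    """Hypothetical helper: checkpoint the model when eval accuracy improves."""
    if eval_acc > best_acc:
        torch.save(model.state_dict(), path)
        return eval_acc
    return best_acc

# Sketch of use inside the training loop:
# best_acc = 0.0
# for epoch in range(num_epochs):
#     train_one_epoch(...)
#     acc = evaluate(...)
#     best_acc = save_if_best(model, acc, best_acc)
# model.load_state_dict(torch.load("best_model.pt"))  # restore the best weights
```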

zhiyuanyou commented 4 years ago

Thanks for your quick response. I have seen issue #4. However, what about the crash of the training process? Is it related to the loss increase?

rogerwwww commented 4 years ago

Yes, there is.

In my experience, a loss increase is usually followed by a training crash. It is probably caused by unstable gradients that lead to both issues, but I haven't studied it thoroughly.
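If the instability does come from exploding gradients, a common mitigation (not part of the original training loop; the surrounding variable names are assumptions) is to clip the gradient norm before the optimizer step:

```python
import torch

# Sketch of one training step with gradient clipping; `model`, `criterion`,
# `optimizer`, and the batch tensors stand in for the real training loop.
def training_step(model, criterion, optimizer, inputs, perm_mat, n1_gt, n2_gt,
                  max_norm=1.0):
    optimizer.zero_grad()
    s_pred = model(*inputs)
    loss = criterion(s_pred, perm_mat, n1_gt, n2_gt)
    loss.backward()
    # Rescale gradients so their global L2 norm is at most max_norm, which
    # often prevents a sudden loss blow-up from a few runaway updates.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()
```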

zhiyuanyou commented 4 years ago

Thanks very much for your response.