Closed — opened by @1286710929, closed 1 year ago
You can directly run 'python train_distributed.py' for the single-GPU case.
Thank you for your useful suggestion! We were able to run the code successfully. However, during training the loss does not decrease:
2022-09-16 16:20:56,728 - CLTR - INFO - best result = 100000.000
2022-09-16 16:20:56,760 - CLTR - INFO - best result=100000.000 start epoch=0.000
2022-09-16 16:20:56,760 - CLTR - INFO - start training!
2022-09-16 16:21:47,022 - CLTR - INFO - Training Epoch:[0/1500] loss=10.44413 lr=0.000100 epoch_time=50.262
2022-09-16 16:21:47,023 - CLTR - INFO - begin test
mae 296.86 mse 734.5822894679669
2022-09-16 16:22:27,807 - CLTR - INFO - Testing Epoch:[0/1500] mae=296.860 mse=734.582 best_mae=296.860
2022-09-16 16:23:17,224 - CLTR - INFO - Training Epoch:[1/1500] loss=6.85613 lr=0.000100 epoch_time=49.417
2022-09-16 16:24:05,478 - CLTR - INFO - Training Epoch:[2/1500] loss=7.91702 lr=0.000100 epoch_time=48.254
2022-09-16 16:24:53,966 - CLTR - INFO - Training Epoch:[3/1500] loss=7.74740 lr=0.000100 epoch_time=48.487
2022-09-16 16:25:41,476 - CLTR - INFO - Training Epoch:[4/1500] loss=7.69251 lr=0.000100 epoch_time=47.509
2022-09-16 16:26:32,635 - CLTR - INFO - Training Epoch:[5/1500] loss=7.63444 lr=0.000100 epoch_time=51.157
2022-09-16 16:27:26,108 - CLTR - INFO - Training Epoch:[6/1500] loss=7.57381 lr=0.000100 epoch_time=53.472
2022-09-16 16:28:20,373 - CLTR - INFO - Training Epoch:[7/1500] loss=7.68320 lr=0.000100 epoch_time=54.263
2022-09-16 16:29:14,253 - CLTR - INFO - Training Epoch:[8/1500] loss=7.71494 lr=0.000100 epoch_time=53.878
2022-09-16 16:30:05,044 - CLTR - INFO - Training Epoch:[9/1500] loss=7.88340 lr=0.000100 epoch_time=50.789
2022-09-16 16:30:58,653 - CLTR - INFO - Training Epoch:[10/1500] loss=7.63586 lr=0.000100 epoch_time=53.607
2022-09-16 16:31:52,705 - CLTR - INFO - Training Epoch:[11/1500] loss=7.76085 lr=0.000100 epoch_time=54.051
2022-09-16 16:32:47,791 - CLTR - INFO - Training Epoch:[12/1500] loss=7.70076 lr=0.000100 epoch_time=55.085
2022-09-16 16:33:41,799 - CLTR - INFO - Training Epoch:[13/1500] loss=7.67467 lr=0.000100 epoch_time=54.007
2022-09-16 16:34:34,340 - CLTR - INFO - Training Epoch:[14/1500] loss=7.92861 lr=0.000100 epoch_time=52.539
2022-09-16 16:35:24,587 - CLTR - INFO - Training Epoch:[15/1500] loss=7.60968 lr=0.000100 epoch_time=50.246
2022-09-16 16:36:17,026 - CLTR - INFO - Training Epoch:[16/1500] loss=7.71725 lr=0.000100 epoch_time=52.437
2022-09-16 16:37:08,004 - CLTR - INFO - Training Epoch:[17/1500] loss=7.67870 lr=0.000100 epoch_time=50.977
2022-09-16 16:37:59,972 - CLTR - INFO - Training Epoch:[18/1500] loss=7.54054 lr=0.000100 epoch_time=51.966
2022-09-16 16:38:58,893 - CLTR - INFO - Training Epoch:[19/1500] loss=7.73937 lr=0.000100 epoch_time=58.919
2022-09-16 16:39:59,276 - CLTR - INFO - Training Epoch:[20/1500] loss=7.64956 lr=0.000100 epoch_time=60.381
2022-09-16 16:39:59,278 - CLTR - INFO - begin test
mae 296.86 mse 734.5822894679669
2022-09-16 16:40:53,217 - CLTR - INFO - Testing Epoch:[20/1500] mae=296.860 mse=734.582 best_mae=296.860
2022-09-16 16:41:51,895 - CLTR - INFO - Training Epoch:[21/1500] loss=7.61158 lr=0.000100 epoch_time=58.677
2022-09-16 16:42:45,073 - CLTR - INFO - Training Epoch:[22/1500] loss=7.51920 lr=0.000100 epoch_time=53.177
2022-09-16 16:43:43,006 - CLTR - INFO - Training Epoch:[23/1500] loss=7.72642 lr=0.000100 epoch_time=57.931
2022-09-16 16:44:33,340 - CLTR - INFO - Training Epoch:[24/1500] loss=7.64706 lr=0.000100 epoch_time=50.334
2022-09-16 16:45:25,039 - CLTR - INFO - Training Epoch:[25/1500] loss=7.62301 lr=0.000100 epoch_time=51.697
2022-09-16 16:46:17,073 - CLTR - INFO - Training Epoch:[26/1500] loss=7.59961 lr=0.000100 epoch_time=52.033
2022-09-16 16:47:05,587 - CLTR - INFO - Training Epoch:[27/1500] loss=7.59965 lr=0.000100 epoch_time=48.513
2022-09-16 16:47:57,769 - CLTR - INFO - Training Epoch:[28/1500] loss=7.58723 lr=0.000100 epoch_time=52.180
2022-09-16 16:48:48,925 - CLTR - INFO - Training Epoch:[29/1500] loss=7.55728 lr=0.000100 epoch_time=51.155
2022-09-16 16:49:43,559 - CLTR - INFO - Training Epoch:[30/1500] loss=7.46936 lr=0.000100 epoch_time=54.475
2022-09-16 16:50:36,111 - CLTR - INFO - Training Epoch:[31/1500] loss=7.35466 lr=0.000100 epoch_time=52.551
2022-09-16 16:51:27,748 - CLTR - INFO - Training Epoch:[32/1500] loss=7.36469 lr=0.000100 epoch_time=51.636
2022-09-16 16:52:22,085 - CLTR - INFO - Training Epoch:[33/1500] loss=7.30044 lr=0.000100 epoch_time=54.335
2022-09-16 16:53:16,061 - CLTR - INFO - Training Epoch:[34/1500] loss=7.33537 lr=0.000100 epoch_time=53.974
2022-09-16 16:54:08,333 - CLTR - INFO - Training Epoch:[35/1500] loss=7.22549 lr=0.000100 epoch_time=52.271
2022-09-16 16:55:07,364 - CLTR - INFO - Training Epoch:[36/1500] loss=7.21319 lr=0.000100 epoch_time=59.027
2022-09-16 16:56:00,972 - CLTR - INFO - Training Epoch:[37/1500] loss=7.21706 lr=0.000100 epoch_time=53.606
2022-09-16 16:56:53,343 - CLTR - INFO - Training Epoch:[38/1500] loss=7.33765 lr=0.000100 epoch_time=52.370
2022-09-16 16:57:46,267 - CLTR - INFO - Training Epoch:[39/1500] loss=7.19941 lr=0.000100 epoch_time=52.923
2022-09-16 16:58:38,003 - CLTR - INFO - Training Epoch:[40/1500] loss=7.18334 lr=0.000100 epoch_time=51.735
2022-09-16 16:58:38,003 - CLTR - INFO - begin test
mae 296.86 mse 734.5822894679669
2022-09-16 16:59:45,887 - CLTR - INFO - Testing Epoch:[40/1500] mae=296.860 mse=734.582 best_mae=296.860
@1286710929 Hi, I have the same problem as you. Have you solved it?
No, I haven't been able to solve it.
@dk-liang @1286710929 @ljc108 I also meet the same problem: the MAE is identical every epoch because the model cannot detect any head points. Is there any way to solve this?
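That observation would explain the frozen metric: crowd-counting MAE/MSE are computed over per-image counts, so if the model predicts zero points on every test image, MAE collapses to the mean ground-truth count and never changes between epochs. A minimal sketch of the metric (the ground-truth counts below are made up for illustration, not taken from JHU-Crowd):

```python
import math

def count_metrics(pred_counts, gt_counts):
    """MAE and (root) MSE over per-image crowd counts, as commonly
    reported in crowd-counting papers."""
    errs = [p - g for p, g in zip(pred_counts, gt_counts)]
    mae = sum(abs(e) for e in errs) / len(errs)
    mse = math.sqrt(sum(e * e for e in errs) / len(errs))
    return mae, mse

# Hypothetical ground-truth counts; if the model detects no head points,
# every predicted count is 0 and MAE equals the mean ground-truth count.
gt = [100, 300, 500]
mae, mse = count_metrics([0, 0, 0], gt)
print(mae)  # 300.0, i.e. mean(gt)
```

So a constant mae/mse across epochs (like the 296.86 above) is consistent with the detector emitting no points at all, rather than with a metric bug.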
Thank you very much for your wonderful work! When we run your released code with sh jhu.sh, the script gets stuck. We have only one GPU and set jhu.sh to "--nproc_per_node=1 --master_port 5228 train_distributed.py --gpu_id '0'". The output is:
2022-09-16 12:59:31,334 - CLTR - INFO - => no checkpoint found at 'None' best result: 100000.0
2022-09-16 12:59:31,335 - CLTR - INFO - best result = 100000.000
2022-09-16 12:59:31,364 - CLTR - INFO - best result=100000.000 start epoch=0.000
2022-09-16 12:59:31,364 - CLTR - INFO - start training!
and then it hangs there.
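One common cause of a hang right after "start training!" in a torch.distributed launch is that the rendezvous port is already occupied, so the process group never finishes initializing. A quick stdlib check (assuming port 5228 from the jhu.sh command above; the helper name is my own, not part of CLTR):

```python
import socket

def port_is_free(port, host="127.0.0.1"):
    """Return True if the given TCP port can be bound, i.e. is not
    currently held by another listening process."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

# If this prints False, pick a different --master_port for the launch.
print(port_is_free(5228))
```

If the port is free and it still hangs, the stall is more likely inside the data loader or the first forward pass than in the distributed setup itself.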