Closed — opened by @1286710929, closed 1 year ago
You can directly run 'python train_distributed.py' for the single-GPU case.
Thank you for your useful suggestion! We were able to run the code successfully. However, during training the loss does not decrease:
2022-09-16 16:20:56,728 - CLTR - INFO - best result = 100000.000
2022-09-16 16:20:56,760 - CLTR - INFO - best result=100000.000 start epoch=0.000
2022-09-16 16:20:56,760 - CLTR - INFO - start training!
2022-09-16 16:21:47,022 - CLTR - INFO - Training Epoch:[0/1500] loss=10.44413 lr=0.000100 epoch_time=50.262
2022-09-16 16:21:47,023 - CLTR - INFO - begin test
mae 296.86 mse 734.5822894679669
2022-09-16 16:22:27,807 - CLTR - INFO - Testing Epoch:[0/1500] mae=296.860 mse=734.582 best_mae=296.860
2022-09-16 16:23:17,224 - CLTR - INFO - Training Epoch:[1/1500] loss=6.85613 lr=0.000100 epoch_time=49.417
2022-09-16 16:24:05,478 - CLTR - INFO - Training Epoch:[2/1500] loss=7.91702 lr=0.000100 epoch_time=48.254
2022-09-16 16:24:53,966 - CLTR - INFO - Training Epoch:[3/1500] loss=7.74740 lr=0.000100 epoch_time=48.487
2022-09-16 16:25:41,476 - CLTR - INFO - Training Epoch:[4/1500] loss=7.69251 lr=0.000100 epoch_time=47.509
2022-09-16 16:26:32,635 - CLTR - INFO - Training Epoch:[5/1500] loss=7.63444 lr=0.000100 epoch_time=51.157
2022-09-16 16:27:26,108 - CLTR - INFO - Training Epoch:[6/1500] loss=7.57381 lr=0.000100 epoch_time=53.472
2022-09-16 16:28:20,373 - CLTR - INFO - Training Epoch:[7/1500] loss=7.68320 lr=0.000100 epoch_time=54.263
2022-09-16 16:29:14,253 - CLTR - INFO - Training Epoch:[8/1500] loss=7.71494 lr=0.000100 epoch_time=53.878
2022-09-16 16:30:05,044 - CLTR - INFO - Training Epoch:[9/1500] loss=7.88340 lr=0.000100 epoch_time=50.789
2022-09-16 16:30:58,653 - CLTR - INFO - Training Epoch:[10/1500] loss=7.63586 lr=0.000100 epoch_time=53.607
2022-09-16 16:31:52,705 - CLTR - INFO - Training Epoch:[11/1500] loss=7.76085 lr=0.000100 epoch_time=54.051
2022-09-16 16:32:47,791 - CLTR - INFO - Training Epoch:[12/1500] loss=7.70076 lr=0.000100 epoch_time=55.085
2022-09-16 16:33:41,799 - CLTR - INFO - Training Epoch:[13/1500] loss=7.67467 lr=0.000100 epoch_time=54.007
2022-09-16 16:34:34,340 - CLTR - INFO - Training Epoch:[14/1500] loss=7.92861 lr=0.000100 epoch_time=52.539
2022-09-16 16:35:24,587 - CLTR - INFO - Training Epoch:[15/1500] loss=7.60968 lr=0.000100 epoch_time=50.246
2022-09-16 16:36:17,026 - CLTR - INFO - Training Epoch:[16/1500] loss=7.71725 lr=0.000100 epoch_time=52.437
2022-09-16 16:37:08,004 - CLTR - INFO - Training Epoch:[17/1500] loss=7.67870 lr=0.000100 epoch_time=50.977
2022-09-16 16:37:59,972 - CLTR - INFO - Training Epoch:[18/1500] loss=7.54054 lr=0.000100 epoch_time=51.966
2022-09-16 16:38:58,893 - CLTR - INFO - Training Epoch:[19/1500] loss=7.73937 lr=0.000100 epoch_time=58.919
2022-09-16 16:39:59,276 - CLTR - INFO - Training Epoch:[20/1500] loss=7.64956 lr=0.000100 epoch_time=60.381
2022-09-16 16:39:59,278 - CLTR - INFO - begin test
mae 296.86 mse 734.5822894679669
2022-09-16 16:40:53,217 - CLTR - INFO - Testing Epoch:[20/1500] mae=296.860 mse=734.582 best_mae=296.860
2022-09-16 16:41:51,895 - CLTR - INFO - Training Epoch:[21/1500] loss=7.61158 lr=0.000100 epoch_time=58.677
2022-09-16 16:42:45,073 - CLTR - INFO - Training Epoch:[22/1500] loss=7.51920 lr=0.000100 epoch_time=53.177
2022-09-16 16:43:43,006 - CLTR - INFO - Training Epoch:[23/1500] loss=7.72642 lr=0.000100 epoch_time=57.931
2022-09-16 16:44:33,340 - CLTR - INFO - Training Epoch:[24/1500] loss=7.64706 lr=0.000100 epoch_time=50.334
2022-09-16 16:45:25,039 - CLTR - INFO - Training Epoch:[25/1500] loss=7.62301 lr=0.000100 epoch_time=51.697
2022-09-16 16:46:17,073 - CLTR - INFO - Training Epoch:[26/1500] loss=7.59961 lr=0.000100 epoch_time=52.033
2022-09-16 16:47:05,587 - CLTR - INFO - Training Epoch:[27/1500] loss=7.59965 lr=0.000100 epoch_time=48.513
2022-09-16 16:47:57,769 - CLTR - INFO - Training Epoch:[28/1500] loss=7.58723 lr=0.000100 epoch_time=52.180
2022-09-16 16:48:48,925 - CLTR - INFO - Training Epoch:[29/1500] loss=7.55728 lr=0.000100 epoch_time=51.155
2022-09-16 16:49:43,559 - CLTR - INFO - Training Epoch:[30/1500] loss=7.46936 lr=0.000100 epoch_time=54.475
2022-09-16 16:50:36,111 - CLTR - INFO - Training Epoch:[31/1500] loss=7.35466 lr=0.000100 epoch_time=52.551
2022-09-16 16:51:27,748 - CLTR - INFO - Training Epoch:[32/1500] loss=7.36469 lr=0.000100 epoch_time=51.636
2022-09-16 16:52:22,085 - CLTR - INFO - Training Epoch:[33/1500] loss=7.30044 lr=0.000100 epoch_time=54.335
2022-09-16 16:53:16,061 - CLTR - INFO - Training Epoch:[34/1500] loss=7.33537 lr=0.000100 epoch_time=53.974
2022-09-16 16:54:08,333 - CLTR - INFO - Training Epoch:[35/1500] loss=7.22549 lr=0.000100 epoch_time=52.271
2022-09-16 16:55:07,364 - CLTR - INFO - Training Epoch:[36/1500] loss=7.21319 lr=0.000100 epoch_time=59.027
2022-09-16 16:56:00,972 - CLTR - INFO - Training Epoch:[37/1500] loss=7.21706 lr=0.000100 epoch_time=53.606
2022-09-16 16:56:53,343 - CLTR - INFO - Training Epoch:[38/1500] loss=7.33765 lr=0.000100 epoch_time=52.370
2022-09-16 16:57:46,267 - CLTR - INFO - Training Epoch:[39/1500] loss=7.19941 lr=0.000100 epoch_time=52.923
2022-09-16 16:58:38,003 - CLTR - INFO - Training Epoch:[40/1500] loss=7.18334 lr=0.000100 epoch_time=51.735
2022-09-16 16:58:38,003 - CLTR - INFO - begin test
mae 296.86 mse 734.5822894679669
2022-09-16 16:59:45,887 - CLTR - INFO - Testing Epoch:[40/1500] mae=296.860 mse=734.582 best_mae=296.860
@1286710929 Hi, I have the same problem as you. Have you solved it?
No, I haven't been able to solve it.
@dk-liang @1286710929 @ljc108 I also meet the same problem: the MAE is identical every epoch because the model cannot detect any head points. Is there any way to solve this?
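That observation would explain the frozen metric: crowd-counting MAE/MSE are computed over per-image counts, so if the model predicts zero points on every test image, MAE collapses to the mean ground-truth count and never changes between epochs. A minimal sketch of the metric (the ground-truth counts below are made up for illustration, not taken from JHU-Crowd):

```python
import math

def count_metrics(pred_counts, gt_counts):
    """MAE and (root) MSE over per-image crowd counts, as commonly
    reported in crowd-counting papers."""
    errs = [p - g for p, g in zip(pred_counts, gt_counts)]
    mae = sum(abs(e) for e in errs) / len(errs)
    mse = math.sqrt(sum(e * e for e in errs) / len(errs))
    return mae, mse

# Hypothetical ground-truth counts; if the model detects no head points,
# every predicted count is 0 and MAE equals the mean ground-truth count.
gt = [100, 300, 500]
mae, mse = count_metrics([0, 0, 0], gt)
print(mae)  # 300.0, i.e. mean(gt)
```

So a constant mae/mse across epochs (like the 296.86 above) is consistent with the detector emitting no points at all, rather than with a metric bug.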
Thank you very much for your wonderful work! When we run your released code with sh jhu.sh, the script gets stuck. We have only one GPU and set jhu.sh to "--nproc_per_node=1 --master_port 5228 train_distributed.py --gpu_id '0'". The output is:
2022-09-16 12:59:31,334 - CLTR - INFO - => no checkpoint found at 'None' best result: 100000.0
2022-09-16 12:59:31,335 - CLTR - INFO - best result = 100000.000
2022-09-16 12:59:31,364 - CLTR - INFO - best result=100000.000 start epoch=0.000
2022-09-16 12:59:31,364 - CLTR - INFO - start training!
and then it hangs there.
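One common cause of a hang right after "start training!" in a torch.distributed launch is that the rendezvous port is already occupied, so the process group never finishes initializing. A quick stdlib check (assuming port 5228 from the jhu.sh command above; the helper name is my own, not part of CLTR):

```python
import socket

def port_is_free(port, host="127.0.0.1"):
    """Return True if the given TCP port can be bound, i.e. is not
    currently held by another listening process."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

# If this prints False, pick a different --master_port for the launch.
print(port_is_free(5228))
```

If the port is free and it still hangs, the stall is more likely inside the data loader or the first forward pass than in the distributed setup itself.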