dk-liang / CLTR

[ECCV 2022] An End-to-End Transformer Model for Crowd Localization
MIT License

training issue #2

Closed. 1286710929 closed this issue 1 year ago.

1286710929 commented 2 years ago

Thank you very much for your wonderful work! When we run your released code with "sh jhu.sh", training hangs. We have only one GPU, so we set jhu.sh to use "--nproc_per_node=1 --master_port 5228 train_distributed.py --gpu_id '0' ". The output is:

2022-09-16 12:59:31,334 - CLTR - INFO - => no checkpoint found at 'None' best result: 100000.0
2022-09-16 12:59:31,335 - CLTR - INFO - best result = 100000.000
2022-09-16 12:59:31,364 - CLTR - INFO - best result=100000.000 start epoch=0.000
2022-09-16 12:59:31,364 - CLTR - INFO - start training!

and then it hangs.

dk-liang commented 2 years ago

With a single GPU, you can run 'python train_distributed.py' directly.
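For context, launchers such as torch.distributed.launch and torchrun export the environment variables WORLD_SIZE and RANK to each worker process, whereas a plain 'python' invocation normally leaves them unset. A minimal sketch of how a script could tell these cases apart is shown here. The helper name launch_mode is hypothetical and is not taken from the CLTR code, which may handle this differently.

```python
import os


def launch_mode(env=None):
    """Classify how a script was started, based on launcher environment variables.

    Hypothetical helper for illustration only; not part of CLTR.
    """
    env = os.environ if env is None else env
    world_size = int(env.get("WORLD_SIZE", "1"))
    if "RANK" in env and world_size > 1:
        # Several worker processes were spawned by a launcher.
        return "distributed"
    if "RANK" in env:
        # Started through a launcher, but with a single process.
        return "launcher-single"
    # No launcher variables: started with plain 'python script.py'.
    return "plain"


print(launch_mode())
```

Passing a dictionary instead of reading os.environ makes the function easy to check by hand, for example launch_mode({"RANK": "0", "WORLD_SIZE": "1"}).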

1286710929 commented 2 years ago

> With a single GPU, you can run 'python train_distributed.py' directly.

Thank you for the useful suggestion! We now have the code running. However, during training the loss hardly decreases, and the test MAE does not change:

2022-09-16 16:20:56,728 - CLTR - INFO - best result = 100000.000
2022-09-16 16:20:56,760 - CLTR - INFO - best result=100000.000 start epoch=0.000
2022-09-16 16:20:56,760 - CLTR - INFO - start training!
2022-09-16 16:21:47,022 - CLTR - INFO - Training Epoch:[0/1500] loss=10.44413 lr=0.000100 epoch_time=50.262
2022-09-16 16:21:47,023 - CLTR - INFO - begin test mae 296.86 mse 734.5822894679669
2022-09-16 16:22:27,807 - CLTR - INFO - Testing Epoch:[0/1500] mae=296.860 mse=734.582 best_mae=296.860
2022-09-16 16:23:17,224 - CLTR - INFO - Training Epoch:[1/1500] loss=6.85613 lr=0.000100 epoch_time=49.417
2022-09-16 16:24:05,478 - CLTR - INFO - Training Epoch:[2/1500] loss=7.91702 lr=0.000100 epoch_time=48.254
2022-09-16 16:24:53,966 - CLTR - INFO - Training Epoch:[3/1500] loss=7.74740 lr=0.000100 epoch_time=48.487
2022-09-16 16:25:41,476 - CLTR - INFO - Training Epoch:[4/1500] loss=7.69251 lr=0.000100 epoch_time=47.509
2022-09-16 16:26:32,635 - CLTR - INFO - Training Epoch:[5/1500] loss=7.63444 lr=0.000100 epoch_time=51.157
2022-09-16 16:27:26,108 - CLTR - INFO - Training Epoch:[6/1500] loss=7.57381 lr=0.000100 epoch_time=53.472
2022-09-16 16:28:20,373 - CLTR - INFO - Training Epoch:[7/1500] loss=7.68320 lr=0.000100 epoch_time=54.263
2022-09-16 16:29:14,253 - CLTR - INFO - Training Epoch:[8/1500] loss=7.71494 lr=0.000100 epoch_time=53.878
2022-09-16 16:30:05,044 - CLTR - INFO - Training Epoch:[9/1500] loss=7.88340 lr=0.000100 epoch_time=50.789
2022-09-16 16:30:58,653 - CLTR - INFO - Training Epoch:[10/1500] loss=7.63586 lr=0.000100 epoch_time=53.607
2022-09-16 16:31:52,705 - CLTR - INFO - Training Epoch:[11/1500] loss=7.76085 lr=0.000100 epoch_time=54.051
2022-09-16 16:32:47,791 - CLTR - INFO - Training Epoch:[12/1500] loss=7.70076 lr=0.000100 epoch_time=55.085
2022-09-16 16:33:41,799 - CLTR - INFO - Training Epoch:[13/1500] loss=7.67467 lr=0.000100 epoch_time=54.007
2022-09-16 16:34:34,340 - CLTR - INFO - Training Epoch:[14/1500] loss=7.92861 lr=0.000100 epoch_time=52.539
2022-09-16 16:35:24,587 - CLTR - INFO - Training Epoch:[15/1500] loss=7.60968 lr=0.000100 epoch_time=50.246
2022-09-16 16:36:17,026 - CLTR - INFO - Training Epoch:[16/1500] loss=7.71725 lr=0.000100 epoch_time=52.437
2022-09-16 16:37:08,004 - CLTR - INFO - Training Epoch:[17/1500] loss=7.67870 lr=0.000100 epoch_time=50.977
2022-09-16 16:37:59,972 - CLTR - INFO - Training Epoch:[18/1500] loss=7.54054 lr=0.000100 epoch_time=51.966
2022-09-16 16:38:58,893 - CLTR - INFO - Training Epoch:[19/1500] loss=7.73937 lr=0.000100 epoch_time=58.919
2022-09-16 16:39:59,276 - CLTR - INFO - Training Epoch:[20/1500] loss=7.64956 lr=0.000100 epoch_time=60.381
2022-09-16 16:39:59,278 - CLTR - INFO - begin test mae 296.86 mse 734.5822894679669
2022-09-16 16:40:53,217 - CLTR - INFO - Testing Epoch:[20/1500] mae=296.860 mse=734.582 best_mae=296.860
2022-09-16 16:41:51,895 - CLTR - INFO - Training Epoch:[21/1500] loss=7.61158 lr=0.000100 epoch_time=58.677
2022-09-16 16:42:45,073 - CLTR - INFO - Training Epoch:[22/1500] loss=7.51920 lr=0.000100 epoch_time=53.177
2022-09-16 16:43:43,006 - CLTR - INFO - Training Epoch:[23/1500] loss=7.72642 lr=0.000100 epoch_time=57.931
2022-09-16 16:44:33,340 - CLTR - INFO - Training Epoch:[24/1500] loss=7.64706 lr=0.000100 epoch_time=50.334
2022-09-16 16:45:25,039 - CLTR - INFO - Training Epoch:[25/1500] loss=7.62301 lr=0.000100 epoch_time=51.697
2022-09-16 16:46:17,073 - CLTR - INFO - Training Epoch:[26/1500] loss=7.59961 lr=0.000100 epoch_time=52.033
2022-09-16 16:47:05,587 - CLTR - INFO - Training Epoch:[27/1500] loss=7.59965 lr=0.000100 epoch_time=48.513
2022-09-16 16:47:57,769 - CLTR - INFO - Training Epoch:[28/1500] loss=7.58723 lr=0.000100 epoch_time=52.180
2022-09-16 16:48:48,925 - CLTR - INFO - Training Epoch:[29/1500] loss=7.55728 lr=0.000100 epoch_time=51.155
2022-09-16 16:49:43,559 - CLTR - INFO - Training Epoch:[30/1500] loss=7.46936 lr=0.000100 epoch_time=54.475
2022-09-16 16:50:36,111 - CLTR - INFO - Training Epoch:[31/1500] loss=7.35466 lr=0.000100 epoch_time=52.551
2022-09-16 16:51:27,748 - CLTR - INFO - Training Epoch:[32/1500] loss=7.36469 lr=0.000100 epoch_time=51.636
2022-09-16 16:52:22,085 - CLTR - INFO - Training Epoch:[33/1500] loss=7.30044 lr=0.000100 epoch_time=54.335
2022-09-16 16:53:16,061 - CLTR - INFO - Training Epoch:[34/1500] loss=7.33537 lr=0.000100 epoch_time=53.974
2022-09-16 16:54:08,333 - CLTR - INFO - Training Epoch:[35/1500] loss=7.22549 lr=0.000100 epoch_time=52.271
2022-09-16 16:55:07,364 - CLTR - INFO - Training Epoch:[36/1500] loss=7.21319 lr=0.000100 epoch_time=59.027
2022-09-16 16:56:00,972 - CLTR - INFO - Training Epoch:[37/1500] loss=7.21706 lr=0.000100 epoch_time=53.606
2022-09-16 16:56:53,343 - CLTR - INFO - Training Epoch:[38/1500] loss=7.33765 lr=0.000100 epoch_time=52.370
2022-09-16 16:57:46,267 - CLTR - INFO - Training Epoch:[39/1500] loss=7.19941 lr=0.000100 epoch_time=52.923
2022-09-16 16:58:38,003 - CLTR - INFO - Training Epoch:[40/1500] loss=7.18334 lr=0.000100 epoch_time=51.735
2022-09-16 16:58:38,003 - CLTR - INFO - begin test mae 296.86 mse 734.5822894679669
2022-09-16 16:59:45,887 - CLTR - INFO - Testing Epoch:[40/1500] mae=296.860 mse=734.582 best_mae=296.860
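To make the trend in a log like this easier to read, the training and testing entries can be extracted with a short script. This is a minimal sketch that assumes the entry format shown in this log. The function name parse_cltr_log is chosen for illustration, and the sample entries are copied from the log in this comment.

```python
import re

# \s+ tolerates line breaks that fall in the middle of an entry.
TRAIN_RE = re.compile(r"Training\s+Epoch:\[(\d+)/\d+\]\s+loss=([\d.]+)")
TEST_RE = re.compile(r"Testing\s+Epoch:\[(\d+)/\d+\]\s+mae=([\d.]+)\s+mse=([\d.]+)")


def parse_cltr_log(text):
    """Return ({epoch: loss}, {epoch: (mae, mse)}) parsed from log text."""
    losses = {int(e): float(v) for e, v in TRAIN_RE.findall(text)}
    tests = {int(e): (float(mae), float(mse)) for e, mae, mse in TEST_RE.findall(text)}
    return losses, tests


sample = """\
2022-09-16 16:21:47,022 - CLTR - INFO - Training Epoch:[0/1500] loss=10.44413 lr=0.000100 epoch_time=50.262
2022-09-16 16:22:27,807 - CLTR - INFO - Testing Epoch:[0/1500] mae=296.860 mse=734.582 best_mae=296.860
2022-09-16 16:39:59,276 - CLTR - INFO - Training Epoch:[20/1500] loss=7.64956 lr=0.000100 epoch_time=60.381
2022-09-16 16:40:53,217 - CLTR - INFO - Testing Epoch:[20/1500] mae=296.860 mse=734.582 best_mae=296.860
2022-09-16 16:58:38,003 - CLTR - INFO - Training Epoch:[40/1500] loss=7.18334 lr=0.000100 epoch_time=51.735
2022-09-16 16:59:45,887 - CLTR - INFO - Testing Epoch:[40/1500] mae=296.860 mse=734.582 best_mae=296.860
"""

losses, tests = parse_cltr_log(sample)
mae_values = {mae for mae, _ in tests.values()}
print("loss by epoch:", losses)
print("MAE identical across all tests:", len(mae_values) == 1)
```

A test MAE that is identical to three decimals across evaluations, while the training loss still moves, suggests that the predictions used for counting are not changing at all.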

ljc108 commented 1 year ago

@1286710929 Hi, I have the same problem. Have you solved it?

1286710929 commented 1 year ago

No, I haven't been able to solve it.

SherlockHolmes221 commented 9 months ago

@dk-liang @1286710929 @ljc108, I have run into the same problem. The MAE stays the same because the model does not detect any head points. Is there a way to solve this?
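This explanation is consistent with how counting metrics behave: if the predicted count is zero for every image, the absolute error on each image equals its ground-truth count, so the MAE equals the mean ground-truth count and cannot change between evaluations. A minimal sketch with made-up counts follows. The function count_errors and the numbers are illustrative only, and the sketch assumes the common crowd-counting convention in which the value reported as "mse" is the root of the mean squared error, which may differ from what the CLTR code computes.

```python
import math


def count_errors(pred_counts, gt_counts):
    """Return (MAE, root-mean-squared error) over per-image counts."""
    if len(pred_counts) != len(gt_counts) or not gt_counts:
        raise ValueError("need two non-empty sequences of equal length")
    diffs = [p - g for p, g in zip(pred_counts, gt_counts)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse


gt = [120, 35, 900, 60]        # made-up ground-truth head counts
no_detections = [0, 0, 0, 0]   # a model that predicts no head points

mae, rmse = count_errors(no_detections, gt)
print("MAE:", mae, "mean ground-truth count:", sum(gt) / len(gt))
print("RMSE:", rmse)
```

Under this assumption, comparing the reported MAE with the mean ground-truth count of the test set is a quick way to check whether the model is predicting zero points everywhere.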