Closed dong-0412 closed 2 years ago
I see. Thank you for pointing out a missing implementation in the single-GPU path.
I will update the code tomorrow, but you can follow the instructions below to revise the code yourself first.
I provide an ipynb called trained_esc50.ipynb in the folder; you can follow it to train ESC-50 from scratch. This should work.
If you want to fix this problem in testing directly from a checkpoint, go to sed_model.py. There are two methods: valid_epoch_end and test_epoch_end.
In valid_epoch_end there is a condition that checks whether the model is running on a single GPU or multiple GPUs (device_count). If the model is running on multiple GPUs, we should use the gather function to collect the results (i.e., preds and targets) from all GPUs. If not, we use the preds and targets on this GPU directly (because there is only one GPU).
But in test_epoch_end I forgot to do this, which is why you hit a problem in testing. You can follow the code in valid_epoch_end to change test_epoch_end. Or, as you said, comment out all the gather and dist functions and change the comparison terms to the preds and targets variables. That should be fine.
I will update this code. If you finish it first, you can also send a pull request, which makes it more convenient for me to merge the change.
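A minimal sketch of the guard described above. The helper name `gather_eval_outputs` and the `{"pred": ..., "target": ...}` output format are assumptions for illustration, not the repo's actual code; the point is that `torch.distributed` is only touched when more than one GPU is in use, matching the valid_epoch_end logic:

```python
import torch
import torch.distributed as dist

def gather_eval_outputs(outputs, device_count=None):
    """Hypothetical helper mirroring the valid_epoch_end logic: concatenate
    per-batch preds/targets, and all_gather across ranks only when more than
    one GPU is in use (so dist is never called on a single GPU)."""
    if device_count is None:
        device_count = torch.cuda.device_count()
    # Concatenate the per-batch results collected during the test epoch.
    pred = torch.cat([o["pred"] for o in outputs], dim=0)
    target = torch.cat([o["target"] for o in outputs], dim=0)
    if device_count > 1:
        # Multi-GPU path: requires an initialized process group.
        gather_pred = [torch.zeros_like(pred) for _ in range(dist.get_world_size())]
        gather_target = [torch.zeros_like(target) for _ in range(dist.get_world_size())]
        dist.all_gather(gather_pred, pred)
        dist.all_gather(gather_target, target)
        pred = torch.cat(gather_pred, dim=0)
        target = torch.cat(gather_target, dim=0)
    # Single-GPU path: pred/target are used directly.
    return pred, target
```

test_epoch_end would then compute its metrics from the returned pred/target pair instead of calling the gather/dist functions unconditionally.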
Global seed set to 970131
each batch size: 8
INFO:root:total dataset size: 400
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Testing: 98%|█████████▊| 49/50 [00:03<00:00, 22.58it/s]
Traceback (most recent call last):
File "E:/HTS/HTS/main.py", line 429, in
File "E:/HTS/HTS/main.py", line 417, in main
File "E:/HTS/HTS/main.py", line 247, in test
File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 911, in test
File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 685, in _call_and_handle_interrupt
File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 954, in _test_impl
File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1199, in _run
File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1275, in _dispatch
File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\pytorch_lightning\plugins\training_type\training_type_plugin.py", line 206, in start_evaluating
File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1286, in run_stage
File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1334, in _run_evaluate
File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\pytorch_lightning\loops\base.py", line 151, in run
File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\pytorch_lightning\loops\dataloader\evaluation_loop.py", line 131, in on_run_end
File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\pytorch_lightning\loops\dataloader\evaluation_loop.py", line 231, in _evaluation_epoch_end
File "E:\HTS\HTS\sed_model.py", line 201, in test_epoch_end
File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\torch\distributed\distributed_c10d.py", line 748, in get_world_size
File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\torch\distributed\distributed_c10d.py", line 274, in _get_group_size
File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\torch\distributed\distributed_c10d.py", line 358, in _get_default_group
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
Process finished with exit code 1
I ran the test process on the ESC-50 dataset with a single GPU and got the error above. After I commented out the code related to gather_pred, gather_target, and the dist.xxxxx calls, it works.
I want to know whether I can run the code on a single GPU with the changes above.
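For reference, the RuntimeError in the traceback is raised because torch.distributed.get_world_size() is called before init_process_group has ever run. A hedged sketch (not the repo's code) of a guard that keeps single-process runs away from the default process group entirely:

```python
import torch.distributed as dist

def world_size_or_one():
    """Return the distributed world size, or 1 when no process group has
    been initialized (e.g. a single-GPU test run). Checking is_initialized()
    first avoids the 'Default process group has not been initialized' error."""
    if dist.is_available() and dist.is_initialized():
        return dist.get_world_size()
    return 1
```

With a guard like this, the same test_epoch_end code can run unchanged on one GPU or many, instead of commenting the dist calls out by hand.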