Closed dong-0412 closed 2 years ago
I see. Thank you for pointing out a missing implementation in the single-GPU path.
I will update the code tomorrow, but you can follow the instructions below to revise the code yourself first.
I provide an ipynb called trained_esc50.ipynb in the folder; you can follow it to train ESC-50 from scratch. This should work.
If you want to fix this problem in testing directly from a checkpoint, go to sed_model.py. There are two methods: valid_epoch_end and test_epoch_end.
In valid_epoch_end there is a condition that checks whether the model is running on a single GPU or multiple GPUs (device_count). If the model is running on multiple GPUs, we should use the gather function to collect the results (i.e., preds and targets) from all GPUs. If not, we use the preds and targets on this GPU directly (because there is only one GPU).
But in test_epoch_end I forgot to do this, which is why you hit a problem in testing. You can follow the code in valid_epoch_end to change test_epoch_end. Or, as you said, comment out all the gather and dist functions and change the comparison terms to the preds and targets variables. That should be fine.
I will update this code. If you finish it first, you can also send a pull request, which makes it more convenient for me to merge the change.
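A minimal sketch of the guard described above. The helper name `gather_eval_outputs` and the `{"pred": ..., "target": ...}` output format are assumptions for illustration, not the repo's actual code; the point is that `torch.distributed` is only touched when more than one GPU is in use, matching the valid_epoch_end logic:

```python
import torch
import torch.distributed as dist

def gather_eval_outputs(outputs, device_count=None):
    """Hypothetical helper mirroring the valid_epoch_end logic: concatenate
    per-batch preds/targets, and all_gather across ranks only when more than
    one GPU is in use (so dist is never called on a single GPU)."""
    if device_count is None:
        device_count = torch.cuda.device_count()
    # Concatenate the per-batch results collected during the test epoch.
    pred = torch.cat([o["pred"] for o in outputs], dim=0)
    target = torch.cat([o["target"] for o in outputs], dim=0)
    if device_count > 1:
        # Multi-GPU path: requires an initialized process group.
        gather_pred = [torch.zeros_like(pred) for _ in range(dist.get_world_size())]
        gather_target = [torch.zeros_like(target) for _ in range(dist.get_world_size())]
        dist.all_gather(gather_pred, pred)
        dist.all_gather(gather_target, target)
        pred = torch.cat(gather_pred, dim=0)
        target = torch.cat(gather_target, dim=0)
    # Single-GPU path: pred/target are used directly.
    return pred, target
```

test_epoch_end would then compute its metrics from the returned pred/target pair instead of calling the gather/dist functions unconditionally.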
Global seed set to 970131
each batch size: 8
INFO:root:total dataset size: 400
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Testing: 98%|█████████▊| 49/50 [00:03<00:00, 22.58it/s]
Traceback (most recent call last):
File "E:/HTS/HTS/main.py", line 429, in
File "E:/HTS/HTS/main.py", line 417, in main
File "E:/HTS/HTS/main.py", line 247, in test
File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 911, in test
File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 685, in _call_and_handle_interrupt
File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 954, in _test_impl
File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1199, in _run
File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1275, in _dispatch
File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\pytorch_lightning\plugins\training_type\training_type_plugin.py", line 206, in start_evaluating
File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1286, in run_stage
File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1334, in _run_evaluate
File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\pytorch_lightning\loops\base.py", line 151, in run
File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\pytorch_lightning\loops\dataloader\evaluation_loop.py", line 131, in on_run_end
File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\pytorch_lightning\loops\dataloader\evaluation_loop.py", line 231, in _evaluation_epoch_end
File "E:\HTS\HTS\sed_model.py", line 201, in test_epoch_end
File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\torch\distributed\distributed_c10d.py", line 748, in get_world_size
File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\torch\distributed\distributed_c10d.py", line 274, in _get_group_size
File "C:\Users\Administrator\anaconda3\envs\torchtf\lib\site-packages\torch\distributed\distributed_c10d.py", line 358, in _get_default_group
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
Process finished with exit code 1
I ran the test process on the ESC-50 dataset with a single GPU and got the error above. After I commented out the code related to gather_pred, gather_target, and the dist.xxxxx calls, it works.
I want to know whether I can run the code on a single GPU with the changes above.
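For reference, the RuntimeError in the traceback is raised because torch.distributed.get_world_size() is called before init_process_group has ever run. A hedged sketch (not the repo's code) of a guard that keeps single-process runs away from the default process group entirely:

```python
import torch.distributed as dist

def world_size_or_one():
    """Return the distributed world size, or 1 when no process group has
    been initialized (e.g. a single-GPU test run). Checking is_initialized()
    first avoids the 'Default process group has not been initialized' error."""
    if dist.is_available() and dist.is_initialized():
        return dist.get_world_size()
    return 1
```

With a guard like this, the same test_epoch_end code can run unchanged on one GPU or many, instead of commenting the dist calls out by hand.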