Open koush opened 1 week ago
Hi,
Could you please tell me your Git commit version? I recently fixed a very similar issue in commit 309271089a6f916197e4e7977f77738ba1521bfb.
Best regards, Henry Tsui
commit 2522f723d0db5c72a6e49a7331b844290ef0af34 (HEAD -> main, origin/main, origin/TEST, origin/HEAD)
Author: henrytsui000 <henrytsui000@gmail.com>
Date: Tue Nov 5 14:43:04 2024 +0800
[Pass] test in multiclass label&dynamic shape
Updated to 959b9b05667f6b9a1f349bc2c9843d039e405f60, issue persists.
Hi,
Can you try turning off the dynamic_shape setting in yolo/config/task/validation.yaml? You can do this by modifying the configuration as follows:
task: validation
data:
...
dynamic_shape: False
...
Alternatively, you can disable it during training with the following command:
python yolo/lazy.py task=train ... task.validation.data.dynamic_shape=False
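The command-line form uses a dotted path to reach the same dynamic_shape key that the YAML edit changes. As a rough sketch of how such an override maps onto a nested config (plain Python, not the project's actual config loader; apply_override and the trimmed config dictionary are hypothetical names used only for illustration):

```python
from copy import deepcopy


def apply_override(config, override):
    """Return a copy of config with one 'a.b.c=value' override applied."""
    dotted_key, _, raw_value = override.partition("=")
    # Interpret the common literal spellings; leave anything else as a string.
    literals = {"true": True, "false": False, "null": None}
    value = literals.get(raw_value.lower(), raw_value)

    result = deepcopy(config)
    node = result
    *parents, leaf = dotted_key.split(".")
    for name in parents:
        # Walk down the nested dicts, creating levels that do not exist yet.
        node = node.setdefault(name, {})
    node[leaf] = value
    return result


# Hypothetical, trimmed-down stand-in for the real training config.
config = {"task": {"validation": {"data": {"dynamic_shape": True}}}}
updated = apply_override(config, "task.validation.data.dynamic_shape=False")
print(updated["task"]["validation"]["data"]["dynamic_shape"])
```

The original dictionary is left untouched, so the override only affects the run it is passed to, which matches the idea of disabling dynamic_shape for one training command without editing validation.yaml.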
I suspect the issue is caused by the sampler and the dynamic_shape setting. Turning it off will disable the auto-adjustment of the input shape in the validation phase. While this might result in a slightly lower mAP, it will enable multiple GPU validation.
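To illustrate the suspicion above, here is a toy sketch of how sharding a validation set across two GPUs can leave each rank with a differently shaped batch when the input shape is adjusted per batch. It is not the project's sampler or data loader: shard_indices, batch_shape, the image sizes, and the 640x640 fixed size are all made up for illustration, and whether mismatched shapes are what actually breaks multi-GPU validation is unconfirmed.

```python
def shard_indices(num_samples, world_size, rank):
    """Round-robin split of dataset indices across ranks (illustrative only)."""
    return list(range(rank, num_samples, world_size))


def batch_shape(image_sizes, indices, dynamic_shape, fixed_size=(640, 640)):
    """Shape a batch would be padded to: largest image if dynamic, else fixed."""
    if not dynamic_shape:
        return fixed_size
    heights = [image_sizes[i][0] for i in indices]
    widths = [image_sizes[i][1] for i in indices]
    return (max(heights), max(widths))


# Made-up image sizes (height, width) for a four-image validation set.
image_sizes = [(480, 640), (640, 480), (512, 512), (320, 640)]

for dynamic in (True, False):
    shapes = [
        batch_shape(image_sizes, shard_indices(len(image_sizes), 2, rank), dynamic)
        for rank in range(2)
    ]
    print(f"dynamic_shape={dynamic}: per-rank batch shapes {shapes}")
```

In this toy setup the two ranks get different batch shapes with dynamic_shape enabled and identical ones with it disabled.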
If you need a higher-performance model, you can perform validation on a single GPU after training, or you can wait for me to find time to fix this properly.
Best regards,
Henry Tsui
That extra command line parameter seems to have suppressed the issue.
Training completes 1 epoch, performs the validation step seemingly with no error, and then hangs.
Describe the bug
Running on a system with multiple GPUs fails.
To Reproduce
Set up a system with 2 GPUs.
Run the training command:
Expected behavior
Training proceeds.
Screenshots
If applicable, add screenshots to help explain your problem.
System Info (please complete the following information):
Additional Notes
Modifying lazy.py to use num_devices=1, num_nodes=1 works as expected. DDP is failing.