Parallel training hangs in reduce_tensor method

NVlabs / GroupViT

Official PyTorch implementation of GroupViT: Semantic Segmentation Emerges from Text Supervision, CVPR 2022.

https://arxiv.org/abs/2202.11094

Parallel training hangs in reduce_tensor method #51

Open ayushi-3536 opened 1 year ago

ayushi-3536 commented 1 year ago

Hi,

I was training the network with validation on the COCO dataset, using 8 GPUs on a single node. Training appears to hang in the reduce_tensor method called inside validate_cls(). Is there a known solution for this issue?
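For readers following along, here is a minimal sketch of what a reduce_tensor-style helper usually looks like in PyTorch training code. It is an assumption about the general pattern, not a copy of the repository's implementation, and `single_process_check` is a hypothetical smoke test added only for illustration. The relevant point is that `dist.all_reduce` is a collective call: it blocks until every rank in the process group reaches it, so a hang there often means that at least one rank never made the matching call.

```python
import os
import tempfile

import torch
import torch.distributed as dist


def reduce_tensor(tensor: torch.Tensor) -> torch.Tensor:
    """Average a floating-point tensor across all ranks (illustrative sketch)."""
    reduced = tensor.clone()
    # Collective call: blocks until every rank in the group has called it.
    dist.all_reduce(reduced, op=dist.ReduceOp.SUM)
    reduced /= dist.get_world_size()
    return reduced


def single_process_check() -> float:
    """Hypothetical smoke test: run the helper in a one-rank gloo group on CPU."""
    with tempfile.TemporaryDirectory() as tmp:
        init_file = os.path.join(tmp, "dist_init")
        dist.init_process_group(
            backend="gloo",
            init_method=f"file://{init_file}",
            rank=0,
            world_size=1,
        )
        try:
            # With a single rank, the average equals the input value.
            return reduce_tensor(torch.tensor(4.0)).item()
        finally:
            dist.destroy_process_group()


if __name__ == "__main__":
    print(single_process_check())
```

If the helper works in a single process but stalls with 8 GPUs, one thing that may be worth checking is whether every rank runs the same number of validation iterations, since an uneven count leaves some ranks waiting in the collective.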

xvjiarui commented 1 year ago

Hi @ayushi-3536

I haven't encountered this issue.

Could you please provide your PyTorch version?
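A small script along these lines could gather the version details in one go. This is only a sketch: `collect_env_report` is a hypothetical helper name, and the fields it reports are an assumption about what is useful for a distributed-training hang.

```python
import importlib.util
import platform


def collect_env_report() -> dict:
    """Gather version details commonly requested in distributed-training bug reports."""
    report = {
        "python": platform.python_version(),
        "torch": None,
        "cuda": None,
        "nccl": None,
        "gpu_count": 0,
    }
    if importlib.util.find_spec("torch") is None:
        return report  # torch is not installed in this environment

    import torch

    report["torch"] = torch.__version__
    report["cuda"] = torch.version.cuda  # None for CPU-only builds
    if torch.cuda.is_available():
        report["gpu_count"] = torch.cuda.device_count()
        # Only query NCCL when this build actually ships with it.
        if torch.distributed.is_available() and torch.distributed.is_nccl_available():
            report["nccl"] = torch.cuda.nccl.version()
    return report


if __name__ == "__main__":
    for key, value in collect_env_report().items():
        print(f"{key}: {value}")
```

PyTorch also ships `python -m torch.utils.collect_env`, which prints a more complete report.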

Also, could you try running validation first?