amazon-science / polygon-transformer


torch.distributed.barrier(group=group) #23

Closed. CauchyFanUpdate closed this issue 4 months ago.

CauchyFanUpdate commented 8 months ago

I'm facing a persistent blocking issue with torch.distributed.barrier(group=group): training hangs at this call and never proceeds. What could be the cause of this?
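
In case it helps to narrow things down, below is a minimal, self-contained barrier check. It is not part of this repo; it assumes a single node with 8 GPUs launched via torchrun. If this script also hangs, the problem is likely in the NCCL/driver environment rather than in the training code.

```python
# minimal_barrier_test.py -- hypothetical standalone check, not from this repo.
# Launch with: torchrun --nproc_per_node=8 minimal_barrier_test.py
import os
import datetime

import torch
import torch.distributed as dist


def main():
    # torchrun sets LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # A short timeout so a hang surfaces as an error instead of blocking
    # forever (enforced when NCCL async error handling is enabled, which
    # recent PyTorch versions turn on by default).
    dist.init_process_group(
        backend="nccl",
        timeout=datetime.timedelta(seconds=60),
    )

    print(f"rank {dist.get_rank()} of {dist.get_world_size()} entering barrier")
    dist.barrier()
    print(f"rank {dist.get_rank()} passed the barrier")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```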

CauchyFanUpdate commented 8 months ago

I've noticed that both pretraining and finetuning get stuck right after this log line: 'trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.layers.0.downsample.reduction.bias <- decoder.cls_head.bias'. Do you know what could cause this? I'm running on 8 RTX 3090s.
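
One way to get more detail about where the run is stuck is to enable verbose distributed logging before torch.distributed is initialized. These are standard PyTorch/NCCL environment variables; setting them at the very top of the entry script is just one option (exporting them in the shell before launching works equally well). A minimal sketch:

```python
# Hypothetical snippet: place before torch.distributed.init_process_group is
# called so the extra logging takes effect for the whole run.
import os

os.environ.setdefault("NCCL_DEBUG", "INFO")                 # NCCL-level logs
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")  # extra c10d checks
```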

joellliu commented 8 months ago

Hi, I am not sure what is happening here. Did you get any error messages? It might be an out-of-memory issue. You could try reducing the batch size and see if that helps.
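
To test the out-of-memory hypothesis, one quick check is the per-GPU memory usage while the job appears stuck. Here is a small hypothetical helper (not part of polygon-transformer) using torch.cuda.mem_get_info; nvidia-smi on the node gives the same information.

```python
# gpu_mem_report.py -- hypothetical helper, not part of this repo.
# Run on the training node while the job is stuck to see how much memory
# each GPU is actually using.
import torch


def report_gpu_memory():
    for i in range(torch.cuda.device_count()):
        free_bytes, total_bytes = torch.cuda.mem_get_info(i)
        used_gib = (total_bytes - free_bytes) / 1024 ** 3
        total_gib = total_bytes / 1024 ** 3
        print(f"cuda:{i}: {used_gib:.1f} / {total_gib:.1f} GiB in use")


if __name__ == "__main__":
    report_gpu_memory()
```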