Sorry for answering my own question.
The error comes from SyncBatchNorm: in older torch versions, SyncBatchNorm does not automatically fall back to regular BatchNorm when no process group has been initialized. Upgrading to a later version (torch 2.0) fixes this.
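For anyone hitting the same thing, here is a minimal repro sketch of the failure mode described above (the sizes are illustrative and the exact error text depends on the torch version):

```python
import torch
import torch.nn as nn

# Illustrative repro: note there is no torch.distributed.init_process_group()
# call. On older torch versions, running SyncBatchNorm in training mode
# outside an initialized process group fails; on torch 2.0+ it is expected
# to fall back to plain batch-norm statistics.
device = "cuda" if torch.cuda.is_available() else "cpu"
sbn = nn.SyncBatchNorm(num_features=8).to(device)
sbn.train()
x = torch.randn(4, 8, 16, 16, device=device)  # (N, C, H, W)
y = sbn(x)
print(torch.__version__, y.shape)
```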
Changing the torch-related pins in constraints.txt to
torch==2.0.1
torchaudio==2.0.2
torchdata==0.6.1
torchtext==0.15.2
torchvision==0.15.2
fixes the problem.
https://github.com/pytorch/pytorch/pull/89706#issue-1465104942
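If upgrading isn't an option, a workaround is to convert the SyncBatchNorm layers back to plain BatchNorm before a single-GPU run. A minimal sketch (the helper name `revert_sync_batchnorm` and the BatchNorm2d choice for 4D inputs are my own assumptions, not a torch API):

```python
import torch.nn as nn

def revert_sync_batchnorm(module: nn.Module) -> nn.Module:
    # Hypothetical helper (not part of torch): recursively swap SyncBatchNorm
    # for BatchNorm2d so the model runs without a process group. Assumes 4D
    # (N, C, H, W) inputs; use BatchNorm1d/3d for other input shapes.
    converted = module
    if isinstance(module, nn.SyncBatchNorm):
        converted = nn.BatchNorm2d(
            module.num_features,
            eps=module.eps,
            momentum=module.momentum,
            affine=module.affine,
            track_running_stats=module.track_running_stats,
        )
        if module.affine:
            converted.weight = module.weight
            converted.bias = module.bias
        if module.track_running_stats:
            converted.running_mean = module.running_mean
            converted.running_var = module.running_var
            converted.num_batches_tracked = module.num_batches_tracked
    for name, child in module.named_children():
        converted.add_module(name, revert_sync_batchnorm(child))
    return converted

# Usage: model = revert_sync_batchnorm(model)
```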
Running a training script on Linux with a single GPU throws a distributed-training error, but the same script trains fine on a Mac (no GPU).
Looking at the error, it seems to be related to distributed training. How can I train with a single GPU?
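As a general pattern for this, one option is to only enable SyncBatchNorm when a multi-rank process group is actually running. This sketch uses torch's real `nn.SyncBatchNorm.convert_sync_batchnorm`; the wrapper name `maybe_sync_batchnorm` is just illustrative:

```python
import torch.distributed as dist
import torch.nn as nn

def maybe_sync_batchnorm(model: nn.Module) -> nn.Module:
    # Illustrative wrapper: convert BatchNorm -> SyncBatchNorm only when a
    # process group with more than one rank is initialized; otherwise keep
    # plain BatchNorm so single-GPU/CPU training works unchanged.
    if dist.is_available() and dist.is_initialized() and dist.get_world_size() > 1:
        return nn.SyncBatchNorm.convert_sync_batchnorm(model)
    return model
```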