apple / ml-cvnets

CVNets: A library for training computer vision networks
https://apple.github.io/ml-cvnets

Runtime error on single GPU Linux environment training #79

Closed darwinharianto closed 1 year ago

darwinharianto commented 1 year ago

Running a training script on Linux with a single GPU throws a distributed-training error, but the same training runs fine on a Mac (no GPU).

2023-07-18 16:47:04 - LOGS    - Training took 00:00:03.57
Traceback (most recent call last):
  File "~anaconda3/envs/cvnets/bin/cvnets-train", line 8, in <module>
    sys.exit(main_worker())
  File "~/ml-cvnets/main_train.py", line 237, in main_worker
    main(opts=opts, **kwargs)
  File "~anaconda3/envs/cvnets/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "~/ml-cvnets/main_train.py", line 176, in main
    training_engine.run(train_sampler=train_sampler)
  File "~/ml-cvnets/engine/training_engine.py", line 723, in run
    raise e
  File "~/ml-cvnets/engine/training_engine.py", line 606, in run
    train_loss, train_ckpt_metric = self.train_epoch(epoch)
  File "~/ml-cvnets/engine/training_engine.py", line 262, in train_epoch
    pred_label = self.model(samples)
  File "~anaconda3/envs/cvnets/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1208, in _call_impl
    result = forward_call(*input, **kwargs)
  File "~/ml-cvnets/cvnets/models/segmentation/enc_dec.py", line 98, in forward
    enc_end_points: Dict = self.encoder.extract_end_points_all(
  File "~/ml-cvnets/cvnets/models/classification/base_image_encoder.py", line 233, in extract_end_points_all
    x = self._forward_layer(self.conv_1, x)  # 112 x112
  File "~/ml-cvnets/cvnets/models/classification/base_image_encoder.py", line 203, in _forward_layer
    else layer(x)
  File "~anaconda3/envs/cvnets/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1208, in _call_impl
    result = forward_call(*input, **kwargs)
  File "~/ml-cvnets/cvnets/layers/conv_layer.py", line 255, in forward
    return self.block(x)
  File "~anaconda3/envs/cvnets/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1208, in _call_impl
    result = forward_call(*input, **kwargs)
  File "~anaconda3/envs/cvnets/lib/python3.10/site-packages/torch/nn/modules/container.py", line 204, in forward
    input = module(input)
  File "~anaconda3/envs/cvnets/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1208, in _call_impl
    result = forward_call(*input, **kwargs)
  File "~anaconda3/envs/cvnets/lib/python3.10/site-packages/torch/nn/modules/batchnorm.py", line 735, in forward
    world_size = torch.distributed.get_world_size(process_group)
  File "~anaconda3/envs/cvnets/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1067, in get_world_size
    return _get_group_size(group)
  File "~anaconda3/envs/cvnets/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 453, in _get_group_size
    default_pg = _get_default_group()
  File "~anaconda3/envs/cvnets/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 584, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

Looking at the error, it seems to be related to distributed training. How can I train with one GPU?

darwinharianto commented 1 year ago

Sorry for answering my own question.

The error originates from SyncBatchNorm: in older torch versions, SyncBatchNorm doesn't automatically fall back to BatchNorm. Upgrading to a later version (torch 2.0) fixes this.

Changing the torch-related entries in constraint.txt to

torch                    2.0.1
torchaudio               2.0.2
torchdata                0.6.1
torchtext                0.15.2
torchvision              0.15.2

fixes the problem.

https://github.com/pytorch/pytorch/pull/89706#issue-1465104942
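For anyone hitting the same traceback, a quick way to see whether an environment is still behind the pins listed above is to compare installed versions against them. The following is only a minimal sketch using the standard library; the helper names (`release_tuple`, `is_older_than`, `report`) are made up for illustration and are not part of ml-cvnets or torch, and the version parsing is deliberately simplistic (it ignores local and pre-release suffixes such as `+cu117`).

```python
from importlib import metadata

# Pins copied from the comment above.
PINNED = {
    "torch": "2.0.1",
    "torchaudio": "2.0.2",
    "torchdata": "0.6.1",
    "torchtext": "0.15.2",
    "torchvision": "0.15.2",
}


def release_tuple(version: str) -> tuple:
    """Turn a version such as '1.13.1+cu117' into (1, 13, 1).

    Hypothetical helper: keeps only the leading numeric components and
    drops local ('+cu117') and pre-release ('a0', 'dev...') suffixes.
    """
    public = version.split("+", 1)[0]
    parts = []
    for piece in public.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_older_than(installed: str, pinned: str) -> bool:
    """True if the installed release is strictly older than the pin."""
    return release_tuple(installed) < release_tuple(pinned)


def report(pins=PINNED) -> dict:
    """Map each package to (installed version or None, pinned version, status)."""
    result = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed is None:
            status = "missing"
        elif is_older_than(installed, pinned):
            status = "older"
        else:
            status = "ok"
        result[name] = (installed, pinned, status)
    return result


if __name__ == "__main__":
    for name, (installed, pinned, status) in report().items():
        print(f"{name:12s} installed={installed} pinned={pinned} -> {status}")
```

Any package reported as `older` (torch in particular) would be a candidate for the upgrade described above; `missing` simply means the package is not installed in the active environment.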