facebookresearch / jepa

PyTorch code and models for V-JEPA self-supervised learning from video.

ValueError: Default process group has not been initialized, please make sure to call init_process_group #55

Open MKaczkow opened 5 months ago

MKaczkow commented 5 months ago

First of all, thanks for providing this code πŸ˜„

tl;dr

I am getting a ValueError when trying to run evaluation on the iNat21 dataset with python -m evals.main --fname configs/evals/vitl16_inat.yaml --devices cuda:0, and I am running out of ideas on how to fix it.

Full stacktrace

(venv) PS D:\__repos\jepa> python -m evals.main --fname configs/evals/vitl16_inat.yaml
INFO:root:called-params configs/evals/vitl16_inat.yaml
INFO:root:loaded params...
{   'data': {   'dataset_name': 'iNat21',
                'image_folder': 'inat',
                'num_classes': 10000,
                'resolution': 224,
                'root_path': 'D:\\__repos\\jepa\\data'},
    'eval_name': 'image_classification_frozen',
    'nodes': 8,
    'optimization': {   'batch_size': 16,
                        'final_lr': 0.0,
                        'lr': 0.001,
                        'num_epochs': 20,
                        'start_lr': 0.001,
                        'use_bfloat16': True,
                        'warmup': 0.0,
                        'weight_decay': 0.001},
    'pretrain': {   'checkpoint': 'vitl16.pth.tar',
                    'checkpoint_key': 'target_encoder',
                    'clip_duration': None,
                    'folder': 'D:\\__repos\\jepa\\models',
                    'frames_per_clip': 16,
                    'model_name': 'vit_large',
                    'patch_size': 16,
                    'tight_silu': False,
                    'tubelet_size': 2,
                    'uniform_power': True,
                    'use_sdpa': True,
                    'use_silu': False,
                    'write_tag': 'jepa'},
    'resume_checkpoint': False,
    'tag': 'inat-16f',
    'tasks_per_node': 8}
D:\__repos\jepa\venv\lib\site-packages\torch\distributed\distributed_c10d.py:608: UserWarning: Attempted to get default timeout for nccl backend, but NCCL support is not compiled
  warnings.warn("Attempted to get default timeout for nccl backend, but NCCL support is not compiled")
INFO:root:Rank: 0. Distributed training not available Distributed package doesn't have NCCL built in
INFO:root:Running... (rank: 0/1)
INFO:root:Running evaluation: image_classification_frozen
INFO:root:SLURM vars not set (distributed training not available)
INFO:root:Initialized (rank/world-size) 0/1
INFO:root:Loading pretrained model from D:\__repos\jepa\models\vitl16.pth.tar
VisionTransformer(
  (patch_embed): PatchEmbed3D(
    (proj): Conv3d(3, 1024, kernel_size=(2, 16, 16), stride=(2, 16, 16))
  )
  (blocks): ModuleList(
    (0-23): 24 x Block(
      (norm1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
      (attn): Attention(
        (qkv): Linear(in_features=1024, out_features=3072, bias=True)
        (attn_drop): Dropout(p=0.0, inplace=False)
        (proj): Linear(in_features=1024, out_features=1024, bias=True)
        (proj_drop): Dropout(p=0.0, inplace=False)
      )
      (norm2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
      (mlp): MLP(
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (act): GELU(approximate='none')
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (drop): Dropout(p=0.0, inplace=False)
      )
    )
  )
  (norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
)
INFO:root:loaded pretrained model with msg: <All keys matched successfully>
INFO:root:loaded pretrained encoder from epoch: 300
 path: D:\__repos\jepa\models\vitl16.pth.tar
INFO:root:implementing auto-agument strategy
INFO:root:data-path D:\__repos\jepa\data\inat\train/
INFO:root:Initialized ImageFolder
INFO:root:ImageFolder dataset created
INFO:root:ImageFolder unsupervised data loader created
INFO:root:data-path D:\__repos\jepa\data\inat\val/
INFO:root:Initialized ImageFolder
INFO:root:ImageFolder dataset created
INFO:root:ImageFolder unsupervised data loader created
INFO:root:Dataloader created... iterations per epoch: 31250
INFO:root:Using AdamW
Process Process-1:
Traceback (most recent call last):
  File "C:\Users\Maciek\AppData\Local\Programs\Python\Python310\lib\multiprocessing\process.py", line 315, in _bootstrap
    self.run()
  File "C:\Users\Maciek\AppData\Local\Programs\Python\Python310\lib\multiprocessing\process.py", line 108, in run    self._target(*self._args, **self._kwargs)
  File "D:\__repos\jepa\evals\main.py", line 57, in process_main
    eval_main(params['eval_name'], args_eval=params)
  File "D:\__repos\jepa\evals\scaffold.py", line 22, in main
    return importlib.import_module(f'evals.{eval_name}.eval').main(
  File "D:\__repos\jepa\evals\image_classification_frozen\eval.py", line 201, in main
    classifier = DistributedDataParallel(classifier, static_graph=True)
  File "D:\__repos\jepa\venv\lib\site-packages\torch\nn\parallel\distributed.py", line 731, in __init__
    self.process_group = _get_default_group()
  File "D:\__repos\jepa\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 977, in _get_default_group
    raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.
LangDaniel commented 5 months ago

You could just remove the lines initializing DistributedDataParallel in app/vjepa/train.py (i.e. lines 295-297) as a quick fix.
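A minimal sketch of that quick fix, rather than deleting the lines outright: guard the wrapping so it only happens when a default process group actually exists. The helper name maybe_wrap_ddp is hypothetical, and the exact DistributedDataParallel arguments used in app/vjepa/train.py may differ from what is shown here.

import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel


def maybe_wrap_ddp(module: nn.Module) -> nn.Module:
    # Wrap in DDP only when torch.distributed has an initialized default
    # process group; on a plain single-process run, return the module
    # unchanged so no init_process_group call is required.
    if dist.is_available() and dist.is_initialized():
        return DistributedDataParallel(module)
    return module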

MKaczkow commented 5 months ago

It didn't help, I'm afraid; I am still getting:

ValueError: Default process group has not been initialized, please make sure to call init_process_group.
LangDaniel commented 4 months ago

Did you also try to remove it from the eval scripts, i.e. line 201 in evals/image_classification_frozen/eval.py?
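For context, the line the traceback above points at is classifier = DistributedDataParallel(classifier, static_graph=True). A hedged sketch of the same guard applied there, assuming classifier is the frozen-probe module the eval script builds just before that line:

import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# evals/image_classification_frozen/eval.py, around line 201 (sketch):
# only wrap the classifier in DDP when a default process group exists,
# otherwise keep the plain nn.Module so single-GPU runs do not raise.
if dist.is_available() and dist.is_initialized():
    classifier = DistributedDataParallel(classifier, static_graph=True)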

krstevskipetar commented 4 months ago

I faced the same issue using a single GPU on one machine. I got it working by changing the port and explicitly defining the rank and world size. For evaluation, you can edit line 131 in evals/video_classification_frozen/eval.py to be world_size, rank = init_distributed(port=12321, rank_and_world_size=(0, 1)).
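As a rough illustration of that edit (the init_distributed call, its port and rank_and_world_size arguments, and the returned world_size, rank pair are taken from the comment above; init_distributed is assumed to be the distributed-setup helper the eval scripts already import):

# evals/video_classification_frozen/eval.py, line 131 (sketch of the edit):
# explicitly pass rank 0 and world size 1 so the default process group is
# initialized for a single local process, and pick an unused rendezvous port.
world_size, rank = init_distributed(port=12321, rank_and_world_size=(0, 1))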