czczup / ViT-Adapter

[ICLR 2023 Spotlight] Vision Transformer Adapter for Dense Predictions
https://arxiv.org/abs/2205.08534
Apache License 2.0
1.18k stars 130 forks

Retraining not starting, exceptions and no GPU usage #78

Open db141 opened 1 year ago

db141 commented 1 year ago

Hi, thanks for sharing your great work! I tried to run a retraining on the Cityscapes dataset, but unfortunately it gets stuck after throwing several exceptions and does not use the GPUs at all. It never raises a fatal error, it just stops producing any output. What can I do?

Thanks and best regards, daboh

  1. config:

```bash
#!/bin/bash

CONFIG=configs/cityscapes/mask2former_beit_adapter_large_896_80k_cityscapes_ss.py
GPUS=2
PORT=${PORT:-29300}

PYTHONPATH="$(dirname $0)/..":$PYTHONPATH \
python -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT \
    train.py $CONFIG --launcher pytorch --deterministic ${@:3}
```

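The config this script points at loads a local BEiT checkpoint while the model is being built. A rough sketch of the relevant fragment, assuming the usual MMSegmentation config layout (only the `pretrained` path and the class names come from the traceback below; everything else is illustrative, not copied from the repo):

```python
# Illustrative sketch of the part of
# mask2former_beit_adapter_large_896_80k_cityscapes_ss.py that matters here;
# apart from the 'pretrained' path and the class names taken from the
# traceback, the structure is an assumption.
model = dict(
    type='EncoderDecoderMask2Former',
    # Read from disk while the model is constructed, so a missing file aborts
    # the run before any work ever reaches the GPUs.
    pretrained='pretrained/beit_large_patch16_224_pt22k_ft22k.pth',
    backbone=dict(type='BEiTAdapter'),
)
```

The path is resolved relative to the working directory, so the file has to sit in a local `pretrained/` folder.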

Console output:
```
Traceback (most recent call last):
  File "train.py", line 215, in <module>
    self.backbone = builder.build_backbone(backbone)
  File "/anaconda/envs/jupyter_env/lib/python3.8/site-packages/mmseg/models/builder.py", line 20, in build_backbone
    return BACKBONES.build(cfg)
  File "/anaconda/envs/jupyter_env/lib/python3.8/site-packages/mmcv/utils/registry.py", line 212, in build
    return self.build_func(*args, **kwargs, registry=self)
  File "/anaconda/envs/jupyter_env/lib/python3.8/site-packages/mmcv/cnn/builder.py", line 27, in build_model_from_cfg
    return build_from_cfg(cfg, registry, default_args)
  File "/anaconda/envs/jupyter_env/lib/python3.8/site-packages/mmcv/utils/registry.py", line 55, in build_from_cfg
    raise type(e)(f'{obj_cls.__name__}: {e}')
OSError: BEiTAdapter: pretrained/beit_large_patch16_224_pt22k_ft22k.pth is not a checkpoint file

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 215, in <module>
    main()
  File "train.py", line 171, in main
    model = build_segmentor(
  File "/anaconda/envs/jupyter_env/lib/python3.8/site-packages/mmseg/models/builder.py", line 48, in build_segmentor
    return SEGMENTORS.build(
  File "/anaconda/envs/jupyter_env/lib/python3.8/site-packages/mmcv/utils/registry.py", line 212, in build
    return self.build_func(*args, **kwargs, registry=self)
  File "/anaconda/envs/jupyter_env/lib/python3.8/site-packages/mmcv/cnn/builder.py", line 27, in build_model_from_cfg
    return build_from_cfg(cfg, registry, default_args)
  File "/anaconda/envs/jupyter_env/lib/python3.8/site-packages/mmcv/utils/registry.py", line 55, in build_from_cfg
    raise type(e)(f'{obj_cls.__name__}: {e}')
OSError: EncoderDecoderMask2Former: BEiTAdapter: pretrained/beit_large_patch16_224_pt22k_ft22k.pth is not a checkpoint file

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 9630) of binary: /anaconda/envs/jupyter_env/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=1
  master_addr=127.0.0.1
  master_port=29300
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1]
  role_ranks=[0, 1]
  global_ranks=[0, 1]
  role_world_sizes=[2, 2]
  global_world_sizes=[2, 2]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_qeec92im/none_cwad05sl/attempt_1/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_qeec92im/none_cwad05sl/attempt_1/1/error.json

```

nvidia-smi:

```
+-------------------------------+----------------------+----------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000001:00:00.0 Off |                    0 |
| N/A   23C    P0    26W / 250W |      4MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000002:00:00.0 Off |                  Off |
| N/A   22C    P0    23W / 250W |      4MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

db141 commented 1 year ago

I found the link to 'beit_large_patch16_224_pt22k_ft22k.pth' in the config, downloaded it, and placed it in the pretrained folder. Now retraining starts and CUDA runs out of memory... which I expected. :)
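
For anyone hitting the same error, a minimal sketch of fetching the weights into the folder the config expects (the checkpoint URL is deliberately left as a placeholder here; use the link given in the config):

```python
# Hypothetical download helper; replace CKPT_URL with the link referenced in
# the config for beit_large_patch16_224_pt22k_ft22k.pth.
import os
import urllib.request

CKPT_URL = "<link-from-the-config>"  # placeholder, not reproduced here
DEST = "pretrained/beit_large_patch16_224_pt22k_ft22k.pth"

os.makedirs(os.path.dirname(DEST), exist_ok=True)  # ensure pretrained/ exists
urllib.request.urlretrieve(CKPT_URL, DEST)         # download the checkpoint
```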

According to your paper you used A100 GPUs, and in the repo you mention using 8 GPUs on one node. Does that mean you trained with 8 × A100?

czczup commented 1 year ago

> I found the link to 'beit_large_patch16_224_pt22k_ft22k.pth' in the config, downloaded it, and placed it in the pretrained folder. Now retraining starts and CUDA runs out of memory... which I expected. :)
>
> According to your paper you used A100 GPUs, and in the repo you mention using 8 GPUs on one node. Does that mean you trained with 8 × A100?

Yes, we used 8 × A100 GPUs to train the model.
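
For readers trying this on GPUs with less memory than an A100, a hedged sketch of a child config that only lowers the per-GPU batch size; the file name and values are assumptions, it is not guaranteed to fit on a 16 GB V100, and a smaller effective batch size generally changes the reproduced results:

```python
# Hypothetical low-memory child config, e.g. saved next to the original as
# mask2former_beit_adapter_large_896_80k_cityscapes_ss_lowmem.py.
# It inherits everything from the released config and overrides only the
# data loader settings.
_base_ = ['./mask2former_beit_adapter_large_896_80k_cityscapes_ss.py']

data = dict(samples_per_gpu=1, workers_per_gpu=1)  # one image per GPU
```

Smaller crops, gradient checkpointing, or mixed precision are the other usual levers if this is still too large.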