Westlake-AI / openmixup

CAIRI Supervised, Semi- and Self-Supervised Visual Representation Learning Toolbox and Benchmark
https://openmixup.readthedocs.io
Apache License 2.0

[Bug] unrecognized arguments: --local-rank=0 #48

Closed leon-costa closed 1 year ago

leon-costa commented 1 year ago

Describe the bug

I followed the installation instructions in https://github.com/Westlake-AI/openmixup/blob/main/docs/en/install.md#install-openmixup and everything went well (except Apex but it's optional).

When I run the first Getting Started example command I get the following error:

$ bash tools/dist_train.sh configs/classification/imagenet/resnet/resnet50_rsb_a3_sz160_8xb256_ep100.py 1 --auto_resume
/home/leon/.conda/envs/openmixup2/lib/python3.8/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  warnings.warn(
/home/leon/.conda/envs/openmixup2/lib/python3.8/site-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
  warnings.warn(
Please install scikit-image.
Please install scikit-image with PyPi
usage: train.py [-h] [--work_dir WORK_DIR] [--resume_from RESUME_FROM] [--auto_resume] [--pretrained PRETRAINED] [--load_checkpoint LOAD_CHECKPOINT]
                [--gpus GPUS | --gpu_ids GPU_IDS [GPU_IDS ...] | --gpu-id GPU_ID] [--seed SEED] [--diff-seed] [--deterministic]
                [--cfg-options CFG_OPTIONS [CFG_OPTIONS ...]] [--launcher {none,pytorch,slurm,mpi}] [--local_rank LOCAL_RANK] [--port PORT]
                config
train.py: error: unrecognized arguments: --local-rank=0
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 865532) of binary: /home/leon/.conda/envs/openmixup2/bin/python
Traceback (most recent call last):
  File "/home/leon/.conda/envs/openmixup2/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/leon/.conda/envs/openmixup2/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/leon/.conda/envs/openmixup2/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/leon/.conda/envs/openmixup2/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/leon/.conda/envs/openmixup2/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/leon/.conda/envs/openmixup2/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/leon/.conda/envs/openmixup2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/leon/.conda/envs/openmixup2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
tools/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-05-11_18:06:40
  host      : leon
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 865532)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

To Reproduce

Follow the installation instructions and execute the example command as described above.

Post related information

  1. The output of pip list | grep "openmixup\|^torch"
openmixup               0.2.7       /home/leon/projects/openmixup
torch                   2.0.1
torchaudio              2.0.2
torchgeometry           0.1.2
torchvision             0.15.2
  2. Your config file if you modified it or created a new one.

No modified config. I just changed the 8 gpus to 1 gpu in the example command.

Additional context

I initially tried to install everything by following the instructions here but the last command python setup.py develop failed with this error:

AttributeError: module 'cv2' has no attribute '__version__'

Lupin1998 commented 1 year ago

Hi, @leon-costa, sorry for the late reply. I ran bash tools/dist_train.sh configs/classification/imagenet/resnet/resnet50_rsb_a3_sz160_8xb256_ep100.py 1 --auto_resume and could not reproduce the error unrecognized arguments: --local-rank=0. I suggest running OpenMixup with PyTorch<=1.13.1 and checking that you are using the latest source code of OpenMixup; with that setup I have not found any errors in installation or DDP training. Currently, OpenMixup has some errors when running with PyTorch==2.0.1. You can try the following script:

conda create -n openmixup python=3.8 pytorch=1.13 cudatoolkit=11.6 torchvision -c pytorch -y
conda activate openmixup
pip install openmim
mim install mmcv-full
pip install opencv-python
git clone https://github.com/Westlake-AI/openmixup.git
cd openmixup
python setup.py develop

leon-costa commented 1 year ago

Hi. Thank you for your reply.

Yes I'm on the latest commit on the main branch.

I tried your commands, with the package versions pinned explicitly:

conda create -n openmixup python=3.8 pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.6 -c pytorch -c nvidia -y
conda activate openmixup
mim install mmcv-full
pip install opencv-python==4.5.4.60
git clone https://github.com/Westlake-AI/openmixup.git
cd openmixup
python setup.py develop

And it worked, I was able to start a training.
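
To confirm that an environment matches the one that worked here, the installed PyTorch version can be compared against the PyTorch<=1.13.1 bound suggested above. The following is only a sketch: version_tuple and torch_is_supported are hypothetical helper names, not part of OpenMixup, and the 1.13.1 bound is taken from this thread rather than from any official compatibility table.

```python
import re


def version_tuple(version):
    """Turn a string such as '1.13.1+cu116' into (1, 13, 1).

    The local build suffix after '+' is dropped and only the first
    three numeric components are kept.
    """
    numbers = re.findall(r"\d+", version.split("+")[0])
    return tuple(int(n) for n in numbers[:3])


def torch_is_supported(version, max_supported="1.13.1"):
    """Return True if `version` is not newer than `max_supported`.

    The default bound is an assumption based on the advice in this thread.
    """
    return version_tuple(version) <= version_tuple(max_supported)
```

In an activated environment this could be called as torch_is_supported(torch.__version__) after import torch. The cv2 error from the original report can be checked in a similar way with getattr(cv2, "__version__", None).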

Lupin1998 commented 1 year ago

Thanks for your detailed solution, @leon-costa! 👍 We will add a reference to this issue in install.md. To summarize, the main problems come from the PyTorch installation and the version of opencv-python.
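
For readers who cannot downgrade PyTorch, the log in the original report hints at another possible workaround. The usage message shows that train.py only defines --local_rank, while the launcher passed --local-rank=0. The sketch below uses a standalone argparse parser that accepts both spellings and falls back to the LOCAL_RANK environment variable named in the deprecation warning. parse_local_rank is a hypothetical helper, not OpenMixup's actual code, and this is not the fix adopted in this thread. Since the maintainer reports other errors with PyTorch==2.0.1, this change alone may not be enough.

```python
import argparse
import os


def parse_local_rank(argv=None):
    """Return the local rank from the command line or the environment.

    Hypothetical helper for illustration; not taken from tools/train.py.
    """
    parser = argparse.ArgumentParser(description="local-rank parsing sketch")
    # Register both spellings on one argument. argparse derives the
    # destination name from the first long option, so either flag is
    # stored in args.local_rank.
    parser.add_argument("--local_rank", "--local-rank", type=int, default=None)
    # parse_known_args leaves unrelated arguments alone instead of exiting.
    args, _unknown = parser.parse_known_args(argv)
    if args.local_rank is None:
        # No flag was passed: read the value from LOCAL_RANK if the
        # launcher set it, otherwise default to 0.
        return int(os.environ.get("LOCAL_RANK", 0))
    return args.local_rank
```

With this parser, parse_local_rank(["--local-rank=0"]) and parse_local_rank(["--local_rank", "0"]) both return 0, so the script no longer depends on which spelling the launcher uses.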