LiheYoung / UniMatch-V2

UniMatch V2: Pushing the Limit of Semi-Supervised Semantic Segmentation
https://arxiv.org/abs/2410.10777
MIT License

Bug? Typo in parser.add_argument local-rank instead of local_rank #2

Closed: Kthulhut closed this issue 1 month ago

Kthulhut commented 1 month ago

Hi, first of all, congratulations on the release.

I tried to run the code locally on my system with the command: "sh scripts/train.sh 1 29500" and the Pascal dataset. I only have one local GPU on my PC and have no experience with distributed training.

With line 29 in the file unimatch_v2.py, "parser.add_argument('--local_rank', default=0, type=int)", I get the error described below. The error is triggered on line 34 "args = parser.parse_args()".

When I change the line to "parser.add_argument('--local-rank', default=0, type=int)", the code seems to work and training starts. So the fix is changing "local_rank" to "local-rank".

The line "unimatch_v2.py: error: unrecognized arguments: --local-rank=0" from the error log led me to change the name to "local-rank".

Is this a typo in the code, or have I made a mistake elsewhere in my system configuration, perhaps related to distributed training? Could the change have a negative impact on other parts of the code?
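(For anyone hitting the same error, here is a minimal standalone sketch, independent of the repo's code, of why the two spellings behave differently: argparse matches option strings literally, and only converts dashes to underscores when building the attribute name.)

```python
import argparse

# A parser that only registers '--local_rank' rejects the '--local-rank=0'
# spelling that the launcher passes, reproducing the reported error.
parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', default=0, type=int)
try:
    parser.parse_args(['--local-rank=0'])
except SystemExit:
    print('rejected: unrecognized arguments: --local-rank=0')

# Registering both spellings accepts either; the value lands on
# args.local_rank because argparse turns dashes into underscores for dest.
parser = argparse.ArgumentParser()
parser.add_argument('--local-rank', '--local_rank', default=0, type=int)
args = parser.parse_args(['--local-rank=0'])
print(args.local_rank)
```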

The Error Output:

/bin/bash /home/mysystem/PycharmProjects/UniMatch-V2/scripts/train.sh 1 29500
/home/mysystem/PycharmProjects/UniMatch-V2/.venv/lib/python3.12/site-packages/torch/distributed/launch.py:208: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  main()
xFormers not available
xFormers not available

# Added these outputs in the code for debugging:
CUDA available: True   # print("CUDA available:", torch.cuda.is_available())
Number of GPUs: 1      # print("Number of GPUs:", torch.cuda.device_count())
Current GPU: 0         # print("Current GPU:", torch.cuda.current_device())
LOCAL_RANK: 0          # print("LOCAL_RANK:", os.environ.get("LOCAL_RANK", "0"))

usage: unimatch_v2.py [-h] --config CONFIG --labeled-id-path LABELED_ID_PATH
                      --unlabeled-id-path UNLABELED_ID_PATH --save-path
                      SAVE_PATH [--local_rank LOCAL_RANK] [--port PORT]
unimatch_v2.py: error: unrecognized arguments: --local-rank=0
E1017 15:16:48.952000 140450266624128 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 2) local_rank: 0 (pid: 72605) of binary: /home/mysystem/PycharmProjects/UniMatch-V2/.venv/bin/python
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/mysystem/PycharmProjects/UniMatch-V2/.venv/lib/python3.12/site-packages/torch/distributed/launch.py", line 208, in <module>
    main()
  File "/home/mysystem/PycharmProjects/UniMatch-V2/.venv/lib/python3.12/site-packages/typing_extensions.py", line 2853, in wrapper
    return arg(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/mysystem/PycharmProjects/UniMatch-V2/.venv/lib/python3.12/site-packages/torch/distributed/launch.py", line 204, in main
    launch(args)
  File "/home/mysystem/PycharmProjects/UniMatch-V2/.venv/lib/python3.12/site-packages/torch/distributed/launch.py", line 189, in launch
    run(args)
  File "/home/mysystem/PycharmProjects/UniMatch-V2/.venv/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/mysystem/PycharmProjects/UniMatch-V2/.venv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mysystem/PycharmProjects/UniMatch-V2/.venv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
unimatch_v2.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-10-17_15:16:48
  host      : mysystem-System-Product-Name
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 72605)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
LiheYoung commented 1 month ago

Actually, this argument (--local-rank or --local_rank) depends on your PyTorch version. For newer PyTorch versions, you should use --local_rank as we do.

Kthulhut commented 1 month ago

Thank you for the quick response.

Are you sure it's not the other way around? Maybe I have used a newer version than you, and newer versions should use --local-rank. I have tested with 2.4.1, 2.5, and the latest nightly build, all with the same error. See the following from the PyTorch documentation.

The PyTorch documentation for the latest stable release, version 2.5 (released a few hours ago), is the same as in the version 2.4 documentation:

https://pytorch.org/docs/stable/distributed.html#launch-utility

Changed in version 2.0.0: The launcher will pass the --local-rank= argument to your script. From PyTorch 2.0.0 onwards, the dashed --local-rank is preferred over the previously used underscored --local_rank.

For backward compatibility, it may be necessary for users to handle both cases in their argument parsing code. This means including both "--local-rank" and "--local_rank" in the argument parser. If only "--local_rank" is provided, the launcher will trigger an error: "error: unrecognized arguments: --local-rank=". For training code that only supports PyTorch 2.0.0+, including "--local-rank" should be sufficient.

also from the documentation (a possible solution):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--local-rank", "--local_rank", type=int)
args = parser.parse_args()
```

That means I might have used a newer version than you? Which version are you using? Is it possible for you to provide the versions of all the packages you're using? This would help avoid conflicts and ensure reproducibility.

LiheYoung commented 1 month ago

Thank you for the reminder. Maybe I missed something. I used PyTorch 2.3 and torch.distributed.launch to launch the script. I have modified the argument to parser.add_argument("--local-rank", "--local_rank", type=int). Thanks for your advice.
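(For reference, the deprecation warning in the log suggests reading the rank from the environment instead of a CLI flag, which is how torchrun passes it. A minimal sketch combining both sources; the helper name is mine, not from the repo:)

```python
import argparse
import os

def get_local_rank(argv=None):
    """Prefer LOCAL_RANK from the environment (torchrun style); fall back
    to the command-line flag used by torch.distributed.launch."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--local-rank', '--local_rank', default=0, type=int)
    # parse_known_args ignores the script's other required arguments here.
    args, _ = parser.parse_known_args(argv)
    return int(os.environ.get('LOCAL_RANK', args.local_rank))
```

This way the same script works whether it is launched with torch.distributed.launch (flag) or torchrun (environment variable).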

Kthulhut commented 1 month ago

You're welcome, and thanks. Now it's working out of the box with PyTorch 2.5, too.