Closed: Kthulhut closed this issue 1 month ago
Actually, this argument (--local-rank or --local_rank) depends on your PyTorch version. For newer PyTorch versions, you should use --local_rank, as we do.
Thank you for the quick response.
Are you sure it's not the other way around? Maybe I have used a newer version, and newer versions should use --local-rank. I have tested with 2.4.1, 2.5, and the latest nightly build, all with the same error. See the following from the PyTorch documentation.
The PyTorch documentation for the latest stable release, version 2.5 (released a few hours ago), is the same as in the version 2.4 documentation:
https://pytorch.org/docs/stable/distributed.html#launch-utility
Changed in version 2.0.0: The launcher will pass the --local-rank=<rank> argument to your script. From PyTorch 2.0.0 onwards, the dashed --local-rank is preferred over the previously used underscored --local_rank. For backward compatibility, it may be necessary for users to handle both cases in their argument parsing code, i.e. to include both "--local-rank" and "--local_rank" in the argument parser. If only "--local_rank" is provided, the launcher will trigger an error: "error: unrecognized arguments: --local-rank=<rank>". For training code that only supports PyTorch 2.0.0+, including "--local-rank" should be sufficient.
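The same documentation section also points to an alternative that sidesteps the flag spelling entirely: reading the local rank from the LOCAL_RANK environment variable, which torchrun exports for each worker process. A minimal sketch (the fallback to 0 is an assumption for plain single-process runs):

```python
import os

# torchrun sets LOCAL_RANK for each worker process; fall back to 0
# so the script still runs as an ordinary single-process job.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
print(f"local rank: {local_rank}")
```

This makes the script independent of whether the launcher passes the dashed or underscored flag.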
Also from the documentation (maybe the solution):

import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--local-rank", "--local_rank", type=int)
args = parser.parse_args()
That means I might have used a newer version than you? Which version are you using? Is it possible for you to provide the versions of all the packages you're using? This would help avoid conflicts and ensure reproducibility.
Thank you for your reminder. Maybe I missed something. I used PyTorch 2.3 and torch.distributed.launch to launch the script. I have modified the argument to parser.add_argument("--local-rank", "--local_rank", type=int). Thank you for your advice.
You're welcome, and thanks. Now it's working out of the box with PyTorch 2.5, too.
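For anyone hitting the same error, the dual-alias fix can be checked in isolation with plain argparse (no PyTorch needed). A small sketch; argparse derives the attribute name from the first long option, turning dashes into underscores, so both spellings land in args.local_rank:

```python
import argparse

parser = argparse.ArgumentParser()
# Register both spellings as aliases of one option; the dest becomes
# "local_rank" (derived from "--local-rank", dashes -> underscores).
parser.add_argument("--local-rank", "--local_rank", default=0, type=int)

# The PyTorch 2.x launcher passes the dashed form...
args = parser.parse_args(["--local-rank=3"])
print(args.local_rank)  # 3

# ...while older launchers pass the underscored form.
args = parser.parse_args(["--local_rank=5"])
print(args.local_rank)  # 5
```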
Hi, first of all, congratulations on the release.
I tried to run the code locally on my system with the command: "sh scripts/train.sh 1 29500" and the Pascal dataset. I only have one local GPU on my PC and have no experience with distributed training.
With line 29 in the file unimatch_v2.py, "parser.add_argument('--local_rank', default=0, type=int)", I get the error described below. The error is triggered on line 34 "args = parser.parse_args()".
When I change the line to "parser.add_argument('--local-rank', default=0, type=int)", the code seems to work and the training starts. So the fix is to change "local_rank" to "local-rank".
The line "unimatch_v2.py: error: unrecognized arguments: --local-rank=0" from the error log led me to change the name to "local-rank".
Is this a typo in the code, or have I made a mistake elsewhere in my system configuration, perhaps related to distributed training? Could the change possibly have a negative impact on other parts of the code?
The Error Output: