Open flycser opened 2 years ago
Hi, have you solved this problem yet? I met a same one.
Update your torch version and try this command: torchrun --nproc-per-node 2 --nnodes 2 --node_rank 0 --master_addr 'xxxxxx' --master_port 12345 fairseq_cli/hydra_train.py xxxxxxxxxxxxxx
@Dawn-970 Hi, I have run into the same issue. Has it been solved?
It's so difficult!
Use python -m torch.distributed.launch --use_env and set cfg.distributed_training.device_id from os.environ['LOCAL_RANK'].
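A minimal sketch of the workaround above: with --use_env, the launcher exports LOCAL_RANK as an environment variable instead of appending --local_rank=N to sys.argv, so hydra never sees an unknown "--" flag. The get_local_rank helper name is mine; cfg.distributed_training.device_id is the fairseq config field mentioned in the comment.

```python
import os

# With `python -m torch.distributed.launch --use_env ...` (or torchrun),
# the launcher sets the LOCAL_RANK environment variable rather than
# passing `--local_rank=N` on the command line.
def get_local_rank() -> int:
    """Read the local rank exported by the launcher; default to 0."""
    return int(os.environ.get("LOCAL_RANK", "0"))

# In fairseq you would then assign it to the hydra config, e.g.:
# cfg.distributed_training.device_id = get_local_rank()
```

This keeps the rank out of argv entirely, which is why hydra's override parser no longer raises an "unrecognized argument" error.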
🐛 Bug
When I ran a model in distributed mode (2 nodes, each node with 2 GPUs) via hydra_train.py, hydra could not accept arguments starting with "--", while torch.distributed.launch passes an argument "--local_rank=0" to each worker, which raises an "unrecognized arguments" error. When I ran the model via the ordinary command line it worked, because argparse recognizes arguments starting with "--".
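The mismatch described above can be illustrated with a small sketch: the classic argparse-based fairseq CLI accepts "--"-style flags, whereas hydra expects plain key=value overrides (e.g. distributed_training.device_id=0) and rejects "--local_rank=0". The parser below is illustrative, not fairseq's actual parser.

```python
import argparse

# argparse (the classic fairseq CLI style) accepts "--"-prefixed flags:
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args, unknown = parser.parse_known_args(["--local_rank=0"])
# args.local_rank is now 0; the launcher-injected flag is consumed cleanly.

# hydra's override grammar, by contrast, takes `key=value` tokens with no
# leading "--", so the same injected "--local_rank=0" is an unrecognized
# argument to hydra_train.py and aborts the run.
```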
To Reproduce
Steps to reproduce the behavior (always include the command you ran):
Code sample
Expected behavior
Environment
How you installed fairseq (pip, source):
Additional context