Running pretrain.py on a single GPU fails with: Default process group has not been initialized, please make sure to call init_process_group.

DLLXW / baby-llama2-chinese

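A repository for pretraining from scratch and then SFT-ing a small-parameter Chinese LLaMa2; a single 24 GB GPU is enough to train a chat-llama2 with basic Chinese question-answering ability.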

MIT License

2.42k stars, 296 forks

Running pretrain.py on a single GPU fails with: Default process group has not been initialized, please make sure to call init_process_group. #18

Open TristanShao opened 1 year ago

TristanShao commented 1 year ago

Is this because I am running the multi-GPU mode on a single GPU? I don't know how to change it, though.

TristanShao commented 1 year ago

Strange: when I print ddp, it is also False.

yiyiyichen1 commented 1 year ago

Add the following code to pretrain.py:

import torch.distributed as dist
dist.init_process_group('gloo', init_method='file:///tmp/somefile', rank=0, world_size=1)

vaderyang commented 12 months ago

Delete the if check around get_rank that raises the error and just save directly.
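
The failing check can also be guarded instead of deleted, so the same script keeps working under a multi-GPU launch. This is a minimal sketch, assuming the save step in pretrain.py is wrapped in `if torch.distributed.get_rank() == 0:` as described in this thread. `should_save_checkpoint` is a hypothetical helper, not part of the repository; it takes the `torch.distributed` module as an argument so the helper itself does not import torch.

```python
def should_save_checkpoint(dist) -> bool:
    """Return True on a single-process run, or on rank 0 of a distributed run.

    `dist` is expected to be the torch.distributed module (an assumption of
    this sketch; any object with the same three methods works).
    """
    if not (dist.is_available() and dist.is_initialized()):
        # Plain `python pretrain.py`: no process group exists, so calling
        # dist.get_rank() here is what raises the error from this issue.
        return True
    # Distributed run: only the first process writes the checkpoint.
    return dist.get_rank() == 0


# Intended use inside the training script (names other than the helper
# are placeholders):
#
#   import torch
#   import torch.distributed as dist
#   if should_save_checkpoint(dist):
#       torch.save(model.state_dict(), ckpt_path)
```

With this guard, the save runs exactly once in both launch modes, which is the behaviour the original rank check was presumably meant to give.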

Niculuse commented 12 months ago

For single-GPU training, comment out the following line:

train_sampler = torch.utils.data.distributed.DistributedSampler(train_ds)

Then set sampler to None here:

train_loader = torch.utils.data.DataLoader(
    train_ds,
    batch_size=batch_size,
    pin_memory=False,
    drop_last=False,
    shuffle=False,        
    num_workers=4,
    sampler=None
)

Vincent-ZHQ commented 11 months ago

For a single-GPU run, still use the multi-GPU launch command and just set the process-count argument to 1; don't run python pretrain.py directly.

van-68 commented 8 months ago

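> For a single-GPU run, still use the multi-GPU launch command and just set the process-count argument to 1; don't run python pretrain.py directly.

Hi, could you explain exactly how to run and configure this? Thanks.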

zerozhoujie commented 7 months ago

Add these lines right below the data-loading code:

if ddp:
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_ds)
else:
    train_sampler = None

Tonikroosliruhao commented 6 months ago

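> > For a single-GPU run, still use the multi-GPU launch command and just set the process-count argument to 1; don't run python pretrain.py directly.

> Hi, could you explain exactly how to run and configure this? Thanks.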

torchrun --standalone --nproc_per_node=1 pretrain.py
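
Why the launcher matters can be sketched as follows. torchrun exports RANK, LOCAL_RANK and WORLD_SIZE to every worker it starts, while a plain `python pretrain.py` sets none of them. Assuming pretrain.py decides whether it is in DDP mode from the RANK variable, as many training scripts of this style do, the check behaves roughly like the hypothetical `detect_ddp` below (the function name and the returned dictionary are illustrative, not taken from the repository).

```python
import os


def detect_ddp(env=os.environ) -> dict:
    """Report whether the process looks like it was started by torchrun."""
    rank = int(env.get("RANK", -1))
    if rank == -1:
        # No launcher variables: treat this as an ordinary one-process run.
        return {"ddp": False, "rank": 0, "local_rank": 0, "world_size": 1}
    return {
        "ddp": True,
        "rank": rank,
        "local_rank": int(env["LOCAL_RANK"]),
        "world_size": int(env["WORLD_SIZE"]),
    }


# Plain `python pretrain.py`: nothing is set, so ddp is False.
print(detect_ddp({}))

# Roughly what one worker sees under
# `torchrun --standalone --nproc_per_node=1 pretrain.py`.
print(detect_ddp({"RANK": "0", "LOCAL_RANK": "0", "WORLD_SIZE": "1"}))
```

Under that assumption, a direct `python` launch leaves ddp False, so the script never calls init_process_group, and any later unconditional call into torch.distributed fails with the error in this issue.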

Zha-Miku commented 5 months ago

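> Delete the if check around get_rank that raises the error and just save directly.

Tested, and it runs successfully.

What I did: simply comment out the line if torch.distributed.get_rank() == 0: and that was enough.

System: Windows 11, Python 3.10, 3050m GPU, cu118

What I don't quite understand is the actual cause.

The code was presumably written to run on Linux, and the command only works on Linux and not on Windows; I don't understand why.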

HildaM commented 5 months ago

I followed the changes suggested above and commented out the multi-GPU training code, but I still get the following error: 38e2c2c015ff0943f321e20f45529ae7