System Info / 系統信息
deepspeed 0.14.0
triton 2.1.0
Installed torch-2.2.1+cu121-cp311-cp311-win_amd64.whl
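A quick way to confirm which distributed backends this wheel actually ships (a minimal diagnostic sketch added for context, not part of the finetune demo):

import torch
import torch.distributed as dist

print(torch.__version__)          # expected: 2.2.1+cu121
print(torch.version.cuda)         # expected: 12.1
print(torch.cuda.is_available())  # True if the GPU and driver are set up
print(dist.is_nccl_available())   # False on the Windows wheels, NCCL is not built in
print(dist.is_gloo_available())   # gloo is the backend shipped on Windows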
Who can help? / 谁可以帮助到您?
finetune_demo: @1049451037
Information / 问题信息
[X] The official example scripts / 官方的示例脚本
[ ] My own modified scripts / 我自己修改的脚本和任务
Reproduction / 复现过程
[2024-07-30 17:30:18,378] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-07-30 17:30:23,857] [WARNING] No training data specified
[2024-07-30 17:30:23,857] [WARNING] No train_iters (recommended) or epochs specified, use default 10k iters.
[2024-07-30 17:30:23,857] [INFO] using world size: 1 and model-parallel size: 1
[2024-07-30 17:30:23,857] [INFO] > padded vocab (size: 100) with 28 dummy tokens (new size: 128)
Traceback (most recent call last):
File "D:\PycharmProjects\CogVLM-main\finetune_demo\finetune_cogagent_demo.py", line 260, in
args = get_args(args_list)
^^^^^^^^^^^^^^^^^^^
File "D:\conda3\envs\cogvlm\Lib\site-packages\sat\arguments.py", line 442, in get_args
initialize_distributed(args)
File "D:\conda3\envs\cogvlm\Lib\site-packages\sat\arguments.py", line 513, in initialize_distributed
torch.distributed.init_process_group(
File "D:\conda3\envs\cogvlm\Lib\site-packages\torch\distributed\c10d_logger.py", line 86, in wrapper
func_return = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "D:\conda3\envs\cogvlm\Lib\site-packages\torch\distributed\distributed_c10d.py", line 1184, in init_process_group
default_pg, _ = _new_process_group_helper(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\conda3\envs\cogvlm\Lib\site-packages\torch\distributed\distributed_c10d.py", line 1302, in _new_process_group_helper
raise RuntimeError("Distributed package doesn't have NCCL built in")
RuntimeError: Distributed package doesn't have NCCL built in
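For context, the error comes from torch.distributed.init_process_group being asked for the NCCL backend, which the Windows wheels do not include. The same single-process initialization done with the gloo backend (which is built into the Windows wheels) runs without this error. A minimal sketch, not taken from the sat package; the rendezvous address is a hypothetical local one:

import torch.distributed as dist

def init_single_process_group():
    # NCCL is not built into the Windows wheels of PyTorch, so fall back to gloo there.
    backend = "nccl" if dist.is_nccl_available() else "gloo"
    dist.init_process_group(
        backend=backend,
        init_method="tcp://127.0.0.1:29500",  # hypothetical local rendezvous address
        world_size=1,
        rank=0,
    )
    print(f"process group initialized with backend: {backend}")

if __name__ == "__main__":
    init_single_process_group()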
Expected behavior / 期待表现
The finetune demo should be able to initialize distributed training on Windows (where the PyTorch wheels do not include NCCL), for example by falling back to the gloo backend, instead of failing with this RuntimeError.