bmaltais / kohya_ss

Apache License 2.0

Technical problem apparently. #2933

Open AbstractEyes opened 3 weeks ago

AbstractEyes commented 3 weeks ago
100%|██████████| 16820/16820 [00:16<00:00, 1035.63it/s]
[rank3]:[W1028 04:25:49.100816810 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 3] using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
100%|██████████| 16820/16820 [00:16<00:00, 1035.09it/s]
[rank1]:[W1028 04:25:50.942893427 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
100%|██████████| 16820/16820 [00:16<00:00, 1045.14it/s]
[rank2]:[W1028 04:25:51.790205331 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 2] using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
[rank0]:[E1028 04:35:51.157226699 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600003 milliseconds before timing out.
[rank2]:[E1028 04:35:51.164827661 ProcessGroupNCCL.cpp:616] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600017 milliseconds before timing out.
[rank2]:[E1028 04:35:51.169405690 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank0]:[E1028 04:35:51.169436499 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E1028 04:35:51.180861998 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600022 milliseconds before timing out.
[rank1]:[E1028 04:35:51.181138092 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank3]:[E1028 04:35:51.207392488 ProcessGroupNCCL.cpp:616] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600054 milliseconds before timing out.
[rank3]:[E1028 04:35:51.207669634 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 3] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank0]:[E1028 04:35:52.386164949 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 0] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank0]:[E1028 04:35:52.386179797 ProcessGroupNCCL.cpp:630] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E1028 04:35:52.386184696 ProcessGroupNCCL.cpp:636] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E1028 04:35:52.391365570 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600003 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7344490e6446 in /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7343fe3cc762 in /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7343fe3d3ba3 in /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7343fe3d560d in /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x73444924d5c0 in /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x734449aa0ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x734449b31bf4 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600003 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7344490e6446 in /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7343fe3cc762 in /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7343fe3d3ba3 in /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7343fe3d560d in /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x73444924d5c0 in /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x734449aa0ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x734449b31bf4 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7344490e6446 in /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe42745 (0x7343fe042745 in /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x73444924d5c0 in /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x94ac3 (0x734449aa0ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: clone + 0x44 (0x734449b31bf4 in /lib/x86_64-linux-gnu/libc.so.6)

[rank2]:[E1028 04:35:52.535382062 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 2] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank2]:[E1028 04:35:52.535421116 ProcessGroupNCCL.cpp:630] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E1028 04:35:52.535427478 ProcessGroupNCCL.cpp:636] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E1028 04:35:52.536574798 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600017 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7b56b2d8d446 in /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7b56681cc762 in /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7b56681d3ba3 in /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7b56681d560d in /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7b56b2ef45c0 in /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7b56b3747ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7b56b37d8bf4 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 0 PG GUID 0(default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600017 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7b56b2d8d446 in /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7b56681cc762 in /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7b56681d3ba3 in /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7b56681d560d in /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7b56b2ef45c0 in /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7b56b3747ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7b56b37d8bf4 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7b56b2d8d446 in /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe42745 (0x7b5667e42745 in /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7b56b2ef45c0 in /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x94ac3 (0x7b56b3747ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: clone + 0x44 (0x7b56b37d8bf4 in /lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[E1028 04:35:52.551013263 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 1] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E1028 04:35:52.551048550 ProcessGroupNCCL.cpp:630] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E1028 04:35:52.551057968 ProcessGroupNCCL.cpp:636] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E1028 04:35:52.552193445 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600022 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7811401c0446 in /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7810f55cc762 in /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7810f55d3ba3 in /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7810f55d560d in /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7811403275c0 in /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x781140b7aac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x781140c0bbf4 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600022 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7811401c0446 in /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7810f55cc762 in /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7810f55d3ba3 in /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7810f55d560d in /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7811403275c0 in /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x781140b7aac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x781140c0bbf4 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7811401c0446 in /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe42745 (0x7810f5242745 in /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7811403275c0 in /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x94ac3 (0x781140b7aac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: clone + 0x44 (0x781140c0bbf4 in /lib/x86_64-linux-gnu/libc.so.6)
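
Two things in this log are actionable: the barrier warning (each rank's GPU mapping is unknown, so NCCL has to guess it) and the watchdog timeout (the very first ALLREDUCE, a 1-element collective, never completed within the default 600000 ms). Below is a minimal sketch of the two knobs those messages point at, for a bare torch.distributed setup rather than the accelerate-managed one kohya_ss actually uses; the LOCAL_RANK handling and the 30-minute timeout are illustrative assumptions, not the project's real configuration.

```python
# Minimal sketch of the fixes the warning/timeout messages suggest,
# assuming a torchrun-style launch where RANK/WORLD_SIZE/LOCAL_RANK
# and MASTER_ADDR/MASTER_PORT are set in the environment.
import os
from datetime import timedelta

import torch
import torch.distributed as dist

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)

# device_id pins this rank to its GPU, so barrier() no longer has to guess
# the rank-to-GPU mapping; timeout raises the default 10-minute watchdog limit.
dist.init_process_group(
    backend="nccl",
    device_id=torch.device(f"cuda:{local_rank}"),
    timeout=timedelta(minutes=30),
)

# Alternatively, the device can be forced per collective call:
dist.barrier(device_ids=[local_rank])

dist.destroy_process_group()
```

Raising the timeout only helps if the collective is genuinely slow; if one rank never reaches it at all (for example because of a rendezvous misconfiguration, which is what the reply below addresses), the job still hangs and dies the same way.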
AbstractEyes commented 3 weeks ago
100%|██████████| 16820/16820 [00:13<00:00, 1227.28it/s]
[rank0]:[W1028 04:48:19.843866299 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
100%|██████████| 16820/16820 [00:13<00:00, 1239.79it/s]
[rank2]:[W1028 04:48:19.034140532 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 2] using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
100%|██████████| 16820/16820 [00:13<00:00, 1238.39it/s]
[rank3]:[W1028 04:48:19.062080939 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 3] using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
100%|██████████| 16820/16820 [00:13<00:00, 1241.05it/s]
[rank1]:[W1028 04:48:19.067909403 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
kohya-ss commented 3 weeks ago

When using multi-GPU training on Linux (or WSL), do not specify the --rdzv_backend=c10d option. If it still does not work after removing this option, could you please share your command-line options?
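
(For context: kohya_ss starts multi-GPU runs through accelerate, so in practice this means launching with something like `accelerate launch --multi_gpu --num_processes=4 train_network.py ...` and leaving --rdzv_backend=c10d out entirely; the script name and process count here are only placeholders, not the reporter's actual command.)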