Batch size > 1 got runtime error

XLabs-AI / x-flux

Apache License 2.0


torch.Size([4, 4096, 64]) torch.Size([4, 4096, 64]) torch.Size([4])
[rank2]: Traceback (most recent call last):
[rank2]:   File "/home/ubuntu/flux_training/train_flux_lora_deepspeed.py", line 302, in <module>
[rank2]:     main()
[rank2]:   File "/home/ubuntu/flux_training/train_flux_lora_deepspeed.py", line 227, in main
[rank2]:     x_t = (1 - t) * x_1 + t * x_0
[rank2]: RuntimeError: The size of tensor a (4) must match the size of tensor b (64) at non-singleton dimension 2

Batch size = 4 and gradient accumulation = 4, with 4 GPUs.

Have you identified the cause of the issue and found a solution?

[rank0]:     main()
[rank0]:   File "/mnt/bn/xuqin-lq/workspace/x-flux/train_flux_lora_deepspeed.py", line 241, in main
[rank0]:     x_t = (1 - t) * x_1 + t * x_0
[rank0]: RuntimeError: The size of tensor a (2) must match the size of tensor b (64) at non-singleton dimension 2
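
Judging from the shapes printed in the first report, this looks like a broadcasting problem: `t` has shape `[4]`, while the other two tensors have shape `[4, 4096, 64]`. PyTorch aligns shapes starting from the last dimension, so the 4 in `t` is compared with the 64 and not with the batch dimension. With batch size 1, `t` has shape `[1]` and broadcasts without complaint, which would explain why the error only shows up for larger batches. Below is a minimal sketch of one possible fix, assuming `t` holds one timestep per sample. The helper name `interpolate` is made up for illustration and is not part of the x-flux code.

```python
import torch


def interpolate(x_1: torch.Tensor, x_0: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Blend x_1 and x_0 using a per-sample weight t of shape [bs].

    Hypothetical helper: it wraps the expression from the traceback and
    assumes t carries exactly one value per batch element.
    """
    # Append singleton dims so t becomes [bs, 1, 1] and broadcasts over the
    # token and channel dimensions instead of clashing with the last one.
    t = t.view(-1, *([1] * (x_1.dim() - 1)))
    return (1 - t) * x_1 + t * x_0


bs = 4
x_1 = torch.randn(bs, 4096, 64)  # same shapes as in the report
x_0 = torch.randn(bs, 4096, 64)
t = torch.rand(bs)               # shape [4], as in the report

x_t = interpolate(x_1, x_0, t)
print(x_t.shape)
```

In the training script itself, the equivalent change would be to reshape `t` (for example with `t.view(-1, 1, 1)` or `t[:, None, None]`) only for this interpolation line, keeping the original `[bs]` tensor for any later call that expects it in that shape.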

Batch size > 1 got runtime error #39