XLabs-AI / x-flux

Batch size > 1 got runtime error #39

Open tristanwqy opened 2 months ago

tristanwqy commented 2 months ago
torch.Size([4, 4096, 64]) torch.Size([4, 4096, 64]) torch.Size([4])
[rank2]: Traceback (most recent call last):
[rank2]:   File "/home/ubuntu/flux_training/train_flux_lora_deepspeed.py", line 302, in <module>
[rank2]:     main()
[rank2]:   File "/home/ubuntu/flux_training/train_flux_lora_deepspeed.py", line 227, in main
[rank2]:     x_t = (1 - t) * x_1 + t * x_0
[rank2]: RuntimeError: The size of tensor a (4) must match the size of tensor b (64) at non-singleton dimension 2

Batch size = 4 and gradient accumulation = 4, with 4 GPUs.

log26 commented 2 months ago
Have you identified the cause of the issue and found a solution?

[rank0]:     main()
[rank0]:   File "/mnt/bn/xuqin-lq/workspace/x-flux/train_flux_lora_deepspeed.py", line 241, in main
[rank0]:     x_t = (1 - t) * x_1 + t * x_0
[rank0]: RuntimeError: The size of tensor a (2) must match the size of tensor b (64) at non-singleton dimension 2
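
The shapes printed before the traceback suggest a broadcasting mismatch: `x_1` and `x_0` are `(batch, 4096, 64)` while `t` is `(batch,)`, so PyTorch aligns `t` with the last dimension (64) rather than the batch dimension. With batch size 1 a `(1,)` tensor broadcasts fine, which would explain why only larger batches fail. A minimal sketch of one possible workaround, assuming `t` holds one timestep per sample; the helper name `interpolate` is hypothetical and this is not necessarily how the repository resolved the issue:

```python
import torch


def interpolate(x_1: torch.Tensor, x_0: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Blend x_1 and x_0 using one timestep per sample.

    x_1, x_0: (batch, seq_len, channels); t: (batch,).
    Hypothetical helper, not part of x-flux.
    """
    # Reshape t from (batch,) to (batch, 1, 1) so it broadcasts over the
    # sequence and channel dimensions instead of lining up with channels.
    t = t.view(-1, *([1] * (x_1.dim() - 1)))
    return (1 - t) * x_1 + t * x_0


bs = 4
x_1 = torch.randn(bs, 4096, 64)
x_0 = torch.randn(bs, 4096, 64)
t = torch.rand(bs)

# The original expression raises because (4,) is matched against the last dim (64).
try:
    _ = (1 - t) * x_1 + t * x_0
    failed = False
except RuntimeError:
    failed = True

x_t = interpolate(x_1, x_0, t)
print(failed, tuple(x_t.shape))
```

In the training script the equivalent change would be to reshape `t` (for example `t[:, None, None]`) before the line that computes `x_t`. Note that a batch size equal to the channel count (64 here) would not raise at all and would silently broadcast along the wrong axis, so the reshape is worth making explicit either way.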