Questions Regarding Training Costs, Dtype Error on H100, and ControlNet Loss Behavior

zyyyz commented 2 months ago

Hi, I’d like to commend you all on this fantastic project—it's truly impressive. I have a few questions and would appreciate any guidance:

1. Could you share some details about the computational cost of training? Specifically, how much data was used, which GPUs were used, and how long training took?
2. When following the Accelerate Configuration Example, I hit an error while training on a 2×H100 setup:
   RuntimeError: mat1 and mat2 must have the same dtype, but got Half and BFloat16.
   To work around it, I changed dit.to(accelerator.device) (line 108 in train_flux_deepspeed_controlnet.py) to dit.to(accelerator.device, dtype=weight_dtype), after which training proceeded normally. I'm not sure what caused the mismatch. Do you have any insight into the root cause?
3. I'm training ControlNet on a small dataset of around 3,500 images. Even after 10k steps, the loss stays in the range of 0.5-0.6. Is this typical, or does it suggest that something is off?
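
For anyone hitting the same dtype error, here is a minimal sketch of how the mixed dtypes could be located before the first forward pass. It only relies on an iterable of (name, tensor-like) pairs with a .dtype attribute, which is what named_parameters() yields on a torch.nn.Module, so with the real model it would be called as mismatched_params(dit.named_parameters(), weight_dtype). The helper names and the stand-in parameter names below are hypothetical and are not part of the repo. This does not explain the root cause; it only shows which weights ended up in Half versus BFloat16.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_params_by_dtype(named_params):
    """Group parameter names by the string form of their dtype.

    `named_params` is any iterable of (name, tensor_like) pairs where
    tensor_like exposes a `.dtype` attribute, for example the result of
    `module.named_parameters()` on a torch.nn.Module.
    """
    groups = defaultdict(list)
    for name, param in named_params:
        groups[str(param.dtype)].append(name)
    return dict(groups)


def mismatched_params(named_params, expected_dtype):
    """Return the names of parameters whose dtype differs from `expected_dtype`."""
    expected = str(expected_dtype)
    return [name for name, param in named_params if str(param.dtype) != expected]


# Stand-in data so the sketch runs without torch or the real checkpoints.
# The parameter names and dtypes are made up for illustration only.
fake_params = [
    ("dit.layer.weight", SimpleNamespace(dtype="torch.float16")),
    ("controlnet.layer.weight", SimpleNamespace(dtype="torch.bfloat16")),
]

print(group_params_by_dtype(fake_params))
print(mismatched_params(fake_params, "torch.bfloat16"))
```

If the second call returns a non-empty list for the DiT, casting it explicitly, as in the dit.to(accelerator.device, dtype=weight_dtype) change above, is one way to make the dtypes agree.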

I really appreciate any help or advice you can offer. Thanks again for the amazing work you're doing!

bonlime commented 2 weeks ago

@zyyyz have you been able to successfully train a model using the code from this repo?

tianqyun111 commented 1 week ago

Is there any new progress? I trained a pose ControlNet on 50,000 images, but at inference, even with strength set to 1, the generated image shows no guidance effect. Can anyone help?

XLabs-AI / x-flux, issue #111