cocktailpeanut / fluxgym

Dead simple FLUX LoRA training UI with LOW VRAM support

Underutilized GPU #160

Open GEGR667 opened 3 weeks ago

GEGR667 commented 3 weeks ago

Hi - more of a question about GPU utilization than an issue per se. (Congrats on the GUI, by the way - way more user friendly!!)

After some initial teething problems, I've gotten Fluxgym to run without stalling, but I wonder if it's underutilizing the GPU. I'm using a 12gb 3060 with a 12th-gen i5 and 64gb of system RAM, training with Flux.1 Dev. I'm under no illusion of speed considering it's a 3060. I'm training a LoRA with 14 images cropped to 512x512, default Fluxgym settings, and the 12gb VRAM option.

In Windows Task Manager, the CPU is being used up to ~45%, while the GPU (shown as "GPU 1 NVIDIA GeForce RTX 3050", running as an e-GPU over Thunderbolt 4) sits between 1% and 8%, with GPU VRAM usage between 1gb and 7.0gb - seems a bit underutilized, no?

I haven't messed around with FP8, a smaller model (i.e. Flux Schnell or NF4), or other advanced settings, so speed is what it is (it's a 3060 with 12gb VRAM running Flux.1 Dev, after all). But are the GPU and VRAM being underutilized, and is there any setting that would use more of the GPU and VRAM processing power to speed things up?

Sorry if there's a better place to post these kinds of questions. I saw some posts here about the initial teething problems I was having, so I figured the fluxgym GitHub issues might be a decent place to ask.
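One thing worth checking before concluding the GPU is idle: Windows Task Manager's default GPU graphs track the 3D and Copy engines, so CUDA compute work can read as near-0% even while training is running flat out. Switching one of the Task Manager graphs to "Cuda", or polling `nvidia-smi` directly, gives a truer picture. A minimal sketch below polls `nvidia-smi` and parses its CSV output; the helper names (`parse_gpu_stats`, `poll_once`) are my own, not part of fluxgym:

```python
# Sketch: read real compute utilization from nvidia-smi instead of
# Task Manager's default (3D-engine) graph. Helper names are hypothetical.
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]

def parse_gpu_stats(csv_line: str) -> dict:
    """Turn one CSV line like '87, 6901, 12288' into a stats dict."""
    util, used, total = (int(x.strip()) for x in csv_line.split(","))
    return {"util_pct": util, "vram_used_mib": used, "vram_total_mib": total}

def poll_once() -> dict:
    """Query the first GPU once; requires the NVIDIA driver on PATH."""
    out = subprocess.check_output(QUERY, text=True).strip().splitlines()[0]
    return parse_gpu_stats(out)
```

If `utilization.gpu` reported here is also low, the bottleneck is likely elsewhere (e.g. data loading, CPU offload, or the Thunderbolt link to the e-GPU) rather than Task Manager misreporting.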

GEGR667 commented 3 weeks ago

Here's an update for those interested. I restarted the process because I had set the sample-image-every-N-steps value too low, and the sample images were taking up more time than they were worth. I also reduced the training images to just a face at 512x512 (from a face and pose at 512x512), so likely less for the LoRA to learn. 17 images are being trained on now, up from 14 - so more images, but fewer sample images interrupting processing time (again, this is a 3060 12gb). CPU and GPU / GPU memory usage are the same as before.

I've read something about not training the UNet to reduce time, though it's counterintuitive (to me) since that's the model the LoRA is being trained from. I'm still curious if there's another setting to better optimize processing speed. I could also have used a smaller model (i.e. NF4), which might have sped things up. Thoughts?

After about 9 hours, I'm halfway through the training, with two LoRA checkpoints saved (8 out of 16 epochs, I believe). Also, I noticed that the Fluxgym web UI stops updating after a certain point - no more sample images or logs appear in the UI, though the training run continues to write sample images to the file directory, which is how I know the process hasn't stalled.

Going on 10 hours now; I estimate I'll be done after 18 to 20 hours. Has anybody here trained a decent LoRA on a 3060 / 12gb VRAM / 64gb system RAM in considerably less time? Better prompting and fewer epochs?
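For context on the "not using UNET" tip: fluxgym drives kohya-ss sd-scripts under the hood, and the usual advice is the opposite of skipping the UNet - it's to train *only* the UNet LoRA and leave the text encoders frozen, plus cache whatever can be precomputed. Below is a hedged sketch of the relevant sd-scripts flags (a config fragment, not a full command - flag availability depends on the sd-scripts version bundled with your fluxgym install, so verify with `python flux_train_network.py --help`):

```shell
# Hypothetical fragment; flag names from kohya-ss sd-scripts' FLUX branch.
# Verify against your install: python flux_train_network.py --help

# Load the base FLUX weights in FP8, freeing VRAM:
#   --fp8_base
# Encode dataset latents once and cache them to disk, instead of every epoch:
#   --cache_latents_to_disk
# Precompute text-encoder outputs so CLIP/T5 aren't re-run each step:
#   --cache_text_encoder_outputs
# Train only the UNet LoRA; text encoders stay frozen (this is the speedup
# the "not using UNET" advice is probably garbling):
#   --network_train_unet_only
```

With only 1-8% compute utilization, though, the bigger win may be finding what's stalling the GPU (data loading or CPU offload) rather than shrinking the per-step work.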