Problem with big white spaces when training own model

bmaltais / kohya_ss

Apache License 2.0


Hello,

I am currently training my own diffusion model. Everything works fine, but when I use sounds shorter than 5 seconds (so the sound occupies only a fraction of the training image) and try to recreate them with my model, the output fills the whole image instead of just a fraction, as in the original images. When I convert the output back to audio, it "sounds right" for the prompt, but stretched. I want the model to reproduce the big white spaces as well. I am using over 1,000 training images and have trained for over 500,000 iterations.

Example training data images:

Afrobeat clap, key A#, rhythmic, lively, festive, 1

EDM clap, , odyssey style, energetic, big, electronic - Kopie

Trap Clap, percussion, 808, energetic, modern, hiphop, 0

Example generated images (which should look like the training images):

OSV2_StationV0 017_20240229064404_060000_02

OSV2_StationV0 017_20240228232025_010000_02

Any idea how I can fix this or what I'm doing wrong? I am using kohya_ss webUI for training.

Thanks! :)

I'm sorry if I misunderstood, but my understanding is that you'd like to train an audio model? I'm not sure kohya_ss is the right script for that. As it stands, kohya_ss is built for image training, and I think the issue you describe comes from the fact that it is very difficult to train on and generate images with high contrast.

If your objective is to output audio, I wonder whether training an audio-specific model might make more sense.

Otherwise, the only thing I can imagine that might work, if the "audio" images are correct but stretched out, is to adjust the image width to match the length. For example, if you are generating a 5-second clip, manually set the generation width to however many horizontal pixels 5 seconds is supposed to be.
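
As a rough illustration of the width adjustment, here is a minimal Python sketch. It assumes the training images have a fixed width that represents a fixed maximum duration; the 512 px and 5-second defaults are placeholders, not values taken from this thread. It also snaps the width to a multiple of 8, which many Stable Diffusion front ends expect, so check what your tool requires. The function names are hypothetical.

```python
def width_for_duration(duration_s, full_duration_s=5.0, full_width_px=512, multiple=8):
    """Return the generation width in pixels for a clip of the given length.

    Assumes a full-width image (full_width_px) represents full_duration_s
    seconds of audio. Both defaults are placeholder values.
    """
    if duration_s <= 0 or full_duration_s <= 0:
        raise ValueError("durations must be positive")
    px_per_second = full_width_px / full_duration_s
    # Clips longer than the full duration are clamped to the full width.
    raw_width = min(duration_s, full_duration_s) * px_per_second
    # Snap to the nearest allowed multiple, but never below one step.
    snapped = max(multiple, round(raw_width / multiple) * multiple)
    return min(snapped, full_width_px)


def white_padding_px(duration_s, full_duration_s=5.0, full_width_px=512, multiple=8):
    """Return how many pixels of white space remain to the right of the clip."""
    used = width_for_duration(duration_s, full_duration_s, full_width_px, multiple)
    return full_width_px - used


if __name__ == "__main__":
    for seconds in (1.0, 2.5, 5.0):
        print(seconds, width_for_duration(seconds), white_padding_px(seconds))
```

With these placeholder defaults, a 2.5-second clip would be generated at half the full width, and the remaining pixels could be filled with white afterwards to match the layout of the training images.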


Problem with big white spaces when training own model #2019