Closed: baizh0u closed this issue 4 months ago.
i need more info, lol. are you using deepspeed? what model are you training, etc.? debug logs would be great too.
I uploaded a screenshot of the bug. Yes, I am using deepspeed to train SDXL, with vae_cache_preprocess set to True. After all the text embeds are cached, this bug shows up. If I set vae_cache_preprocess to False, the whole training process works. I think maybe the preprocessing for the VAE cache and the text embed cache could be simplified a little? By the way, there is a small bug in deepspeed training that is not related to this one; I will open a PR tomorrow for that issue.
i think this is because deepspeed has num_processes > 1
You mean this issue will occur if we use distributed training? Are you planning to fix it?
without deepspeed, distributed training actually doesn't have this problem. are you using one gpu or multiple gpus on this one system?
it is easy to fix; i just need to understand the semantics of the deepspeed setup a bit better.
if you are just using a single gpu the problem is pretty clear, but multiple gpus makes it muddier.
there are about 2392 images counted as not_local, which means it divided the work up as if multiple hosts or gpus were going to participate in the vae caching.
so deepspeed seems to complicate this aspect, but i am not sure how yet. it might be a bug in Accelerate too.
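to make that concrete, here is a hypothetical sketch (not the actual VAECache implementation) of how rank-based work splitting produces those counters: if the process index / process count that Accelerate reports under deepspeed doesn't match what the caching pass expects, every file looks like it belongs to some other rank, and the bucket ends up with cached: 0 and everything in not_local.

```python
# Hypothetical illustration only -- not the real VAECache code from this repo.
from accelerate import PartialState

state = PartialState()  # reports process_index / num_processes under `accelerate launch`


def split_cache_work(all_files):
    """Shard files across ranks; files assigned to other ranks count as not_local."""
    results = {"not_local": 0, "already_cached": 0, "cached": 0, "total": len(all_files)}
    for i, path in enumerate(all_files):
        if i % state.num_processes != state.process_index:
            # another process is expected to encode this image
            results["not_local"] += 1
            continue
        # ... encode `path` with the VAE and write the latent to the cache dir ...
        results["cached"] += 1
    return results
```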
Yes, I am using multiple GPUs for training. So if we are not using deepspeed for distributed training, should we just use FSDP for distributed training? Also, I saw in your code inside train_sdxl.py that if we are not using deepspeed, the optimizer setup should change? Because the BF16 AdamW is only for deepspeed?
deepspeed uses its own C optimiser that's based on AdamW.
everything else uses pure bf16 with stochastic rounding, as per Google's "rethinking mixed precision" paper (name maybe incorrect).
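for reference, stochastic rounding to bf16 is typically implemented by adding random noise to the 16 low bits that bf16 discards before truncating; a minimal sketch of the idea (not the exact optimizer code in this repo):

```python
import torch


def add_stochastic_(target_bf16: torch.Tensor, update_fp32: torch.Tensor) -> None:
    """Apply an fp32 update to a bf16 tensor with stochastic rounding."""
    summed = target_bf16.float() + update_fp32
    bits = summed.view(torch.int32)
    # bf16 keeps the top 16 bits of an fp32 pattern; randomise the discarded low
    # bits so truncation rounds up with probability proportional to the lost part.
    noise = torch.randint_like(bits, low=0, high=1 << 16)
    bits = (bits + noise) & -65536  # -65536 == 0xFFFF0000 as a signed int32
    target_bf16.copy_(bits.view(torch.float32))
```

on average the rounding error cancels out, which is what lets pure-bf16 weights train stably.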
the setup for multi-gpu training is DDP, not FSDP, which isn't supported by ST. I haven't tried deepspeed with multi-gpu training.
are the 2392 images the entire dataset or just a slice of it?
I think the 2392 images are one of the aspect buckets. By DDP, do you mean the DDP from PyTorch? I run the training script through accelerate launch, and the deepspeed config is set up by accelerate config. So if I want to avoid this issue, should I disable deepspeed via accelerate config and just use normal multi-GPU distributed training from accelerate itself? By the way, let me post some of my settings, including the dataset backend and some of the args (I changed the args structure of your original script away from CLI mode).
backend config is here
yeah if you can disable deepspeed and just use NUM_PROCESSES=8 in the sdxl-env.sh script you'll be better off
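for context, a plain multi-GPU Accelerate run without a deepspeed config just wraps the model in DDP; a minimal, generic sketch, assuming a launch along the lines of `accelerate launch --multi_gpu --num_processes=8 train_sdxl.py` (standard Accelerate CLI flags, not necessarily the exact invocation the launcher script uses):

```python
import torch
from accelerate import Accelerator

# without a deepspeed config, Accelerator falls back to plain DDP for multi-gpu
accelerator = Accelerator()

# tiny stand-in model/optimizer just to show the wrapping; the real trainer builds SDXL here
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model, optimizer = accelerator.prepare(model, optimizer)

accelerator.print(f"processes={accelerator.num_processes} rank={accelerator.process_index}")
```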
ok, let me try, thanks.
for issue reports here i really only want to handle issues from people using this actual codebase without any added modifications. it's hard to know what leads to the issue, and the train_sdxl.sh launch script has safeguards in it that the main scripts do not.
When using vae_cache_preprocess for training, the error "Some images were not correctly cached during the VAE Cache operations. Ensure --skip_file_discovery=vae is not set." occurs. I don't know why the script will not encode the images in the bucket processors.
Log output like: 2024-06-23 01:56:56,732 [INFO] (VAECache) Bucket 1.46 caching results: {'not_local': 2392, 'already_cached': 0, 'cached': 0, 'total': 2392}
But if I set vae_cache_preprocess to False, the images are encoded during training; that works, but it is too slow.