Closed: baizh0u closed this issue 4 months ago.
i need more info, lol. are you using deepspeed? what model are you training, etc.? debug logs would be great too.
I uploaded a screenshot of the bug. Yes, I am using deepspeed to train SDXL, with vae_cache_preprocess set to True. After all the text embeds are cached, this bug shows up. If I set vae_cache_preprocess to False, the whole training process works. I think maybe the preprocessing for the VAE cache and the text embed cache could be simplified a little? By the way, there is a small bug in deepspeed training that is not related to this one; I will open a PR tomorrow for that issue.
i think this is because deepspeed has num_processes > 1
You mean this issue will occur if we use distributed training? Are you planning to fix it?
without deepspeed, distributed training actually doesn't have this problem. are you using one gpu or multiple gpus on this one system?
it is easy to fix; i just need to understand the semantics of the deepspeed setup a bit better.
if you are just using a single gpu the problem is pretty clear, but multiple gpus makes it muddier.
there are about 2392 images counted as not_local, which means it divided the work up as if multiple hosts or gpus were going to participate in the vae caching.
so deepspeed seems to complicate this aspect, but i am not sure how yet. it might be a bug in Accelerate too.
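to make that concrete, here is a hypothetical sketch (not the actual VAECache implementation) of how rank-based work splitting produces those counters: if the process index / process count that Accelerate reports under deepspeed doesn't match what the caching pass expects, every file looks like it belongs to some other rank, and the bucket ends up with cached: 0 and everything in not_local.

```python
# Hypothetical illustration only -- not the real VAECache code from this repo.
from accelerate import PartialState

state = PartialState()  # reports process_index / num_processes under `accelerate launch`


def split_cache_work(all_files):
    """Shard files across ranks; files assigned to other ranks count as not_local."""
    results = {"not_local": 0, "already_cached": 0, "cached": 0, "total": len(all_files)}
    for i, path in enumerate(all_files):
        if i % state.num_processes != state.process_index:
            # another process is expected to encode this image
            results["not_local"] += 1
            continue
        # ... encode `path` with the VAE and write the latent to the cache dir ...
        results["cached"] += 1
    return results
```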
Yes, I am using multiple GPUs for training. So if we are not using deepspeed for distributed training, should we just use FSDP for distributed training? Also, I saw in your code inside train_sdxl.py that if we are not using deepspeed, the optimizer setup should change? Because the BF16 AdamW is only for deepspeed?
deepspeed uses its own C optimiser that's based on AdamW.
everything else uses pure bf16 with stochastic rounding, as per Google's "rethinking mixed precision" paper (name maybe incorrect).
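for reference, stochastic rounding to bf16 is typically implemented by adding random noise to the 16 low bits that bf16 discards before truncating; a minimal sketch of the idea (not the exact optimizer code in this repo):

```python
import torch


def add_stochastic_(target_bf16: torch.Tensor, update_fp32: torch.Tensor) -> None:
    """Apply an fp32 update to a bf16 tensor with stochastic rounding."""
    summed = target_bf16.float() + update_fp32
    bits = summed.view(torch.int32)
    # bf16 keeps the top 16 bits of an fp32 pattern; randomise the discarded low
    # bits so truncation rounds up with probability proportional to the lost part.
    noise = torch.randint_like(bits, low=0, high=1 << 16)
    bits = (bits + noise) & -65536  # -65536 == 0xFFFF0000 as a signed int32
    target_bf16.copy_(bits.view(torch.float32))
```

on average the rounding error cancels out, which is what lets pure-bf16 weights train stably.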
the setup for multi-gpu training is DDP, not FSDP, which isn't supported by ST. I haven't tried deepspeed with multi-gpu training.
are the 2392 images the entire dataset or just a slice of it?
I think the 2392 images are one of the aspect buckets. By DDP, do you mean the DDP from PyTorch? I run the training script through accelerate launch, and the deepspeed config is set up by accelerate config. So if I want to avoid this issue, should I disable deepspeed via accelerate config and just use normal multi-GPU distributed training from accelerate itself? By the way, let me post some of my settings, including the dataset backend and some of the args (I changed the args structure of your original script away from CLI mode).
backend config is here
yeah if you can disable deepspeed and just use NUM_PROCESSES=8 in the sdxl-env.sh script you'll be better off
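for context, a plain multi-GPU Accelerate run without a deepspeed config just wraps the model in DDP; a minimal, generic sketch, assuming a launch along the lines of `accelerate launch --multi_gpu --num_processes=8 train_sdxl.py` (standard Accelerate CLI flags, not necessarily the exact invocation the launcher script uses):

```python
import torch
from accelerate import Accelerator

# without a deepspeed config, Accelerator falls back to plain DDP for multi-gpu
accelerator = Accelerator()

# tiny stand-in model/optimizer just to show the wrapping; the real trainer builds SDXL here
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model, optimizer = accelerator.prepare(model, optimizer)

accelerator.print(f"processes={accelerator.num_processes} rank={accelerator.process_index}")
```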
ok, let me try, thanks.
for issue reports here i really only want to handle issues from people using this actual codebase without any added modifications. it's hard to know what leads to the issue, and the train_sdxl.sh launch script has safeguards in it that the main scripts do not.
When using vae_cache_preprocess for training, the error "Some images were not correctly cached during the VAE Cache operations. Ensure --skip_file_discovery=vae is not set." occurs. I don't know why the script will not encode the images in the bucket processors.
Log output like: 2024-06-23 01:56:56,732 [INFO] (VAECache) Bucket 1.46 caching results: {'not_local': 2392, 'already_cached': 0, 'cached': 0, 'total': 2392}
But if I set vae_cache_preprocess to False, the images are encoded during training; that works, but it is too slow.