Open scarbain opened 8 months ago
Hello,
Do you have any insight or recommendations on trying to experiment with SDXL-Turbo? Is there a reason besides training compute why you used 2.1 and not XL?
Also, do you think this img2img approach can work for image translations where the before-and-after images do not share the same spatial layout?
Thanks
Hi,
It should be possible to train this with SDXL-Turbo! We just decided to use SD21-Turbo for compute reasons.
I think image-to-image translation tasks that do not have the same spatial layout would be challenging. We haven't tried such an experiment. Feel free to try it and report your results here!
-Gaurav
I am a beginner with diffusion models and would like to ask: if the translation process changes the image resolution (such as 512 -> 1024), at which stage of the model (VAE encoder, UNet, VAE decoder) should the resolution change be performed? Can you give me some idea? Thanks.
Alright, great, thanks @GaParmar. I think adapting the training script for SDXL is a little beyond my skills, but I'll try it this weekend! If you have any advice or insights from your experience on 2.1 to get me started, that would be great. If I manage to get a working script I'll make a PR for it here!
I'm currently training 2.1 for a lighting-condition task. I'm over 3500 steps right now and will reach the 7K target in a few hours!
Once it has reached 7K steps (or good results), do you think I'd gain anything from fine-tuning my model at resolution 768 and then 1024? Do you think a few thousand steps at each resolution will be enough? And have you tried running inference with more than one step? Would it produce a better result?
Thanks for taking the time to be so active with the posted issues :)
Hi, I have done an implementation of SDXL for pix2pix and cyclegan in the stable-diffusion-xl branch of https://github.com/andjoer/img2img-turbo.git
Currently I have only implemented it for the training algorithms. The additional resolution embeddings are computed in the forward function; the prompt embeddings and additional prompt embeddings are computed in the dataloader. I might move the latter two encodings into the forward function as well for inference. Training works well for CycleGAN, but the network has difficulty understanding the prompt in pix2pix training.
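For reference, a rough sketch of how the extra SDXL conditioning is passed to the diffusers UNet (the values and variable names are illustrative, not the exact code in the branch):
```python
import torch
from diffusers import UNet2DConditionModel

# Besides the usual prompt embeddings, SDXL's UNet expects pooled text embeddings
# and "time ids" that encode the original size, crop coordinates and target size.
unet = UNet2DConditionModel.from_pretrained("stabilityai/sdxl-turbo", subfolder="unet")

bsz = 1
latents = torch.randn(bsz, 4, 64, 64)          # VAE latents of a 512x512 input image
prompt_embeds = torch.randn(bsz, 77, 2048)     # concatenated outputs of both text encoders
pooled_prompt_embeds = torch.randn(bsz, 1280)  # pooled output of the second text encoder
# (original_h, original_w, crop_top, crop_left, target_h, target_w)
add_time_ids = torch.tensor([[512, 512, 0, 0, 512, 512]], dtype=torch.float32)

noise_pred = unet(
    latents,
    timestep=torch.tensor([999]),              # fixed high timestep, as in pix2pix-turbo
    encoder_hidden_states=prompt_embeds,
    added_cond_kwargs={"text_embeds": pooled_prompt_embeds, "time_ids": add_time_ids},
).sample
```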
I wonder whether it would also make sense to insert the refiner, with frozen weights, between the UNet and the VAE. I will try this out in the future. In theory it might be enough to place the refiner between them only at inference, but I doubt it would still be able to process the new latents produced by the retrained base model.
I have not implemented negative prompts yet, but this would not be complicated to do now (I will probably also do it in the future). It would also be possible to add them at inference, after training.
Hi @andjoer! Thanks for your implementation! Did you manage to train a paired or unpaired model with SDXL? If so, how long did it take and how much VRAM did it need?
Hi, the unpaired training works. The paired training is unstable; I guess it usually falls into a local minimum. I was lucky once and it created filled circles, but often it creates only outlines. As mentioned, it does not learn to adapt the color to the text prompt.
To debug this I did a quick implementation of my own with the VAE, the skip layers, and the UNet. The UNet receives not noisy latents but the latents from the VAE directly, and the timestep is always set to 999, as in the pix2pix-turbo implementation. I did not place it in a GAN and use only an L2 loss. This way it trains nicely and stably with a frozen VAE.
My first guess was that the text and additional text embeddings were incorrect, but according to my debug prints they are the same in my working implementation as in my non-working adaptation of pix2pix-turbo. When I train the weights of the VAE, it has the same problem as the pix2pix-turbo adaptation.
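For concreteness, a minimal sketch of that debug setup in diffusers-style code (the names and the epsilon-to-latent conversion are illustrative assumptions, not necessarily the exact code in the branch):
```python
import torch
import torch.nn.functional as F

def one_step_translate(vae, unet, scheduler, x_src, prompt_embeds):
    # Encode the source image directly; no noise is added to the latents.
    latents = vae.encode(x_src).latent_dist.sample() * vae.config.scaling_factor
    # Single UNet pass at the fixed timestep 999, as in pix2pix-turbo.
    # (SDXL's added_cond_kwargs are omitted here for brevity; see the sketch above.)
    t = torch.full((latents.shape[0],), 999, device=latents.device, dtype=torch.long)
    noise_pred = unet(latents, t, encoder_hidden_states=prompt_embeds).sample
    # Convert the epsilon prediction back to a "clean" latent and decode it.
    alpha_bar = scheduler.alphas_cumprod[999].to(latents.device)
    denoised = (latents - (1 - alpha_bar).sqrt() * noise_pred) / alpha_bar.sqrt()
    return vae.decode(denoised / vae.config.scaling_factor).sample

# Training then reduces to a plain reconstruction loss, for example:
# loss = F.mse_loss(one_step_translate(vae, unet, scheduler, x_src, emb), x_tgt)
```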
Hi @scarbain, I have now done some further experimentation and it is just an issue with hyperparameters. Without LoRA and with a learning rate of 5e-7 I was able to train the model on the Fill50k dataset. Unfortunately it has so far only worked when using the L2 loss without the other losses (setting the other lambdas to 0), but the implementation itself works. It could be that on another dataset it will work fine with the GAN and maybe LoRA.
Does SDXL-Turbo require a negative prompt?
Whether you need a negative prompt to achieve decent results might depend on how the base model was pretrained, but usually, and technically, you do not. If you add a negative prompt, the UNet simply produces two latent-space outputs. The difference between the positive-prompt output and the negative-prompt output is then multiplied by a scaling factor and added to the negative-prompt output:
```python
if negative_prompt is not None:
    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
    model_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
```
If you don't use a negative prompt, you simply skip this step.
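A slightly more complete sketch of this guidance step, written as a standalone function (the names and the way the embeddings are batched are illustrative, not the exact repository code):
```python
import torch

def cfg_prediction(unet, latents, timestep, prompt_embeds, negative_prompt_embeds,
                   guidance_scale=7.5):
    # Stack negative and positive prompt embeddings so a single UNet call
    # produces both predictions; chunk(2) then splits them again.
    encoder_hidden_states = torch.cat([negative_prompt_embeds, prompt_embeds], dim=0)
    latent_model_input = torch.cat([latents, latents], dim=0)
    noise_pred = unet(latent_model_input, timestep,
                      encoder_hidden_states=encoder_hidden_states).sample
    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
    # Push the prediction away from the negative prompt and toward the positive one.
    return noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
```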
I understand. I mean, img2img-turbo is built on a pretrained turbo model. Perhaps Turbo doesn't require any negative prompt.
Does this support SDXL? It seems like the code is still only for the turbo model.
Both would work, SDXL and SDXL-Turbo. Turbo means one-step inference rather than step-by-step diffusion; however, you can train SDXL to do one-step inference with the img2img-turbo code as well. The concept of img2img-turbo relies on one-step inference, since otherwise you would need to know the resulting image beforehand and add noise to it, which is only possible in paired training but not in unpaired training. Since a non-turbo diffusion process only performs incremental denoising steps, you could still work in a GAN setting, but it will probably not improve anything. If you are looking for multi-step inference you might want to use instruct-pix2pix instead.
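To illustrate the point about needing the target image: in a conventional multi-step setup you noise the target image according to the forward diffusion process, so the target must be known up front (a minimal sketch with illustrative names, following the standard DDPM formula):
```python
import torch

def noise_target(x0, noise, alphas_cumprod, t):
    # Standard DDPM forward process: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps.
    # This requires the target image x0, which is only available in paired training.
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
```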
I didn't see what code you modified
Did you select the correct branch ("stable-diffusion-xl")? SDXL takes as additional input encodings of the image dimensions and encodings of a negative prompt (zeros if no negative prompt is provided); that's the modification.
Can this project do head redrawing on an image, i.e., converting only the head to an anime style?
@andjoer Hello, I couldn't find the branch of your modified code. Has it been deleted?
It should be here: https://github.com/andjoer/img2img-turbo/tree/stable-diffusion-xl. You can clone it from the terminal with: git clone -b stable-diffusion-xl https://github.com/andjoer/img2img-turbo.git
@andjoer Okay, thank you for your work! I would also like to know: after replacing sd-turbo with sdxl-turbo, in which aspects do the image translation results improve? Is it text understanding, detail generation, or the desired style?
@andjoer If I want to use the sdxl-turbo model, how should these two parameters be set?
```python
parser.add_argument("--pretrained_model_name_or_path", default="stabilityai/sd-turbo")
parser.add_argument("--is_sdxl", type=str2bool, default=False)
```
I just did a new commit with my latest updates, so if you already cloned it, please clone it again. You should be in your img2img-turbo folder when launching the following (not in the source folder). Then run the following for training:
```
export WANDB_API_KEY="xxxxxxx"
accelerate launch src/train_pix2pix_turbo.py \
    --pretrained_model_name_or_path="stabilityai/sdxl-turbo" \
    --is_sdxl=True \
    --output_dir="output/pix2pix_turbo/yyyy" \
    --dataset_folder="data/zzzz" \
    --resolution=512 \
    --lambda_clipsim=0 \
    --lambda_gan=0 \
    --lambda_lpips=0 \
    --train_batch_size=2 \
    --gradient_accumulation_steps=2 \
    --learning_rate=5e-6 \
    --viz_freq 50 \
    --eval_freq 500 \
    --num_samples_eval 200 \
    --max_train_steps 50000 \
    --checkpointing_steps 1000 \
    --track_val_fid \
    --report_to "wandb" --tracker_project_name "yyyy"
```
For inference (paired):
```
python src/inference_paired.py --input_image="yyyy.jpg" --prompt="your prompt" --model_name="stabilityai/sdxl-turbo" --is_sdxl=True --model_path="output/pix2pix_turbo/masks_2/checkpoints/xxxx.pkl" --output_dir="zzzz"
```
I've got it, thanks a lot! But what I'm working on now is an unpaired task. Would the modified code be suitable for unpaired image-to-image translation as well?
Training works for both, but if I see it correctly I only updated the inference script for paired training. I have not done inference with an unpaired-trained model so far; I only trained it on the horse-zebra dataset to check that it works. By the way, there might be some issues with unpaired training on some datasets, as mentioned here: https://github.com/GaParmar/img2img-turbo/issues/87#issuecomment-2337484317