exx8 / differential-diffusion


Can this be adapted for SD3 release ? #28

Open vikm2o opened 1 week ago

exx8 commented 1 week ago

I believe it can; the algorithm should be the same. I am considering making a new release that supports the following diffusion models: SC, PixArt-Σ, SD3, and Hunyuan-DiT.

vikm2o commented 1 week ago

I tried making changes for SD3, but it fails at this line: latents = original_with_noise[i] * mask + latents * (1 - mask)

prepare_latents in the SD3 pipeline (https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion_3/pipeline_stable_diffusion_3_img2img.py) returns a tensor of shape torch.Size([1, 16, 80, 120]),

and if I change the line to latents = original_with_noise[:1] * mask + latents * (1 - mask), it still doesn't work.

my changes are here https://github.com/vikm2o/differential-diffusion

exx8 commented 1 week ago

Hi, can you specify the dimensions of original_with_noise[i], mask, and latents? If I recall correctly, the only dimension-wise difference for SD3 is the number of latent channels, which is 16 in SD3 instead of 4 in earlier versions.

Thanks!

vikm2o commented 1 week ago

latents shape: torch.Size([1, 16, 80, 120])
original_with_noise shape: torch.Size([1, 16, 80, 120])
masks shape: torch.Size([120, 80, 120])

prepare_latents in the SD3 pipeline produces latents of shape torch.Size([1, 16, 80, 120]) at this line: https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion_3/pipeline_stable_diffusion_3_img2img.py#L646
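To make the shape question concrete, here is a minimal sketch that checks the shapes reported above against the usual right-aligned broadcasting rule. The helper can_broadcast is hypothetical (it is not part of diffusers or this repository), and it only inspects shapes, so it says nothing about whether the values are correct.

```python
def can_broadcast(shape_a, shape_b):
    """Return True if two shapes are compatible under right-aligned broadcasting."""
    for a, b in zip(reversed(shape_a), reversed(shape_b)):
        # Two trailing dimensions are compatible if they match or one of them is 1.
        if a != b and a != 1 and b != 1:
            return False
    return True

# Shapes as reported in this thread for SD3.
original_with_noise_shape = (1, 16, 80, 120)
masks_shape = (120, 80, 120)        # appears to be one (80, 120) mask per step
single_mask_shape = masks_shape[1:]  # shape of masks[i]

full_stack_ok = can_broadcast(original_with_noise_shape, masks_shape)
single_mask_ok = can_broadcast(original_with_noise_shape, single_mask_shape)

# original_with_noise[i] is only valid while i < size of dimension 0.
valid_indices = original_with_noise_shape[0]

print(full_stack_ok, single_mask_ok, valid_indices)
```

Under these assumptions, multiplying by the full masks tensor does not broadcast (16 channels against 120 steps), a single masks[i] does, and original_with_noise has only one valid index along its first dimension.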

exx8 commented 1 week ago

Is the number of steps 120? If not, there might be an error in one of the broadcasting operations.

The step should be: latents = original_with_noise[i] * mask + latents * (1 - mask). Here original_with_noise should contain versions of the picture with the amount of noise corresponding to each timestep.
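The step above can be sketched as follows. This is an illustration under stated assumptions, not the repository's actual code: blend_step is a hypothetical helper, masks is assumed to hold one boolean (H, W) mask per step, and a True entry is assumed to mean "take the noised original at this position".

```python
import torch

def blend_step(latents, original_with_noise, masks, i):
    """Blend the noised original into the current latents for step i."""
    mask = masks[i].to(latents.dtype)  # (H, W), assumed per-step mask
    mask = mask[None, None]            # (1, 1, H, W), broadcasts over batch and channels
    # original_with_noise[i] has shape (C, H, W); the result has shape (1, C, H, W).
    return original_with_noise[i] * mask + latents * (1 - mask)

# Small placeholder sizes; the thread uses C=16, H=80, W=120.
T, C, H, W = 4, 16, 8, 12
original_with_noise = torch.ones(T, C, H, W)   # stand-in for the noised originals
latents = torch.zeros(1, C, H, W)              # stand-in for the current latents
masks = torch.zeros(T, H, W, dtype=torch.bool)
masks[:, :, : W // 2] = True                   # left half taken from the original

out = blend_step(latents, original_with_noise, masks, 0)
print(out.shape)
```

Because the mask is expanded to (1, 1, H, W), the same function works for 4 or 16 latent channels as long as original_with_noise has one entry per step.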

vikm2o commented 1 week ago

Yes, the number of steps is 120. The retrieve_timesteps call at this line https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion_3/pipeline_stable_diffusion_3_img2img.py#L853 changes 200 to 120. I'm not sure why, but this is what I observe.
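One possible explanation, not confirmed in this thread: img2img pipelines commonly skip the early part of the schedule according to a strength parameter, so the reduction may come from that truncation rather than from retrieve_timesteps itself. The sketch below assumes a strength of 0.6; effective_steps is a hypothetical helper that mirrors this common logic and is not a diffusers function.

```python
def effective_steps(num_inference_steps, strength):
    """Number of denoising steps left after strength-based truncation (assumed logic)."""
    # Steps that are actually run, capped at the requested total.
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    # Index of the first step that is kept.
    t_start = max(num_inference_steps - init_timestep, 0)
    return num_inference_steps - t_start

steps = effective_steps(200, 0.6)
print(steps)
```

If this is what happens, requesting 200 steps at strength 0.6 would leave 120 steps, which matches the first dimension of the masks tensor reported above.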

vikm2o commented 1 week ago

original_with_noise shape for SD3 is torch.Size([1, 16, 80, 120]) for 200 steps. For SD2 it's torch.Size([201, 4, 80, 120]) for 200 steps. So it can't be indexed with original_with_noise[i]

exx8 commented 1 week ago

> yes number of steps is 120 . retrieve_timesteps in this line https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion_3/pipeline_stable_diffusion_3_img2img.py#L853 changes 200 to 120. Not sure why but this is what I observe.

Yeah, this is a new function which does not exist in diffusers@0.19. I am not sure what it does. I opened an issue in diffusers: https://github.com/huggingface/diffusers/issues/8577 Maybe they can advise you?

exx8 commented 1 week ago

> original_with_noise shape for SD3 is torch.Size([1, 16, 80, 120]) for 200 steps. For SD2 it's torch.Size([201, 4, 80, 120]) for 200 steps. So it can't be indexed with original_with_noise[i]

OK, if I understand correctly, what is missing is the creation of a tensor with multiple noised versions of the original image: original_with_noise should be torch.Size([201, 16, 80, 120]).
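A rough sketch of how such a tensor could be built is shown below. It rests on several assumptions: that SD3's flow-matching noising interpolates linearly between the clean latent and noise as (1 - sigma) * x0 + sigma * noise, that the same noise sample is reused at every level, and that the sigma values come from the scheduler in the real pipeline (the linspace here is only a placeholder). build_noised_stack is a hypothetical helper, not part of diffusers or this repository.

```python
import torch

def build_noised_stack(x0, sigmas, noise=None):
    """Stack one noised copy of x0 per sigma along dimension 0."""
    if noise is None:
        noise = torch.randn_like(x0)                 # one shared noise sample (assumption)
    sigmas = sigmas.to(x0.dtype).view(-1, 1, 1, 1)   # (T, 1, 1, 1)
    # x0 and noise are (1, C, H, W); broadcasting yields (T, C, H, W).
    return (1 - sigmas) * x0 + sigmas * noise

# Small spatial size to keep the example light; the thread uses H=80, W=120.
x0 = torch.randn(1, 16, 8, 12)              # stand-in for the encoded original image
noise = torch.randn_like(x0)
sigmas = torch.linspace(1.0, 0.0, 201)      # placeholder schedule: pure noise down to clean

original_with_noise = build_noised_stack(x0, sigmas, noise)
print(original_with_noise.shape)
```

With 80 by 120 latents this would give the torch.Size([201, 16, 80, 120]) mentioned above, so original_with_noise[i] can be indexed per step again.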

vikm2o commented 1 week ago

yes