During "shifted crop sampling with dilated sampling," small "foreground" objects can be injected into sharply focused background areas, as you point out in Figure 7 of the paper. You note correctly that, "the priors of current LDMs regarding image crops are solely derived from the general training scheme, which has already resulted in impressive performance. Training a bespoke LDM for a DemoFusion-like framework may be a promising direction to explore."
I'd like to suggest a possible alternative to training a bespoke model, and I wonder whether you (or anyone else) has tried it yet.
During shifted crop sampling with dilated sampling, you could apply an IPAdapter to effectively re-condition diffusion on only the background portion of the global image that the sliding window is currently covering. Even though the diffusion model may not have been trained on many samples of pure background imagery, applying an IPAdapter to each patch may still guide the model toward generating background features rather than foreground objects. IPAdapter is also very cheap: it takes a single 224x224 image, passes it through CLIPVision and then a small projection network to produce four 1024-wide vectors, and injects those vectors into the U-Net layers via cross-attention to guide diffusion.
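To make the idea concrete, here is a rough sketch of how the per-patch re-conditioning might look on top of diffusers' IPAdapter support. This is only an illustration of the idea, not code from the paper or its repo: `background_reference`, the `window` tuple, and the commented-out loop (`iterate_shifted_crops`, `prompt_embeds`, `sdxl_cond_kwargs`) are placeholders standing in for DemoFusion's actual sampling code and conditioning.

```python
# Hypothetical sketch (not from the paper): re-conditioning each shifted-crop
# patch on its own background region via diffusers' IPAdapter support.
# `iterate_shifted_crops`, `window`, `prompt_embeds`, and `sdxl_cond_kwargs`
# are placeholders for DemoFusion's actual sampling loop and conditioning.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
# SDXL IPAdapter weights from the h94/IP-Adapter repo on the Hub.
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models",
                     weight_name="ip-adapter_sdxl.bin")
pipe.set_ip_adapter_scale(0.6)  # how strongly the image prompt steers the U-Net

def background_reference(global_latents, window):
    """Decode the current global latents and crop the region under the window.

    `window` is assumed to be (top, left, height, width) in latent coordinates;
    CLIP's own preprocessing resizes the crop to 224x224 downstream.
    """
    with torch.no_grad():
        image = pipe.vae.decode(
            global_latents / pipe.vae.config.scaling_factor
        ).sample
    image = pipe.image_processor.postprocess(image, output_type="pil")[0]
    top, left, h, w = (v * 8 for v in window)  # latent -> pixel coordinates
    return image.crop((left, top, left + w, top + h))

# Inside the patch loop (pseudocode): the crop becomes the image prompt, and
# its projected CLIP embedding feeds the U-Net's IPAdapter cross-attention.
# for window, patch_latents in iterate_shifted_crops(global_latents):
#     ref = background_reference(global_latents, window)
#     image_embeds = pipe.prepare_ip_adapter_image_embeds(
#         ip_adapter_image=ref, ip_adapter_image_embeds=None,
#         device=pipe.device, num_images_per_prompt=1,
#         do_classifier_free_guidance=True,
#     )
#     noise_pred = pipe.unet(
#         patch_latents, t,
#         encoder_hidden_states=prompt_embeds,
#         added_cond_kwargs={**sdxl_cond_kwargs, "image_embeds": image_embeds},
#     ).sample
```

The main overhead this adds is the VAE decode per window; one could presumably decode the global latents once per denoising step and reuse that image for every crop at that step.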
An alternative to the IPAdapter approach would be to apply a LoRA at each zoom level, specifically trained on zoomed-in samples, to guide diffusion toward background-appropriate imagery. Loading a few LoRAs, one per zoom level, might be less computationally intensive than applying an IPAdapter to every patch, since the latter requires running the latents through the VAE decoder to get an image for CLIPVision.
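For comparison, here is a sketch of the per-zoom-level LoRA variant using diffusers' adapter API. The LoRA paths, adapter names, weights, and `demofusion_pass` are all hypothetical, and the LoRAs themselves would have to be trained on crops at the corresponding zoom factors.

```python
# Hypothetical sketch (not from the paper): swapping in a zoom-specific LoRA
# before each progressive-upscaling pass instead of running IPAdapter per patch.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Placeholder paths: one LoRA per zoom level, each trained on zoomed-in crops.
zoom_loras = {
    2: "loras/background_zoom_2x",
    3: "loras/background_zoom_3x",
    4: "loras/background_zoom_4x",
}
for zoom, path in zoom_loras.items():
    pipe.load_lora_weights(path, adapter_name=f"zoom_{zoom}x")

def activate_zoom_lora(zoom, weight=0.8):
    """Enable only the LoRA that matches the current zoom level."""
    pipe.set_adapters([f"zoom_{zoom}x"], adapter_weights=[weight])

# Progressive upscaling loop (pseudocode):
# for zoom in (2, 3, 4):
#     activate_zoom_lora(zoom)
#     latents = demofusion_pass(pipe, latents, zoom)  # shifted crops + dilation
```

Once loaded, switching adapters between zoom levels is just a weight toggle, so the per-patch cost stays at a single U-Net pass with no extra VAE decode or CLIPVision call.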
During "shifted crop sampling with dilated sampling," small "foreground" objects can be injected into sharply focused background areas, as you point out in Figure 7 of the paper. You note correctly that, "the priors of current LDMs regarding image crops are solely derived from the general training scheme, which has already resulted in impressive performance. Training a bespoke LDM for a DemoFusion-like framework may be a promising direction to explore."
I'd like to suggest a possible alternative to training a bespoke model and I wonder if you (or anyone) has tried this yet.
During shifted crop sampling with dilated sampling, you could apply an IPAdapter to effectively re-condition diffusion on only the background portion of the global image that your sliding window is diffusing over. Although the diffusion model may not have been trained on a large number of samples of background imagery, if IPAdapter is applied to each patch, the model may nonetheless be guided toward generating background features rather than foreground features. IPAdapter is very fast, requiring only a single 224x224 area of pixels, which are passed through CLIPVision and then a small network to generate four 1024-wide vectors; these vectors are applied using cross-attention to the layers of the U-Net to guide diffusion.
An alternative to this approach with IPAdapter would be to apply a LoRA at each zoom level that has been specifically trained on zoomed samples to help guide diffusion to produce background-appropriate imagery. Loading in a few LoRAs for each zoom level might be less intensive than applying a bunch of IPAdapters, which would require running the latents through the VAE decoder to get an image for CLIPVision.