yejy53 opened this issue 9 months ago
@yejy53 Training data can be constructed with Hugging Face `datasets`. Each sample should contain three columns: `blueprint` (the line drawing), `image_prompt` (the reference image), and `image` (the image expected to be generated); see the sketch below. Training will be introduced in the README.

As for the architecture: the reference image, i.e. the image prompt, is encoded by a ViT and consumed through cross-attention; it does not go through the VAE. The blueprint is injected into the UNet through additional convolutional layers. The VAE input has not been replaced; it is still the image expected to be generated.
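A minimal sketch of how such a three-column dataset could be assembled with the Hugging Face `datasets` library, assuming images stored as local files; the file paths and output directory are placeholders, not part of the repository:

```python
from datasets import Dataset, Features, Image

# Column names follow the three-column layout described above.
features = Features({
    "blueprint": Image(),     # line drawing (control condition)
    "image_prompt": Image(),  # reference image, later encoded by the ViT
    "image": Image(),         # target image the model should generate
})

# Placeholder paths; in practice, list one path per training sample.
data = {
    "blueprint": ["data/blueprints/0001.png"],
    "image_prompt": ["data/references/0001.png"],
    "image": ["data/targets/0001.png"],
}

# Casting the string paths to the Image feature lets `datasets` decode them.
dataset = Dataset.from_dict(data).cast(features)
dataset.save_to_disk("my_training_dataset")
```

And a rough sketch of the three conditioning paths described above, using standard `transformers`/`diffusers` components as stand-ins; the checkpoint names and the convolutional stack are assumptions for illustration, not the repository's actual modules:

```python
import torch
from transformers import CLIPVisionModel
from diffusers import AutoencoderKL

vit = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

# 1) Reference image -> ViT hidden states; these feed the UNet's
#    cross-attention in place of text embeddings. No VAE here.
reference = torch.randn(1, 3, 224, 224)  # dummy pixel values
image_prompt_embeds = vit(pixel_values=reference).last_hidden_state

# 2) Target image -> VAE latents; unchanged from vanilla Stable Diffusion,
#    this is what the diffusion model learns to denoise toward.
target = torch.randn(1, 3, 512, 512)
latents = vae.encode(target).latent_dist.sample()

# 3) Blueprint -> additional convolutional layers whose features are
#    injected into the UNet. Hypothetical stack, downsampling 512 -> 64
#    to match the latent resolution.
blueprint = torch.randn(1, 3, 512, 512)
blueprint_conv = torch.nn.Sequential(
    torch.nn.Conv2d(3, 32, 3, stride=2, padding=1),
    torch.nn.SiLU(),
    torch.nn.Conv2d(32, 64, 3, stride=2, padding=1),
    torch.nn.SiLU(),
    torch.nn.Conv2d(64, 4, 3, stride=2, padding=1),
)
blueprint_features = blueprint_conv(blueprint)
```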
Thank you for your outstanding contributions. Could you kindly provide your email address? I have several specific inquiries that require your insight.
@yejy53 Of course, aihao2000@outlook.com
Thank you for your great work, but I have a few questions:
1. If I need to train on a new dataset, how should I format it?
2. Compared with the original ControlNet, is it just a matter of replacing the original text encoder with an image VAE input?