alibaba / EasyNLP

EasyNLP: A Comprehensive and Easy-to-use NLP Toolkit
Apache License 2.0

In DiffSynth, what does "model_id": "../models/fashion_models/91GN31Z1rVS" refer to? #331

Closed · zhangtao22 closed this issue 1 year ago

zhangtao22 commented 1 year ago

In your excellent paper, you mention: "We randomly select 10 source videos from the dataset and fine-tune Stable Diffusion 1.5 [8] on each video, respectively. The fine-tuned models have learned the appearance of each fashion model; in other words, one fine-tuned diffusion model represents a virtual fashion model." How should I fine-tune this model based on https://huggingface.co/runwayml/stable-diffusion-v1-5? And in 2_fashion_video_synthesis.json, what does "model_id": "../models/fashion_models/91GN31Z1rVS" refer to?

Artiprocher commented 1 year ago

The model_id refers to the path of the fine-tuned model. We use this script to fine-tune Stable Diffusion v1.5. If you are not familiar with diffusers, please read the diffusers tutorial.
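
For completeness, a minimal sketch of loading such a fine-tuned checkpoint with diffusers (assuming it was saved in the standard `save_pretrained` directory layout; the path below is just the placeholder from the config, not a released model):

```python
# Minimal sketch: load a locally fine-tuned Stable Diffusion v1.5 checkpoint.
# The path is the placeholder from 2_fashion_video_synthesis.json.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "../models/fashion_models/91GN31Z1rVS",  # local fine-tuned model directory
    torch_dtype=torch.float16,
).to("cuda")
```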

iamjadhav commented 1 year ago

Hi @Artiprocher ,

Continuing the question, it would be helpful to know whether you were referring to the "alibaba-pai/pai-diffusion-general-large-zh" or "alibaba-pai/pai-diffusion-general-xlarge-zh" models on your Hugging Face page (or a different one) when you said "path of the fine-tuned model". Thanks!

zhangtao22 commented 1 year ago

Hi @Artiprocher, for this step, did you use pose information to fine-tune https://huggingface.co/runwayml/stable-diffusion-v1-5? Later on you mention fine-tuning with pose, but the quoted passage ("There is a fashion model with fashion clothes in each video. We randomly select 10 source videos from the dataset and fine-tune Stable Diffusion 1.5 [8] on each video, respectively. The fine-tuned models have learned the appearance of each fashion model; in other words, one fine-tuned diffusion model represents a virtual fashion model.") does not mention pose. If pose is not used in this step, how did you fine-tune using text alone? I mean, does a single prompt like "A woman. Fashion clothes." generate different frames?

Artiprocher commented 1 year ago

Hi @iamjadhav, I'm not referring to alibaba-pai/pai-diffusion-general-large-zh or alibaba-pai/pai-diffusion-general-xlarge-zh. For the fashion video synthesis task, we fine-tune Stable Diffusion v1.5. For example, if you want to synthesize the fashion model in a video, you construct a small dataset consisting of the frames of that video and then fine-tune Stable Diffusion v1.5 on it. The ten fine-tuned diffusion models are not publicly released yet, but you can easily train them yourself.
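
A minimal sketch of building the per-video dataset with OpenCV (file names and the folder layout here are illustrative, not part of the released code):

```python
# Minimal sketch: dump every frame of one source video into a folder, which
# then serves as the fine-tuning dataset for that virtual fashion model.
import os
import cv2

video_path = "source_video.mp4"   # illustrative: one fashion video
out_dir = "frames/fashion_model"  # illustrative: one folder per video
os.makedirs(out_dir, exist_ok=True)

cap = cv2.VideoCapture(video_path)
idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    cv2.imwrite(os.path.join(out_dir, f"{idx:05d}.png"), frame)
    idx += 1
cap.release()
```

The resulting folder of frames, paired with a fixed caption, can then be fed to a standard fine-tuning script such as the `train_text_to_image.py` example in diffusers.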

Artiprocher commented 1 year ago

Hi @zhangtao22, I think this is an interesting question. In our experiments, we observed that a model fine-tuned from Stable Diffusion v1.5 remains compatible with the ControlNet models trained for Stable Diffusion v1.5. Thus we directly fine-tune Stable Diffusion v1.5 on the video frames without pose information; the pose information is only delivered to the model by ControlNet at the inference stage. In fact, the prompt is not important, because the different frames are generated mainly according to the openpose ControlNet.
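
A minimal sketch of that inference setup with diffusers (assuming the fine-tuned checkpoint sits at the local path from the config; `lllyasviel/sd-controlnet-openpose` is the standard openpose ControlNet for SD v1.5):

```python
# Minimal sketch: the fine-tuned model supplies the appearance, while the
# openpose ControlNet supplies the per-frame pose at inference time.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "../models/fashion_models/91GN31Z1rVS",  # fine-tuned model, not vanilla SD v1.5
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

pose = load_image("pose_00000.png")  # illustrative: pose map for one frame
frame = pipe("A woman. Fashion clothes.", image=pose).images[0]
```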

zhangtao22 commented 1 year ago

Hi @Artiprocher. I mean in this stage ("Thus we directly fine-tune Stable Diffusion v1.5 using the video frames without pose information"): one person has several frames, right? And no pose is involved in this stage, right? Then how can you fine-tune SD 1.5 so that it generates different frames for the same person? As I understand it, the input conditions are the same for every frame.

Artiprocher commented 1 year ago

Hi @zhangtao22. One person has several frames, and no pose is involved in the training stage. That's right.

If you are still confused about the "probabilistic distribution" of diffusion models, we recommend reading some theoretical papers, for example "Deep Unsupervised Learning using Nonequilibrium Thermodynamics". Unlike other computer vision tasks, image synthesis typically has no unique answer. That paper models the diffusion process as a stochastic process; essentially, a diffusion model is a map from a Gaussian distribution to a real-world image distribution.
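
In the standard DDPM-style notation (the usual formulation, not copied from our paper), the forward process gradually adds Gaussian noise and the learned reverse process maps noise back to images, which is why different noise samples yield different frames for the same prompt:

```math
q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right), \qquad
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)
```

Sampling starts from $x_T \sim \mathcal{N}(0, I)$, so the same prompt with different noise draws lands on different points of the learned image distribution.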

iamjadhav commented 1 year ago

Hi @Artiprocher ,

Thank you for clearing up my doubt. On a similar note, could you take a look at #332 if possible? I think it might be caused by a mismatch between the controlnet_models and the processors.

zhangtao22 commented 1 year ago

Hi @Artiprocher, so that means during the whole process you only train (fine-tune based on SD 1.5) once. Then you extract poses and depths, all from the target videos. Finally, you use the fine-tuned model together with ControlNet, conditioned on the extracted poses and depths, to generate the target videos. Is that right?

Artiprocher commented 1 year ago

Hi @zhangtao22 . Yes.
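
For reference, a minimal sketch of that end-to-end recipe using diffusers' multi-ControlNet support (model paths and file names are illustrative):

```python
# Minimal sketch: one fine-tuned model + two ControlNets (pose and depth),
# both conditioned on maps extracted from the target video.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnets = [
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16),
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16),
]
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "../models/fashion_models/91GN31Z1rVS",  # the once-fine-tuned model
    controlnet=controlnets,
    torch_dtype=torch.float16,
).to("cuda")

pose = load_image("pose_00000.png")    # extracted from the target video
depth = load_image("depth_00000.png")  # extracted from the target video
frame = pipe(
    "A woman. Fashion clothes.",
    image=[pose, depth],
    controlnet_conditioning_scale=[1.0, 1.0],
).images[0]
```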

zhangtao22 commented 1 year ago

@Artiprocher With this method, can you change the person's clothes? Can I give this person clothes different from the clothes in the training dataset? Say person X wears clothes X in the training dataset; can I then give person X clothes Y (another picture of person X) as the inference image?

Artiprocher commented 1 year ago

@zhangtao22 Obviously not. If you want to do this, you need a "clothes-changing" model in an image-to-image pipeline. However, we found that this is a very difficult task. If we find a breakthrough in this application scenario, we will consider adding a new example to DiffSynth.

zhangtao22 commented 1 year ago

@Artiprocher Thank you! I assume that even if a model were used to change the clothes on this person, the videos generated from the modified image wouldn't be good, because the training images all show the previous clothes.