PixArt-alpha / PixArt-sigma

PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
https://pixart-alpha.github.io/PixArt-sigma-project/
GNU Affero General Public License v3.0

sigma vs alpha #21

Closed liangwq closed 3 months ago

liangwq commented 3 months ago

The images generated by this version seem not to adhere to instructions as well as the previous alpha version did, and the quality of the generated images also appears to be inferior to that of the previous version. Why is this the case? Wasn't this version of the model expanded in terms of parameters or trained with an extended dataset based on the previous version?

liangwq commented 3 months ago

[two image links hosted on loc.dingtalk.com (local DingTalk paths, not publicly accessible)]

ApolloRay commented 3 months ago

[the same two loc.dingtalk.com image links quoted from above]

Can't open these links.

liangwq commented 3 months ago

[attached images: AC5BE575-2849-4EE7-BD2C-AAB9240A1CBE, 010EE487-4725-45AF-838C-54335C321403]

ApolloRay commented 3 months ago

[attached images: AC5BE575-2849-4EE7-BD2C-AAB9240A1CBE, 010EE487-4725-45AF-838C-54335C321403]

I think you have to use a benchmark to eval these two models.
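A minimal sketch of what such a comparison could look like, assuming the diffusers PixArtAlphaPipeline and torchmetrics' CLIP score; the prompt list and model ID are placeholders, and the sigma checkpoint would need this repo's own inference scripts until its diffusers pipeline is released:

```python
# Untested sketch: average CLIP score over a fixed prompt list for one checkpoint.
# Run the same function for each model and compare the numbers.
import numpy as np
import torch
from diffusers import PixArtAlphaPipeline
from torchmetrics.multimodal.clip_score import CLIPScore

prompts = [  # placeholder benchmark prompts
    "a red fox sitting in a snowy forest at dusk",
    "an astronaut riding a horse on the moon, photorealistic",
]
clip_metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")

def mean_clip_score(pipe) -> float:
    scores = []
    for prompt in prompts:
        # Fix the seed per prompt so both checkpoints start from the same noise.
        generator = torch.Generator("cuda").manual_seed(42)
        image = pipe(prompt, generator=generator).images[0]
        image_tensor = torch.from_numpy(np.array(image)).permute(2, 0, 1)
        scores.append(clip_metric(image_tensor, prompt).item())
    return sum(scores) / len(scores)

pipe_alpha = PixArtAlphaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-XL-2-1024-MS", torch_dtype=torch.float16
).to("cuda")
print("alpha mean CLIP score:", mean_clip_score(pipe_alpha))
```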

liangwq commented 3 months ago

[attached images: AC5BE575-2849-4EE7-BD2C-AAB9240A1CBE, 010EE487-4725-45AF-838C-54335C321403]

I think you have to use a benchmark to eval these two models.

This is because I used the same prompt to generate results from both models. The two models should stay consistent in following the same instructions, with the new model enhancing certain capabilities; otherwise, the models lack continuity.

ApolloRay commented 3 months ago

This is because I used the same prompt to generate results from both models. The two models should stay consistent in following the same instructions, with the new model enhancing certain capabilities; otherwise, the models lack continuity.

[screenshot 2024-04-09 10 50 55] I don't understand the consistency you expect. Even in the paper, the same prompt does not give the same result.

ApolloRay commented 3 months ago

The images generated by this version seem not to adhere to instructions as well as the previous alpha version did, and the quality of the generated images also appears to be inferior to that of the previous version. Why is this the case? Wasn't this version of the model expanded in terms of parameters or trained with an extended dataset based on the previous version?

I think you should read the paper first. Sigma differs from alpha in the text embedding length, the captioner method, etc.

liangwq commented 3 months ago

This is because I used the same prompt to generate results from both models. The two models should stay consistent in following the same instructions, with the new model enhancing certain capabilities; otherwise, the models lack continuity.

[screenshot 2024-04-09 10 50 55] I don't understand the consistency you expect. Even in the paper, the same prompt does not give the same result.

Yes, you can see the difference between the results shown in the paper and the actual test results. I think the results reported in the paper are acceptable, but how big is the gap under the same prompt in actual tests?

lawrence-cj commented 3 months ago

This is because I used the same prompt to generate results from both models. The two models should stay consistent in following the same instructions, with the new model enhancing certain capabilities; otherwise, the models lack continuity.

It would be better to use the same initial noise for the two models. Besides, the sigma model is fine-tuned from the alpha model on a changed dataset, which is the main reason the two models differ.

ApolloRay commented 3 months ago

This is because I used the same prompt to generate results from both models. The two models should stay consistent in following the same instructions, with the new model enhancing certain capabilities; otherwise, the models lack continuity.

It would be better to use the same initial noise for the two models. Besides, the sigma model is fine-tuned from the alpha model on a changed dataset, which is the main reason the two models differ.

Will the DMD model be released today?

lawrence-cj commented 3 months ago

1024 ckpt today. DMD then.

ApolloRay commented 3 months ago

1024 ckpt today. DMD then.

thx.

liangwq commented 3 months ago

This is because I used the same prompt to generate results from both models. The two models should stay consistent in following the same instructions, with the new model enhancing certain capabilities; otherwise, the models lack continuity.

It would be better to use the same initial noise for the two models. Besides, the sigma model is fine-tuned from the alpha model on a changed dataset, which is the main reason the two models differ.

Do you mean that if I keep the seed consistent, the two results will be more similar? Or how else should I keep the noise consistent? Currently, the sigma model automatically determines the image size from the text, and this cannot be manually controlled. What I mean is: how did you achieve consistent prompt results in your paper? Did you validate on more prompts of different styles, or did you randomly select some prompts as golden samples for testing before release?

lawrence-cj commented 3 months ago

During our tests, we used the diffusers version (coming soon), which may be slightly different from the inference code in this repo. But we chose the same noise as input for testing: https://github.com/PixArt-alpha/PixArt-sigma/blob/592d4650ed5ad5f4efbc24376181cd519a9fa5b2/scripts/interface.py#L115

Besides, if you are familiar with diffusion model training, the results may simply differ as training goes on across epochs. Or do you know of any method that can help the model produce similar content with higher quality during training? I would like to give it a try.
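A rough sketch of what "same initial noise" means in practice; the shapes, device, and the commented sample() calls below are placeholders rather than this repo's API:

```python
# Rough sketch (not this repo's code): draw one latent noise tensor from a fixed
# seed and reuse it for both checkpoints, so any difference in the outputs comes
# from the weights rather than the starting point.
import torch

seed = 0
latent_channels, height, width = 4, 1024, 1024  # PixArt's VAE downsamples by 8
generator = torch.Generator(device="cuda").manual_seed(seed)
init_noise = torch.randn(
    (1, latent_channels, height // 8, width // 8),
    generator=generator,
    device="cuda",
    dtype=torch.float16,
)

# Hypothetical sampling calls -- pass the *same* tensor to both models:
# image_alpha = sample(alpha_model, prompt, latents=init_noise.clone())
# image_sigma = sample(sigma_model, prompt, latents=init_noise.clone())
```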

liangwq commented 3 months ago

During our tests, we used the diffusers version (coming soon), which may be slightly different from the inference code in this repo. But we chose the same noise as input for testing:

https://github.com/PixArt-alpha/PixArt-sigma/blob/592d4650ed5ad5f4efbc24376181cd519a9fa5b2/scripts/interface.py#L115

Besides, if you are familiar with diffusion model training, the results may simply differ as training goes on across epochs. Or do you know of any method that can help the model produce similar content with higher quality during training? I would like to give it a try.

Okay, thank you. Regarding what you mentioned about maintaining consistency across updated versions of the model, I have given it some thought. It's just an idea, but you might want to see whether it's feasible:

Generally speaking, the reasons for inconsistency seem to be threefold:

  1. Differences in the conditional latent distribution
  2. Differences in the text-to-image alignment model
  3. Differences in the diffusion generation process

Since we hope to maintain stability across training iterations and ensure that later versions are better than earlier ones, we need to keep the conditional latent as consistent as possible between versions, or at least consistent at a macro level. If we aim to improve description detail and fine-grained alignment, then perhaps we could change how wording and details are described to teach the model to express details (for example, making sure the text is compressed into as consistent a latent distribution as possible).

For text-image pairs whose overall results do not meet expectations, we could have the new version of the model correct them to the right representation.

That is, iterating on subsequent models should mean SFT and RLHF-style alignment on top of the base model. If a second round of pre-training is really necessary, it should focus only on learning from images that were previously represented poorly.
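A rough, untested sketch of how that continuity idea could be expressed as an extra loss term during fine-tuning; the model and batch interfaces below are hypothetical, not this repo's trainer. The frozen previous model stays around, and the new model is penalized when its prediction drifts from it:

```python
# Hypothetical sketch of a continuity regularizer for fine-tuning a new version
# against a frozen previous version. `new_model`, `old_model`, and the batch
# fields are placeholders for illustration only.
import torch
import torch.nn.functional as F

def training_step(new_model, old_model, batch, consistency_weight=0.1):
    noisy_latents = batch["noisy_latents"]
    timesteps = batch["timesteps"]
    text_emb = batch["text_emb"]
    noise_target = batch["noise"]

    pred_new = new_model(noisy_latents, timesteps, text_emb)

    # Standard denoising objective (epsilon prediction assumed).
    denoise_loss = F.mse_loss(pred_new, noise_target)

    # Continuity term: keep the new model close to the frozen old model.
    # Samples the old model handled poorly could be excluded via a mask so the
    # new model stays free to correct them.
    with torch.no_grad():
        pred_old = old_model(noisy_latents, timesteps, text_emb)
    consistency_loss = F.mse_loss(pred_new, pred_old)

    return denoise_loss + consistency_weight * consistency_loss
```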

lawrence-cj commented 3 months ago

Would you please DM me in the Discord community so that we can discuss this further? I'm curious about the idea.

liangwq commented 3 months ago

Would you please DM me in the Discord community so that we can discuss this further? I'm curious about the idea.

Can you give me your Discord ID? How can I contact you?

lawrence-cj commented 3 months ago

https://discord.gg/rde6eaE5Ta