YangLing0818 / RPG-DiffusionMaster

[ICML 2024] Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs (RPG)
https://proceedings.mlr.press/v235/yang24ai.html
MIT License
1.7k stars, 99 forks

Trying to replicate: Multi-people with complex attribute binding #23

Closed RahulSinghalChicago closed 9 months ago

RahulSinghalChicago commented 9 months ago

Any insights on what went wrong?

CMD python RPG.py --user_prompt 'A couple, the beautiful girl on the left, silver hair, braided ponytail, happy, dynamic, energetic, peaceful, the handsome young man on the right detailed gorgeous face, grin, blonde hair, enchanting' --version_number 0 --api_key $GPT_KEY --use_gpt --model_name 'albedobaseXL_v20.safetensors' --base_ratio .2 --activate True --batch_size 3 --cfg 8 --steps 30 --height 1024 --width 1024

Models Using GPT4 API with albedobaseXL_v20.safetensors

Results 00016-1234 00017-1235 00018-1236

Log from GPT4

waiting for GPT-4 response
### Original Caption:
"A couple, the beautiful girl on the left with silver hair and braided ponytail, who looks happy, dynamic, energetic, and peaceful; and the handsome young man on the right with a detailed gorgeous face, a grin, and blonde hair, enchanting."

### Key phrases identification:
We identify a couple and four attributes for the girl, three for the man:
1. Girl's silver braided ponytail (hair features)
2. Girl's mood and demeanor (happy, dynamic, energetic, peaceful)
3. Man's face detail and expression (gorgeous face, grin)
4. Man's blonde hair (hair features)

Splitting the image into four subregions may be unnecessary since we could amalgamate certain features to maintain simplicity in accordance with rule 4. Therefore, I will group the girl's mood and demeanor into one region and the man's face detail and expression into another.

### Split Ratio Planning:
#### Horizontal Split Ratio: 1;1
- This ratio splits the image into two horizontal rows, each dedicated to one person in the couple.
#### Vertical Split Ratio: `1,(1,1); 1,(1,1)`
- By using this ratio, each row is further divided equally into two vertical subregions, representing the key features of each person.

#### Detailed Subregion Prompts:
1. **First Row** (`1,(1,1)`):
   - **Region 0:** The girl's lustrous silver hair fashioned into an intricate braided ponytail, an emblem of her grace and poise.
   - **Region 1:** Her radiant expression encapsulating happiness and energy, with a demeanor that exudes dynamism and serenity.
2. **Second Row** (`1,(1,1)`):
   - **Region 2:** The young man's face, capturing every striking detail, from his sharp jawline to the warmth of his inviting grin.
   - **Region 3:** The allure of his blonde hair, styled effortlessly, enhancing his charming and enchanting presence.

#### Composition Logic:
- The first row foregrounds the girl's distinctive hairstyle and her emotional state, reflecting her personality and vibrancy.
- The second row focuses on the young man's face and hair, delineating his attractive features and the charm he radiates.

#### Aesthetic Considerations:
- The silver braided ponytail and the girl's expressive demeanor create a picture that connects with viewers emotionally, highlighting her traits in a positive light.
- The man's handsome face with a grin paired with his enchanting blonde hair offers a captivating visual appeal, contrasting yet complementing the girl's portrayal.

By structuring the layout plan with these focal points, each region effectively presents specific aspects of the subject's appearance and emotion in a vivid and engaging manner. The layout ensures a coherent and aesthetically appealing image.

### Output:
Horizontal split ratio: 1;1
Vertical split ratio: 1,(1,1); 1,(1,1)
Split ratio: 1,(1,1); 1,(1,1)
Regional Prompt: 
The girl's radiant silver hair, woven into a beautiful braid that flows into a ponytail, reflecting her elegance and poise. BREAK
Revealing her expressions of happiness and vitality, her face a picture of dynamic energy and peaceful contentment. BREAK
The young man's detailed, handsome face, his eyes warm and inviting, framed by a confident grin that speaks volumes about his character. BREAK
His blonde hair, the color of summer wheat, styled in a way that adds to his enchanting allure and overall handsomeness.
Horizontal split ratio: 1;1
Vertical split ratio: 1,(1,1); 1,(1,1)
Split ratio: 1,(1,1); 1,(1,1)
Regional Prompt: 
The girl's radiant silver hair, woven into a beautiful braid that flows into a ponytail, reflecting her elegance and poise. BREAK
Revealing her expressions of happiness and vitality, her face a picture of dynamic energy and peaceful contentment. BREAK
The young man's detailed, handsome face, his eyes warm and inviting, framed by a confident grin that speaks volumes about his character. BREAK
His blonde hair, the color of summer wheat, styled in a way that adds to his enchanting allure and overall handsomeness.
{'split ratio': '1,1,1; 1,1,1', 'Regional Prompt': "The girl's radiant silver hair, woven into a beautiful braid that flows into a ponytail, reflecting her elegance and poise. BREAK\nRevealing her expressions of happiness and vitality, her face a picture of dynamic energy and peaceful contentment. BREAK\nThe young man's detailed, handsome face, his eyes warm and inviting, framed by a confident grin that speaks volumes about his character. BREAK\nHis blonde hair, the color of summer wheat, styled in a way that adds to his enchanting allure and overall handsomeness."}
select_checkpoint: albedobaseXL_v20.safetensors [a928fee35b]
process_script_args (True, False, 'Matrix', 'Columns', 'Mask', 'Prompt', '1,1,1; 1,1,1', 0.2, False, False, False, 'Attention', [False], 0, 0, 0.4, None, 0, 0, False)
[Errno 2] No such file or directory: '/app/RPG-DiffusionMaster/regional_prompter_presets.json'
Init / preset error.
fatal: No names found, cannot describe anything.
1,1,1; 1,1,1 0.2 Horizontal
Regional Prompter Active, Pos tokens : [24, 20, 27, 26], Neg tokens : [0]
select_checkpoint: albedobaseXL_v20.safetensors [a928fee35b]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [00:24<00:00,  1.21it/s]
Total progress: 30it [00:26,  1.14it/s]
RahulSinghalChicago commented 9 months ago

all of the splits are 1?

The notebook worked for split humans, so do I need a template example that goes with that one?
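On the "all of the splits are 1?" question: the numbers are ratios rather than sizes, so a string of ones just means equal shares. Below is a minimal Python sketch of how such a string could map to regions. It assumes the Regional Prompter matrix convention in which `;` separates rows, the first number in each row is that row's height ratio, and the remaining numbers are column width ratios; a string without `;` is read as a single row of columns. The names `flatten_ratio` and `parse_split_ratio` are hypothetical, and the parenthesis stripping is only inferred from the log above, where `1,(1,1); 1,(1,1)` shows up as `1,1,1; 1,1,1`. This is not code from RPG.py.

```python
from fractions import Fraction


def flatten_ratio(spec):
    """Drop grouping parentheses, e.g. '1,(1,1); 1,(1,1)' -> '1,1,1; 1,1,1'."""
    return spec.replace("(", "").replace(")", "")


def parse_split_ratio(spec):
    """Return normalized (x0, y0, x1, y1) boxes, row by row, left to right.

    Assumed convention: ';' separates rows, the first number in a row is
    the row height ratio, the rest are column width ratios. Without ';'
    every number is treated as a column width in a single row.
    """
    spec = flatten_ratio(spec)
    rows = []
    if ";" in spec:
        for row in spec.split(";"):
            if not row.strip():
                continue
            nums = [Fraction(tok.strip()) for tok in row.split(",")]
            # A row with only a height gets one full-width column.
            rows.append((nums[0], nums[1:] or [Fraction(1)]))
    else:
        nums = [Fraction(tok.strip()) for tok in spec.split(",")]
        rows.append((Fraction(1), nums))

    total_height = sum(height for height, _ in rows)
    boxes = []
    y0 = Fraction(0)
    for height, widths in rows:
        y1 = y0 + height / total_height
        total_width = sum(widths)
        x0 = Fraction(0)
        for width in widths:
            x1 = x0 + width / total_width
            boxes.append((float(x0), float(y0), float(x1), float(y1)))
            x0 = x1
        y0 = y1
    return boxes


if __name__ == "__main__":
    for box in parse_split_ratio("1,(1,1); 1,(1,1)"):
        print(box)
```

Under that assumption, `1,1,1; 1,1,1` describes a 2x2 grid of equal regions, one per `BREAK`-separated prompt, while `1,1` describes two side-by-side columns.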

YangLing0818 commented 9 months ago

Thanks for your comments. We actually ran an ablation experiment for this scenario in our paper. [Image 1] You need to use a larger base ratio to get satisfactory results. A smaller CFG is also recommended, as it tends to produce better results. As for the split ratio, I made slight adjustments; here are my results for your example prompt:

python RPG.py --user_prompt 'A couple, the beautiful silver braided ponytail girl on the left happy and peaceful, the handsome blonde hair young man on the right detailed gorgeous face.' --api_key 'your api key' --use_gpt --use_base --base_ratio 0.5 --base_prompt 'a couple are chatting'

Here are some results I got with batch_size=4:

image

The stability of the MLLM's responses depends on our template library. The template in our repository is currently a trial version. Stay tuned for our upcoming series of updates. Thank you for your attention!
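To illustrate why a larger base ratio pulls the two subjects into one coherent scene, here is a minimal Python sketch. It assumes the base ratio acts as a linear blending weight between the result conditioned on the base prompt and the result assembled from the regional prompts. Plain lists of floats stand in for latents, and `blend_with_base` is a hypothetical name, not a function from this repository.

```python
def blend_with_base(base, regional, base_ratio):
    """Weighted sum of a base-prompt result and a regional result.

    base_ratio = 0.0 keeps only the regional result;
    base_ratio = 1.0 keeps only the base-prompt result.
    (Assumed behavior; the lists are stand-ins for latent tensors.)
    """
    if not 0.0 <= base_ratio <= 1.0:
        raise ValueError("base_ratio must be between 0 and 1")
    if len(base) != len(regional):
        raise ValueError("base and regional must have the same length")
    return [
        base_ratio * b + (1.0 - base_ratio) * r
        for b, r in zip(base, regional)
    ]


if __name__ == "__main__":
    base = [1.0, 1.0]      # stand-in for the 'a couple are chatting' result
    regional = [0.0, 2.0]  # stand-in for the stitched regional result
    print(blend_with_base(base, regional, 0.2))  # original run
    print(blend_with_base(base, regional, 0.5))  # suggested setting
```

With `--base_ratio .2`, as in the original command, the regional prompts dominate under this reading; raising it to 0.5 gives the shared scene description equal weight.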

mrschmiklz commented 9 months ago

@YangLing0818 Thanks!

RahulSinghalChicago commented 9 months ago

Could you post the log that came back from GPT for this one?

Edit: Also what is the adjusted split ratio?

YangLing0818 commented 9 months ago

The GPT-4 response is as follows:

Original Caption: "A couple, the beautiful silver braided ponytail girl on the left happy and peaceful, the handsome blonde hair young man on the right detailed gorgeous face."

Key Phrases Identification: For the couple, we have two main subjects, each with distinct features:

  1. Beautiful girl with a silver braided ponytail, described as happy and peaceful.
  2. Handsome young man with blonde hair, described as having a detailed and gorgeous face.

Since the subjects are side by side rather than one above the other, we only need to split the image into columns, not rows.

Split Ratio Planning: Horizontal Split Ratio: 1

Detailed Subregion Prompts:

  1. Single Row (1):
    • Region 0: The girl on the left, her silver braided ponytail reflecting the soft lighting, embodying a serene and joyful aura.
    • Region 1: The young man on the right, his blonde hair and finely detailed features creating a striking portrait of handsomeness.

Composition Logic:

Aesthetic Considerations:

Output: Horizontal Split Ratio: 1
Vertical Split Ratio: 1,1
Final Split Ratio: 1,1
Final Regional Prompt:
The girl on the left, her silver braided ponytail reflecting the soft lighting, embodying a serene and joyful aura. BREAK
The young man on the right, his blonde hair and finely detailed features creating a striking portrait of handsomeness.
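For anyone comparing GPT logs like the one above, this is a minimal Python sketch of pulling the final split ratio and the `BREAK`-separated regional prompts out of a response. The function name `extract_plan` and the regular expressions are assumptions made for illustration. They accept both the `Split ratio:` and `Final Split Ratio:` labels seen in this thread, and they are not the parser used by RPG.py.

```python
import re


def extract_plan(response):
    """Return (split_ratio, [regional prompts]) from a GPT response.

    Only lines that start with 'Split ratio:' or 'Final Split Ratio:'
    are used, so the 'Horizontal' and 'Vertical' lines are ignored.
    """
    ratio_match = re.search(
        r"^(?:Final\s+)?Split ratio:\s*(.+)$",
        response,
        flags=re.IGNORECASE | re.MULTILINE,
    )
    prompt_match = re.search(
        r"^(?:Final\s+)?Regional Prompt:\s*(.*)",
        response,
        flags=re.IGNORECASE | re.MULTILINE | re.DOTALL,
    )
    if ratio_match is None or prompt_match is None:
        raise ValueError("response does not contain a split ratio and prompt")

    split_ratio = ratio_match.group(1).strip()
    regions = [
        part.strip()
        for part in prompt_match.group(1).split("BREAK")
        if part.strip()
    ]
    return split_ratio, regions


if __name__ == "__main__":
    sample = (
        "Output: Horizontal Split Ratio: 1\n"
        "Vertical Split Ratio: 1,1\n"
        "Final Split Ratio: 1,1\n"
        "Final Regional Prompt:\n"
        "The girl on the left, her silver braided ponytail. BREAK\n"
        "The young man on the right, his blonde hair.\n"
    )
    ratio, regions = extract_plan(sample)
    print(ratio, len(regions))
```

A quick sanity check on any response is that the number of regions implied by the split ratio equals the number of `BREAK`-separated prompts: two columns and two prompts here, four regions and four prompts in the first log.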

RahulSinghalChicago commented 9 months ago

I was able to reproduce. For those curious, it's imperative to put a period at the end of the base prompt for consistency, at least with albedobaseXL_v20.safetensors

python RPG.py --user_prompt 'A couple, the beautiful silver braided ponytail girl on the left happy and peaceful, the handsome blonde hair young man on the right detailed gorgeous face.' --api_key 'your api key' --use_gpt --use_base --base_ratio 0.5 --base_prompt 'a couple are chatting.'

Here's the same seed with and without the ending punctuation mark.

00007-1238 00013-1238
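To avoid both the missing period and shell quoting slips when re-running this, here is a minimal Python sketch that assembles the command. It uses only the flags that appear in this thread; `ensure_period` and `build_command` are hypothetical helper names, and the need for a trailing period is just the observation reported above for albedobaseXL_v20.safetensors, not documented behavior.

```python
import shlex


def ensure_period(prompt):
    """Append a period unless the prompt already ends with . ! or ?"""
    prompt = prompt.rstrip()
    if prompt.endswith((".", "!", "?")):
        return prompt
    return prompt + "."


def build_command(user_prompt, base_prompt, api_key, base_ratio=0.5):
    """Return a shell-safe command line for RPG.py (flags as in this thread)."""
    argv = [
        "python", "RPG.py",
        "--user_prompt", user_prompt,
        "--api_key", api_key,
        "--use_gpt",
        "--use_base",
        "--base_ratio", str(base_ratio),
        "--base_prompt", ensure_period(base_prompt),
    ]
    # shlex.join quotes each argument with plain ASCII quotes as needed.
    return shlex.join(argv)


if __name__ == "__main__":
    print(build_command(
        user_prompt="A couple, the girl on the left, the young man on the right.",
        base_prompt="a couple are chatting",
        api_key="your api key",
    ))
```

Building the argument list in code also sidesteps the curly quotes that copy and paste can introduce around the API key.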