Sizes of tensors must match except in dimension 1. Expected size 8 but got size 4 for tensor number 1 in the list.

TencentARC / PhotoMaker

PhotoMaker [CVPR 2024]

https://photo-maker.github.io/


Sizes of tensors must match except in dimension 1. Expected size 8 but got size 4 for tensor number 1 in the list. #182

Open canrly opened 2 weeks ago

canrly commented 2 weeks ago

[Debug] Generate image using aspect ratio [Instagram (1:1)] => 1024 x 1024
Start inference...
[Debug] Prompt: instagram photo, portrait photo of a woman img, colorful, perfect face, natural skin, hard shadows, film grain,
[Debug] Neg Prompt: (asymmetry, worst quality, low quality, illustration, 3d, 2d, painting, cartoons, sketch), open mouth
10

Traceback (most recent call last):
/photomaker/model.py", line 49, in fuse_fn
stacked_id_embeds = torch.cat([prompt_embeds, id_embeds], dim=-1)

Last line:
Sizes of tensors must match except in dimension 1. Expected size 8 but got size 4 for tensor number 1 in the list.
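The failing call can be reproduced in isolation with a minimal sketch. The shapes are assumptions inferred from the error message (8 rows versus 4 rows) and from the 2048-wide embeddings mentioned later in this thread; the helper name `try_fuse` is hypothetical and is not part of PhotoMaker.

```python
import torch

# Assumed shapes: 8 prompt rows vs. 4 ID rows, 2048 features each.
prompt_embeds = torch.zeros(8, 2048)
id_embeds = torch.zeros(4, 2048)


def try_fuse(prompt_embeds, id_embeds):
    """Concatenate along the feature axis; return the error text on a shape mismatch."""
    try:
        # For 2-D tensors dim=-1 is dimension 1, so dimension 0 must match.
        return torch.cat([prompt_embeds, id_embeds], dim=-1)
    except RuntimeError as err:
        return str(err)


# Mismatched row counts: torch.cat refuses and we get the message back.
result = try_fuse(prompt_embeds, id_embeds)
print(result)

# Matching row counts: the same call succeeds and doubles the feature width.
ok = try_fuse(torch.zeros(4, 2048), id_embeds)
print(ok.shape)
```

This only shows that the error comes from the differing size of dimension 0; it does not say which of the two tensors has the wrong number of rows.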

rudy2steiner commented 2 weeks ago

met the same issue

channyi commented 2 weeks ago

codewritz-yuri commented 2 weeks ago

zhenhua22 commented 2 weeks ago

same issue

YIYANGCAI commented 1 week ago

I found out why dimension 0 of prompt_embeds is always twice that of id_embeds: it is because num_tokens = 2. Could anyone give a hint as to what this parameter corresponds to in the original paper?

According to the stacking strategy in the original paper, shouldn't prompt_embeds be computed by expanding the embedding of the "man" or "woman" token ([1 x 2048]) to [id_num x 2048], concatenating it with id_embeds to get [id_num x 4096], and then passing the result through the MLPs? In the code's implementation, however, prompt_embeds is sliced from the original text embedding with a length of (id_num * num_tokens).
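The two readings can be compared with a minimal shape-only sketch. Everything here is an assumption for illustration: the values id_num = 4 and num_tokens = 2 come from this thread, while the 77-row text embedding, the slice position `start`, and all variable names other than prompt_embeds and id_embeds are hypothetical. It is not the repository's code and not a confirmed fix.

```python
import torch

id_num, num_tokens, dim = 4, 2, 2048  # values discussed in this thread

# Random stand-ins for real data (illustration only).
text_embedding = torch.randn(77, dim)    # hypothetical full prompt embedding
class_token_embed = torch.randn(1, dim)  # embedding of the "man"/"woman" token
id_embeds = torch.randn(id_num, dim)

# (a) Stacking as described above: one class-token row per ID image,
#     concatenated with the ID embeddings along the feature axis.
expanded = class_token_embed.expand(id_num, -1)     # [id_num, 2048]
stacked = torch.cat([expanded, id_embeds], dim=-1)  # [id_num, 4096]
print(stacked.shape)

# (b) Slicing id_num * num_tokens rows out of the text embedding.
start = 5  # hypothetical position of the trigger token
prompt_embeds = text_embedding[start:start + id_num * num_tokens]
print(prompt_embeds.shape)  # row count is id_num * num_tokens, not id_num

# The row counts differ by the factor num_tokens, which is the mismatch
# that torch.cat reports in the traceback.
print(prompt_embeds.shape[0] // id_embeds.shape[0])
```

With num_tokens = 1 the two row counts would coincide, which is consistent with the observation that the factor of two comes from num_tokens = 2.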