How to generate image from Image+Text?

Zeqiang-Lai / Anything2Image

Generate image from anything with ImageBind and Stable Diffusion


How to generate image from Image+Text? #4

Closed bakachan19 closed 1 year ago

bakachan19 commented 1 year ago

Hi. Thanks for the great work you have provided. In the readme I saw that there are several supported tasks:

- Audio to Image
- Audio+Text to Image
- Audio+Image to Image
- Image to Image
- Text to Image
- Thermal to Image
- Depth to Image: Coming soon.

I am new to this type of application, so I was wondering whether it is possible to generate an image from image + text. For example, given an image of a dog and the text "pink flowers", I would like to generate an image that contains a dog and pink flowers. If so, could you provide the code for an example? I was looking at the code in api.py and I am a bit confused about the use of the prompt and the text. Moreover, do I need to normalize the image and text embeddings before summing them, or should I normalize the summed embedding instead?

I greatly appreciate your help. Thanks.

Zeqiang-Lai commented 1 year ago

I don't have time to implement it right now, but you can refer to https://github.com/Zeqiang-Lai/Anything2Image/blob/681958d5fb77d6063d4942034dd0a2aa310f5e13/anything2image/api.py#L76 and implement it yourself. The normalization is already handled there. In a nutshell, the text and image embeddings should not be normalized; the audio embedding should.

The stable-diffusion-unclip model we use takes two conditions: (1) a prompt and (2) a CLIP image embedding.

When we replace the CLIP image embedding with an ImageBind embedding, we get anything2image.

The prompt in api.py refers to the prompt mentioned above, i.e. condition (1). The text refers to the ImageBind text embedding, which replaces the CLIP image embedding, i.e. condition (2), and is fed into the diffusion model.
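For the image + text case asked about above, one possible approach is to blend the two ImageBind embeddings into a single vector and use that as condition (2). The following is only a minimal sketch under assumptions: `combine_embeddings` and its `image_weight` parameter are hypothetical and not part of Anything2Image, the 4-dimensional lists stand in for real ImageBind outputs obtained with `normalize=False`, and whether a plain weighted sum gives good images has not been verified in this thread.

```python
from typing import List, Sequence


def combine_embeddings(
    image_emb: Sequence[float],
    text_emb: Sequence[float],
    image_weight: float = 0.5,
) -> List[float]:
    """Blend two same-sized embedding vectors with a weighted sum.

    Hypothetical helper: the name and the `image_weight` parameter are
    illustrative and do not come from Anything2Image.
    """
    if len(image_emb) != len(text_emb):
        raise ValueError("embeddings must have the same dimensionality")
    if not 0.0 <= image_weight <= 1.0:
        raise ValueError("image_weight must be in [0, 1]")
    text_weight = 1.0 - image_weight
    return [image_weight * i + text_weight * t for i, t in zip(image_emb, text_emb)]


# Toy vectors standing in for unnormalized ImageBind embeddings of
# a dog image and the text "pink flowers" (assumed values, not real outputs).
dog_image_emb = [2.0, 0.0, 4.0, -2.0]
pink_flowers_text_emb = [0.0, 2.0, 0.0, 2.0]

combined = combine_embeddings(dog_image_emb, pink_flowers_text_emb, image_weight=0.5)
print(combined)  # [1.0, 1.0, 2.0, 0.0]
```

The combined vector would then take the place of the CLIP image embedding, while the prompt is still passed separately as condition (1). Raising `image_weight` should pull the result toward the dog image, and lowering it toward the text.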

bakachan19 commented 1 year ago

Thanks!

bakachan19 commented 1 year ago

Sorry for bothering you again. I was going through the original ImageBind code, and it looks like the image embeddings are L2-normalized: https://github.com/facebookresearch/ImageBind/blob/38a9132636f6ca2acdd6bb3d3c10be5859488f59/models/imagebind_model.py#L421

modality_postprocessors[ModalityType.VISION] = Normalize(dim=-1)

but not temperature-scaled. Is there a reason why you skip normalization in your implementation?

        if image is not None:
            Image.fromarray(image).save('tmp.png')
            embeddings = model.forward({
                imagebind.ModalityType.VISION: imagebind.load_and_transform_vision_data(['tmp.png'], device),
            }, normalize=False)
            image_embeddings = embeddings[imagebind.ModalityType.VISION]
            os.remove('tmp.png')

Thank you for your time!

Zeqiang-Lai commented 1 year ago

It was found by trial and error. I didn't dig into the theory much because of limited time.
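To make the difference concrete, here is a small sketch of what L2 normalization over the last dimension does to a vector. It uses plain Python lists instead of tensors, and `l2_normalize` is a hypothetical helper written for illustration, not ImageBind's own `Normalize` module.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float], eps: float = 1e-12) -> List[float]:
    """Scale a vector to unit L2 norm; eps guards against division by zero."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def l2_norm(vec: Sequence[float]) -> float:
    """Return the Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vec))


raw = [3.0, 4.0]           # toy stand-in for an unnormalized embedding
unit = l2_normalize(raw)   # same direction, length 1

print(l2_norm(raw))   # 5.0
print(unit)           # [0.6, 0.8]
```

Normalization keeps the direction of the embedding but discards its magnitude, whereas passing `normalize=False` keeps both. Which of the two the diffusion model responds to better is, as said above, an empirical finding rather than something derived here.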

bakachan19 commented 1 year ago

Oh, I see. Thanks.