SunzeY / AlphaCLIP

[CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
https://aleafy.github.io/alpha-clip
Apache License 2.0

Question about Alpha-CLIP combined with LLaVA-7B #14

Closed: xinli2008 closed this issue 6 months ago

xinli2008 commented 6 months ago

Sorry to bother you at a busy time; I am in a hurry to get Alpha-CLIP running with LLaVA-7B. I followed the instructions here and changed a few things. The input image and mask are as follows: [image] [mask], and the rewritten forward is as follows: [screenshot]. But the generated text does not seem to focus on the specified region, and I am wondering why. Can you give me some useful advice? Thank you!
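For reference, the standalone Alpha-CLIP usage I started from looks roughly like this (a sketch following the repo README; the checkpoint filename is illustrative, use whichever weights you downloaded):

```python
import torch
import alpha_clip
from PIL import Image
from torchvision import transforms

device = "cuda"  # assumes a GPU; the model loads in fp16 on CUDA
# Checkpoint filename is illustrative; point it at your downloaded weights.
model, preprocess = alpha_clip.load(
    "ViT-L/14", alpha_vision_ckpt_pth="clip_l14_grit+mim_fulldata.pth", device=device
)

image = preprocess(Image.open("image.jpg").convert("RGB"))  # (3, 224, 224)
mask = Image.open("mask.png").convert("L")                  # white = region of interest

# Mask transform as in the README (note: no center crop here; that mismatch
# with LLaVA's image preprocessing is exactly what gets discussed below).
mask_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((224, 224)),
    transforms.Normalize(0.5, 0.26),
])
alpha = mask_transform(mask)

with torch.no_grad():
    feats = model.visual(
        image.unsqueeze(0).half().to(device),
        alpha.unsqueeze(0).half().to(device),
    )
```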

xinli2008 commented 6 months ago

I also tried another method: given an original image and a mask image (mode L), I combine them into a single alpha (RGBA) image. The combined image and the changed, rewritten forward are as follows: [image] [screenshot].
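The combination step is essentially this (a minimal sketch with placeholder filenames; the mask must be the same size as the image):

```python
from PIL import Image

image = Image.open("image.jpg").convert("RGB")
mask = Image.open("mask.png").convert("L")   # mode-L mask, white = region to keep
rgba = image.copy()
rgba.putalpha(mask)                          # attach the mask as the alpha channel
rgba.save("alpha_image.png")
```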

My prompt is: "Describe this image and its style in a very detailed manner." The generated text is: "The image features a large black and white dog with a smile on its face, likely a German Shepherd. The dog is the main focus of the image, taking up a significant portion of the frame. The dog appears to be enjoying a moment of happiness, possibly posing for a picture." I don't think the result is very good, because the model still seems to be paying attention to the black areas of the image. Can you give me some useful advice?

SunzeY commented 6 months ago

Your mask transform should include a center crop, the same as the LLaVA preprocessing does.
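Something along these lines (a sketch; the 336 px resolution is an assumption based on the ViT-L/14@336 vision tower that LLaVA-1.5 uses, so adjust it to your setup):

```python
from torchvision import transforms

n_px = 336  # input resolution of the vision tower (assumption: LLaVA-1.5 uses 336)
mask_transform = transforms.Compose([
    transforms.ToTensor(),
    # Resize the short side, then center-crop, exactly like the image pipeline,
    # so every mask pixel stays aligned with its image patch.
    transforms.Resize(n_px, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(n_px),
    transforms.Normalize(0.5, 0.26),  # alpha normalization from the Alpha-CLIP README
])
```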

xinli2008 commented 6 months ago

Thank you for your kind reply~! I tried that method; unfortunately, it still does not work after adding transforms.CenterCrop. The visualized image and mask are as follows: [image] [mask]. The mask area is correctly mapped onto the original image, but the generated text is the same as the result without Alpha-CLIP.

XuRui314 commented 6 months ago

Is this problem solved? Maybe LLaVA needs to be fine-tuned to match Alpha-CLIP's output?

xinli2008 commented 6 months ago

No! We suspect that Alpha-CLIP works as a kind of attention mechanism. As in the image above, it is harder when the main subject of the image is, say, a banana and we want to use Alpha-CLIP to focus on the background area instead. Good luck!

XuRui314 commented 6 months ago

I did encounter this problem when using my own models (the LLMs are Vicuna and GLM): the generated text does not seem to focus on the specified region either. Since the paper mentions that they fine-tuned LLaVA-1.5 with Alpha-CLIP, I doubt the zero-shot stitching ability of Alpha-CLIP.

xinli2008 commented 6 months ago

It may depend on the image: when the subject of the image is very salient and we try to ignore it completely, that may be difficult for Alpha-CLIP combined with LLaVA.

SunzeY commented 6 months ago

Hi, I believe your case works to some degree when using Alpha-CLIP: [case1] [case2]. The official demo is available now! You can check it out on your own.

XuRui314 commented 6 months ago

> Hi, I believe your case works to some degree when using Alpha-CLIP: [case1] [case2]. The official demo is available now! You can check it out on your own.

Thanks for sharing the demo. May I ask whether it is necessary to fine-tune Alpha-CLIP when stitching it into a new MLLM? For me, zero-shot stitching does not work well.

SunzeY commented 6 months ago

Our demo doesn't involve any fine-tuning of LLaVA; it only replaces the original CLIP with Alpha-CLIP. The code is also available in demo/with_llm. I believe a bit of fine-tuning can help you get better results.
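Conceptually, the swap looks something like the sketch below (illustrative names only, not the repo's code; see demo/with_llm for the real implementation):

```python
import torch

class AlphaVisionTower(torch.nn.Module):
    """Hypothetical stand-in for LLaVA's CLIP vision tower: same role,
    but the forward pass also takes the alpha mask. Class and attribute
    names here are made up for illustration."""

    def __init__(self, alpha_clip_visual):
        super().__init__()
        self.visual = alpha_clip_visual  # Alpha-CLIP's visual encoder

    @torch.no_grad()
    def forward(self, images, alpha):
        # images: (B, 3, H, W), alpha: (B, 1, H, W), both preprocessed
        # with the same resize/center-crop so they stay aligned.
        return self.visual(images, alpha)

# Schematic wiring (commented out because the surrounding LLaVA code must
# also thread `alpha` through to this forward call):
# llava_model.model.vision_tower = AlphaVisionTower(alpha_clip_model.visual)
```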

XuRui314 commented 6 months ago

Thanks for replying :)