facebookresearch / chameleon

Repository for Meta Chameleon, a mixed-modal early-fusion foundation model from FAIR.
https://arxiv.org/abs/2405.09818

Issues with Multimodal Inference Model Responses #44

Closed by puar-playground 2 months ago

puar-playground commented 3 months ago

I tried to run multimodal inference following this demo code, but the model keeps responding with excuses such as:

I'm unable to meet that request. I must politely decline that, sorry. I'm sorry, but that's something I cannot do. I'm sorry, but I'm unable to comply with that request.

Has anyone else encountered this issue? I suspect I may have loaded the model incorrectly, but I followed the instructions here line by line:

https://github.com/facebookresearch/chameleon/blame/3356bda40896f73d8c8d03c19694ec1607c477ed/chameleon/inference/examples/multimodal_input.py#L9-L24

ericflo commented 3 months ago

I am seeing a similar thing. I thought it had to do with the VQGAN image encoder going haywire on certain inputs, but every once in a while a request succeeds after a few refusals, so the capability seems to be there. Things I've been trying: recognizing plants, food, OCR, nothing I'd consider dangerous. Wondering if there's a prompt strategy anyone's found that minimizes incorrect refusals.

bks5881 commented 3 months ago

I have the same problem. I tried different prompts, but all got the same or similar responses. :/ Edit: I was able to get simple responses to questions like "What language is in the image?", although the answers were incorrect. I wanted to try OCR tasks and see how the model performed. Any advice on what kinds of prompts work? I tried prompts that work fine for Llama 2, Llama 3, and LLaVA 1.6, but here all I get is "Sorry, I cannot blah blah".

jacobkahn commented 2 months ago

Question scope generally needs to be pared down to reduce the likelihood that the model will refuse to answer, i.e. more specific questions about an image that aren't open-ended.
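A minimal sketch of this strategy, assuming the `prompt_ui` message format used in the repo's `multimodal_input.py` example (the helper function and prompt strings below are hypothetical, added only for illustration):

```python
# Hypothetical helper illustrating the advice above: ask narrow,
# concrete questions about the image rather than open-ended ones.

def build_prompt_ui(image_path: str, question: str) -> list[dict]:
    """Assemble a mixed-modal prompt in the style of the repo's
    multimodal_input.py example (message format assumed, not verified)."""
    return [
        {"type": "image", "value": f"file:{image_path}"},
        {"type": "text", "value": question},
        {"type": "sentinel", "value": "<END-OF-TURN>"},
    ]

# An open-ended prompt like this one is reportedly more likely to be refused:
open_ended = build_prompt_ui("photo.jpg", "Tell me everything about this image.")

# A narrowly scoped question about the same image tends to fare better:
scoped = build_prompt_ui("photo.jpg", "What species of plant is shown in this photo?")
```

The resulting list would be passed as `prompt_ui=` to the model's `generate` call, as in the linked demo; the only change is in how specific the text turn is.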