Closed: PhysicianHOYA closed this issue 2 weeks ago
@PhysicianHOYA I have encountered the same issue. After some debugging, I think the problem lies in modeling_llava.py, where self.vision_tower(...) is called without the output_attentions
argument, so it never returns the attentions from the CLIP vision model.
Unfortunately, I do not know an elegant solution for now. Nevertheless, as a temporary fix, you can modify the code above to manually add the output_attentions
argument, and the ViT relevancy map will then be shown in this app.
@xyliu-cs Thanks for your help. But I still can't get this code to work. If you have solved this problem, could you please provide a working code example?
Hi @PhysicianHOYA, you can try changing the line of code I mentioned earlier to:
image_outputs = self.vision_tower(pixel_values, output_hidden_states=True, output_attentions=True)
then reinstall the transformers library and rerun lvlm-interpret.
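To illustrate why the patch matters: Hugging Face style model outputs only populate attentions when output_attentions=True is passed explicitly, so the caller in modeling_llava.py must forward that flag. The sketch below is a minimal stand-in (DummyVisionTower and VisionOutput are hypothetical names, not the real transformers classes) that mimics this convention.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class VisionOutput:
    # Mirrors the HF convention: optional fields stay None unless requested.
    hidden_states: List[str]
    attentions: Optional[List[str]] = None

class DummyVisionTower:
    """Hypothetical stand-in for the CLIP vision tower call."""
    def __call__(self, pixel_values, output_hidden_states=False,
                 output_attentions=False):
        hidden = ["h0", "h1"] if output_hidden_states else []
        # Attentions are only materialized when explicitly asked for.
        attn = ["attn0", "attn1"] if output_attentions else None
        return VisionOutput(hidden_states=hidden, attentions=attn)

tower = DummyVisionTower()

# The original call omits output_attentions, so attentions come back empty.
out = tower("pixels", output_hidden_states=True)
print(out.attentions)  # None

# The patched call requests attentions, giving the relevancy map its data.
out = tower("pixels", output_hidden_states=True, output_attentions=True)
print(out.attentions is not None)  # True
```

The same pattern applies to the real call site: without the flag, downstream code that reads image_outputs.attentions sees None and the ViT relevancy map stays blank.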
@xyliu-cs Thank you very much for your help; the code now runs normally.
However, after uploading a picture, the following problem occurred. Could you please help me resolve it? Thank you very much.