Closed by skshvl 10 months ago
Hi, thanks for the excellent description of your question. I understand it is complicated. The solution is complicated because the shap library itself is complex. This is how we managed to implement what we wanted, and yes, I spent a lot of time reading the shap library to get it done. There is surely a better way to do it, so feel free to adapt the code to your needs.
Your question is closely related to my answer in #2, so would you be so kind as to read that first? Once you have, tell me what is still unclear and I will try to give you a more tailored answer. Thanks!
@LetiP Thank you, I actually think I understand now after thinking more about your explanation of get_model_prediction. It seems like inputs is actually a variable outside that function that get_model_prediction() is able to access by virtue of it being a global variable within the .py file being run. So the image data does not need to be directly passed to Explainer, since Explainer accesses get_model_prediction() which accesses inputs which has the image pixel data that can be masked. Thanks!
Exactly. Thanks, you got it! 👏
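In case it helps others reading this later, here is a stripped-down sketch of that pattern. It is not the code from mm-shap_clip_dataset.py: toy_model, PATCH, the made-up token ids, the "text ids first, then patch ids" ordering, and the use of 0 as the mask value are all assumptions chosen only to keep the example small. The point it illustrates is that the prediction function reads inputs from module scope, so the real pixels never have to travel through X.

```python
import numpy as np

# Hypothetical stand-in for the processor output (assumed structure).
inputs = {
    "input_ids": np.array([[101, 2023, 102]]),  # 3 made-up text token ids
    "pixel_values": np.ones((1, 3, 4, 4)),      # 1 image, 3 channels, 4x4 pixels
}
PATCH = 2  # assumed patch side length, giving 2x2 = 4 patches


def toy_model(input_ids, pixel_values):
    """Toy scoring function that depends on both text and image."""
    return float(pixel_values.sum() + (input_ids != 0).sum())


def get_model_prediction(x):
    """Score every row of x; `inputs` is read from module (global) scope."""
    n_text = inputs["input_ids"].shape[1]
    scores = []
    for row in x:
        text_ids = row[:n_text].reshape(1, -1)
        patch_ids = row[n_text:]
        # The real image data comes from the global `inputs`, not from x.
        pixels = inputs["pixel_values"].copy()
        per_side = pixels.shape[-1] // PATCH
        for k, pid in enumerate(patch_ids):
            if pid == 0:  # assumption: 0 marks a masked patch
                r, c = divmod(k, per_side)
                pixels[:, :, r * PATCH:(r + 1) * PATCH,
                       c * PATCH:(c + 1) * PATCH] = 0
        scores.append(toy_model(text_ids, pixels))
    return np.array(scores)
```

Calling get_model_prediction with one unmasked row and one row whose first patch id is set to 0 gives two different scores, even though neither row contains any pixel values.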
As part of my thesis, I am trying to understand the code in mm-shap_clip_dataset.py, and I'm a bit stumped at the following section, in which we generate the tensor X which is passed to the Explainer instance to generate masks and then SHAP values. I am concerned that in the code as it is written here, X ends up containing no image data -- or at least, I do not understand how it does.
Specifically, X consists of a concatenation of two things: image_token_ids (image) and inputs.input_ids (text).
But while the inputs object contains both text and image data, image_token_ids seems to take no image data from the inputs object's pixel_values (other than in its shape).
Then, by the time we generate the concatenation X, we are combining inputs.input_ids and image_token_ids without having added anything to image_token_ids.
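To make the concern concrete, here is a rough sketch of how I read that section. This is my own simplification with made-up values, not the actual code from the script: the token ids, the patch size, the concatenation order, and the use of NumPy are all assumptions for illustration.

```python
import numpy as np

input_ids = np.array([[101, 2023, 102]])  # made-up text token ids
pixel_values = np.zeros((1, 3, 4, 4))     # only the shape is used below
patch = 2                                 # assumed patch side length

n_patches = (pixel_values.shape[-1] // patch) ** 2
# Placeholder ids 1..n_patches: they label patch positions, not pixel content.
image_token_ids = np.arange(1, n_patches + 1).reshape(1, -1)

# X holds text ids plus patch placeholders, and no pixel values at all.
X = np.concatenate([input_ids, image_token_ids], axis=1)
```

In this sketch X has shape (1, 7), and none of its entries depend on what the image actually looks like.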
Right after X is assigned, we create an Explainer and pass X to it.
So what I am trying to understand is: how does the explainer get any access to the image data when X consists only of the text data plus the blank image_token_ids? I would appreciate any input, thanks!