Heidelberg-NLP / MM-SHAP

This is the official implementation of the paper "MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks"
https://aclanthology.org/2023.acl-long.223/
MIT License

Call the model for each "new image" generated with masked features in get_model_prediction(x) #2

Closed · hcliucs closed this issue 1 year ago

hcliucs commented 1 year ago

I noticed that in the "get_model_prediction(x)" function, the model is called inside the outer loop (e.g., for i in range(input_ids.shape[0])). Shouldn't it be called within the inner loop (e.g., for k in range(masked_image_token_ids[i].shape[0]))? Also, how would I extend this function to trimodal models: with a triple loop, or with two parallel inner loops?

LetiP commented 1 year ago

Hi, thanks for the question.

Calling explainer = shap.Explainer(get_model_prediction, custom_masker, silent=True) results in a cascade of calls to the model we are interpreting, each with a different masking pattern (you can see this when you print x). Because of the different nature of the image and text modalities, different masking is necessary, and we implement it as follows:
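For illustration, here is a minimal, self-contained sketch of how a callable masker plugs into shap.Explainer. The toy_prediction and toy_masker functions are stand-ins for this example only, not the repository's code; printing x inside the prediction function shows the different masking patterns SHAP generates:

```python
import numpy as np
import shap

def toy_prediction(x):
    # x: (n_coalitions, n_features) -- one row per masking pattern chosen by SHAP.
    print(x)  # inspect the masking patterns SHAP generates
    return x.sum(axis=1).astype(float)  # placeholder score, one per row

def toy_masker(mask, x):
    # mask: boolean vector from SHAP; x: the original feature row being explained.
    masked = x.copy()
    masked[~mask] = 0  # masked entries are simply set to 0
    return masked.reshape(1, -1)

explainer = shap.Explainer(toy_prediction, toy_masker, silent=True)
X = np.arange(1, 9).reshape(1, -1)  # one toy sample with 8 "tokens"
shap_values = explainer(X)
```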

In custom_masker we set the ids of masked image and text tokens to 0. For text, that is all the masking we need. For images, however, a 0 id only marks a region as masked: in get_model_prediction we go back to the original image and set the actual pixel values to 0 for each region that custom_masker marked with 0. Once that masking is done (i.e. after the inner loop in get_model_prediction has finished), we call the model to run the prediction on the fully masked inputs.
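Schematically, the division of labour looks roughly like the sketch below. The layout constants, the random image, and model_score are assumptions made for this example, not the repository's actual code:

```python
import numpy as np

# Assumed, simplified layout: the first N_TEXT entries of x are text token ids,
# the rest are image "token" ids, one per PATCH x PATCH pixel block (row-major).
N_TEXT = 4
PATCH = 32
image = np.random.rand(224, 224, 3)          # stand-in for the original image
text_ids = np.array([101, 2023, 2003, 102])  # stand-in for the tokenized sentence

def model_score(token_ids, pixels):
    # Placeholder for the real model call; returns one score per input.
    return float(token_ids.sum()) + float(pixels.mean())

def custom_masker(mask, x):
    # Stage 1: mark every masked entry (text or image) with id 0.
    masked = x.copy()
    masked[~mask] = 0
    return masked.reshape(1, -1)

def get_model_prediction(x):
    # Stage 2: x has one row per masking pattern produced by SHAP.
    scores = np.zeros(x.shape[0])
    patches_per_row = image.shape[1] // PATCH
    for i in range(x.shape[0]):
        masked_text = np.where(x[i, :N_TEXT] == 0, 0, text_ids)   # text: token space
        img = image.copy()
        image_token_ids = x[i, N_TEXT:]
        for k in range(image_token_ids.shape[0]):                 # image: pixel space
            if image_token_ids[k] == 0:
                r, c = divmod(k, patches_per_row)                 # k-th patch -> pixel block
                img[r * PATCH:(r + 1) * PATCH, c * PATCH:(c + 1) * PATCH] = 0
        # The model is called here, once per masking pattern, only after the
        # inner loop has finished zeroing all masked patches.
        scores[i] = model_score(masked_text, img)
    return scores
```

The key point for the original question is that the model call sits in the outer loop: the inner loop only prepares the pixel-masked image for that particular masking pattern, and the model is then run once on the result.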

As for your question about trimodal models, I would need to know which third modality you are working with, because the exact masking procedure (what information you delete and how) is modality-specific, as our different handling of image and text shows: we mask text in token space and images in pixel space.

The choice to mask in pixel space is also motivated by the fact that our paper compares models that patchify the image (CLIP) with models that use an image feature extractor (LXMERT with FasterRCNN). With LXMERT in particular, masking one image token removes only one region, and that region can overlap with another region selected by FasterRCNN, so the same part of the image may still reach the model; masking in token space therefore does not delete the information entirely. That is why we mask in pixel space before the image tokens are computed, to make sure the information from a given region is not in the features to begin with. Pixel space is also where the image has the same data format for all models, which gives us a consistent and comparable way of masking across them.
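To make the LXMERT case concrete, here is a small sketch of masking detected regions in pixel space, so that an overlapping region cannot re-introduce the erased content. The box coordinates and the helper name are made up for this illustration:

```python
import numpy as np

# Hypothetical region boxes from a FasterRCNN-style detector, as (x0, y0, x1, y1) in pixels.
boxes = np.array([[10, 10, 120, 120],
                  [80, 60, 200, 180]])   # note: the two regions overlap

def mask_regions_in_pixel_space(image, boxes, masked_region_ids):
    # Zero the pixels of every masked region *before* features are recomputed,
    # so the erased content cannot survive via an overlapping region's features.
    img = image.copy()
    for r in masked_region_ids:
        x0, y0, x1, y1 = boxes[r]
        img[y0:y1, x0:x1] = 0
    return img

image = np.random.rand(224, 224, 3)
masked = mask_regions_in_pixel_space(image, boxes, masked_region_ids=[0])
```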

So, for another modality, you need to decide whether to mask in token space or in input space and, if the latter, write the corresponding routine that implements the modality-specific masking.
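As a purely hypothetical example for a third modality (raw audio, which the paper does not cover), input-space masking could look like the sketch below; the segment length and helper name are assumptions:

```python
import numpy as np

# Hypothetical third modality: raw audio, masked in input space by zeroing
# fixed-length waveform segments (one segment per "audio token").
SEGMENT = 1600  # e.g. 0.1 s at 16 kHz -- an assumption, not from the paper

def mask_audio_in_input_space(waveform, masked_segment_ids):
    # Zero the samples of every masked segment before audio features are computed,
    # analogous to pixel-space masking for images.
    wav = waveform.copy()
    for k in masked_segment_ids:
        wav[k * SEGMENT:(k + 1) * SEGMENT] = 0.0
    return wav

waveform = np.random.randn(16000)  # 1 s of fake audio
masked_wav = mask_audio_in_input_space(waveform, masked_segment_ids=[2, 5])
```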