aenglebert / polycam

High resolution XAI method for convolutional neural networks

Does Poly-cam work for image regression? #1

Open MaxPolak97 opened 1 year ago

MaxPolak97 commented 1 year ago

Thank you for your interesting work; I would definitely agree that Poly-CAM is the new state-of-the-art explainability method. I have been using it for my thesis, Decoding the Art of Robot Tactile Learning with Explainable Neural Networks for Incipient Slip Sensing. I am using Poly-CAM to explain how my model predicts the state of slip based on image regression, as shown in the figure.

I want to compare Poly-CAM to Grad-CAM and Integrated Gradients using the faithfulness scores. However, both the insertion and deletion scores seem to be incorrect because I am using image regression instead of classification. Would you have any idea how to tackle this?

Using `model_softmax = torch.nn.Sequential(model, torch.nn.Softmax(dim=-1))` always results in a value of 1 for both deletion and insertion.

Passing in the model alone instead results in values > 1 (like 53.5), whereas the metrics should be between 0 and 1.
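A quick check confirms why the softmax version saturates (softmax over a single regression output is always exactly 1):

import torch

logit = torch.tensor([[53.5]])         # single-output regression head
print(torch.softmax(logit, dim=-1))    # tensor([[1.]])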


aenglebert commented 1 year ago

Hello. Thank you for your interest.

I did not define it for regression, but it can certainly be adapted!

What is the result of your regression? Coordinates? A distance measurement? Is there a maximum and a minimum?

The solution will depend on your output distribution, but here are some possibilities that I can already imagine:

MaxPolak97 commented 1 year ago

Thank you for your prompt response and assistance. I wanted to clarify that our regression model predicts the degree of incipient slip as a continuous value within the range 0-100. However, as the model is unbounded, predictions are clipped to this range. I think the MSE is the most feasible approach, as small errors are tolerable while large errors could cause issues for the use case.

I was wondering if you had any thoughts on how to implement this?

Comparing these methods using the faithfulness metrics will be valuable for recommending the best XAI method in the domain of tactile sensing. From examining the saliency maps, I observed that Poly-CAM outperforms the other methods due to its high resolution and its ability to handle multiple objects, which is necessary to determine which markers contributed to the prediction.

I am eagerly awaiting your response and appreciate your help.

aenglebert commented 1 year ago

I'm not aware of a faithfulness metric for this specific task (but I don't know everything). As for the insertion and deletion metrics, since they are mainly defined for classification, you should check whether you can use them without modification or whether you need to adapt them, in which case they become entirely new metrics (and so need validation).

I don't know enough about the field of your task to be sure what makes sense in terms of scoring, so I can try to give you some leads and avenues for discussion, but you will need to check that they are OK for your use case.

A few words about XAI methods before going further. For most (all?) XAI methods initially defined for classification, the explanation is produced to highlight pixels that maximize the selected output (the class to explain, in the case of classification). In the case of a regression, it's fine to use them as-is if you are trying to explain why the output value is high. If you want to explain why it is close to a specific value, you will need to redefine the method as well.

And this should also be aligned with the metric!

If you generate an explanation to show which pixels are linked to a maximization of your regression output value, you can define the metric as quantifying what causes the output to increase or decrease, similar to insertion and deletion in the case of classification. What seems simplest in your case is to divide the clipped output by 100, allowing you to use the insertion and deletion metrics without modification, for example as sketched below.
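A minimal sketch of that first option (my illustration, not code from this repo; NormalisedRegressor is a hypothetical name):

import torch

class NormalisedRegressor(torch.nn.Module):
    # wraps a 0-100 regression model so it outputs a 0-1 score
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, x):
        # clip to the valid 0-100 range, then scale to 0-1
        return torch.clamp(self.model(x), 0, 100) / 100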

If you want to explain why the model predicts a specific value and not a different one (whether higher or lower), you need to modify either the model used with both the method and the metric, or the method and metric themselves. I'm not sure, but maybe changing the output of the model on a masked image could do the trick; in your case, maybe something similar to `(1 - (target_value_to_explain - predicted_value_for_modified_input)/100)`? In this case I would probably need to discuss a bit more and/or read a few papers on your subject to be sure of what makes sense to do or not.
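And a hedged sketch of that second option (again my illustration; TargetScoringModel and target_value are hypothetical names, and I added an absolute value so deviations in either direction lower the score):

import torch

class TargetScoringModel(torch.nn.Module):
    def __init__(self, model, target_value):
        super().__init__()
        self.model = model
        self.target_value = target_value

    def forward(self, x):
        # 1 when the prediction matches the target, lower as it deviates
        return 1 - torch.abs(self.target_value - self.model(x)) / 100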

MaxPolak97 commented 1 year ago

Thank you for the ideas and clarification. In my case, I want to validate the new approach of using temporal difference images for capturing incipient slip. The left figure shows this for difference images and the right for the raw image. For the raw image, we humans, but also the AI, can't actually see the slip directly; yet the AI has learned to use the borders of the image and to look at all the markers, probably to measure the displacement (this was tested on multiple samples). For temporal difference images (raw tactile image before contact minus raw tactile image at time T), which capture only the change in deformation, we humans can see how slip evolves from the image, and the AI is looking at this part as well.


So in this case, I will be modifying this part of the code (shown below): dividing `self.model(image)` by 100 and taking the sigmoid. But in the case of classification, drop and increase are based on the confidence score, right? In my case, both a drop and an increase in score are problematic.

base_score = torch.softmax(self.model(image), dim=1)[:, class_idx]
score = torch.softmax(self.model(masked_image), dim=1)[:, class_idx]
drop = torch.maximum(torch.zeros(1).to(base_score.device), base_score - score) / base_score
increase = int(score > base_score)
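A possible adaptation, as a hedged sketch (my illustration, mirroring the snippet above rather than the repo's actual code): treat any deviation from the baseline as a drop, since both directions are problematic here.

base_score = torch.clamp(self.model(image), 0, 100) / 100        # scale 0-100 output to 0-1
score = torch.clamp(self.model(masked_image), 0, 100) / 100
drop = torch.abs(base_score - score)    # any change counts as a "drop"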

In the paper, I also read that using intermediate layers can help improve the resolution and localization. In my case, it looks more like the features detected in earlier layers get highlighted more strongly, while the normal function provides its final attention, as we use the final activation for that. What is your opinion about this?


aenglebert commented 1 year ago

Hi!

Be warned that if you divide your 0-100 score by 100, you should not use a sigmoid on top of it. A sigmoid is for an output similar to a binary-classification logit that ranges (potentially) from -inf to +inf, but that is not your case if I understood correctly.

The goal of using intermediate layers is indeed to increase the resolution of the map, since you use activations with higher spatial resolution; but because they are less specific, we use the last layers to weight them and make them class-specific. In your case, since you don't have classes, you need to define a scoring function that makes sense in place of the class output.

If both an increase and a decrease in the value are problematic and you want to explain the exact value produced by the model, you can't use an XAI method and/or metric "off the shelf" directly on your output, even normalised to the 0-1 range. You would get heatmaps that only highlight pixels favoring an increase of the model's output value towards 100 (or 1 if divided by 100), while pixels moving the output towards 0 would not be considered important by either the method or the metric.

Here is what you can try (I have never tried it, but I think it could make sense; it may need some iteration). You define a wrapper around your model to produce a scoring model:

import torch

class ScoringModel(torch.nn.Module):
    def __init__(self, model):
        super(ScoringModel, self).__init__()
        self.model = model
        self.last_output = None

    def set_baseline(self, x):
        # store the prediction for the unmodified image as the reference
        self.last_output = self.model(x)
        return self.last_output

    def forward(self, x):
        assert self.last_output is not None, "No previous output to compare with"
        # 1 when the prediction equals the baseline, lower as it deviates
        # in either direction (output assumed clipped to the 0-100 range)
        score = 1 - torch.abs(self.last_output - self.model(x)) / 100
        return score

scoring_model = ScoringModel(model)

And for each image, you first call the scoring model's .set_baseline method to adapt it to the evaluated image, then run the evaluations (XAI method or metric):

# set the baseline to current image output
scoring_model.set_baseline(image1)

# then call the xai method initialised around scoring_model on the current image
xai_method(image1, ...

This causes the scoring_model to produce a value of 1 when the output of the model is unchanged compared to image1, and to reduce the score when the value is modified, whether increased or decreased. However, in this case the score can never reach 0, unless the initial image1 output is 0 or 100 (for an output clipped between 0 and 100, of course). So you may want to improve `1 - torch.abs(self.last_output - self.model(x))/100` into something that produces 0 when you are close to 0 and/or 100, like subtracting the minimum of the output and 100 - output, or something like that.
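One hedged variant of that idea (my sketch, untested; it divides by the largest possible deviation rather than subtracting, so a fully shifted output scores exactly 0):

def refined_score(last_output, new_output):
    # largest move the clipped 0-100 output could make from the baseline
    max_deviation = torch.maximum(last_output, 100 - last_output)
    return 1 - torch.abs(last_output - new_output) / max_deviation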

What do you think?

MaxPolak97 commented 1 year ago

This would make sense. It can be achieved by simply adding this:

# force the score to 0 when the masked output reaches an extreme
# (note: the first condition needs `and`; with `or` it is always true)
if (last_output != 0 and last_output != 100) and (output2 < 1 or output2 > 99):
    score = 0

It might also be possible to remove the torch.clip(), as predicted values above 100 or below 0 are wrong anyway. But that would be more difficult to normalize between 0 and 1, I think.

Making the score go to zero faster as the difference between self.last_output and self.model(x) grows can be achieved with this (MAE => MSE):

score = 1 - torch.pow(torch.abs(self.last_output - self.model(x)), 2) / 100

aenglebert commented 1 year ago

Since it may introduce a sudden step (for example, the score dropping straight to 0 as soon as the output moves from 1 to 0.999...), I would prefer a more progressive approach, but it's up to you.

Be aware that with your current score computation using pow, the final score will be between -99 and 1, since the squared difference can go far above 100 (up to 10,000 for a clipped 0-100 output).

If you absolutely want to use a square to amplify the difference between small and large variations, you can also try applying the square to the score itself:

# square the 0-1 score; keeping the abs() so the score stays in 0-1
score = torch.square(1 - torch.abs(self.last_output - self.model(x)) / 100)

Since the score is below or equal to 1, squaring it reduces low values (those resulting from larger differences) more strongly.
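For concreteness, a quick numeric check of the two squaring options (illustrative values only):

diff = torch.tensor([10., 50., 100.])           # |last_output - new output|
mse_style = 1 - diff.pow(2) / 100               # -> 0., -24., -99.  (unbounded below)
squared_score = torch.square(1 - diff / 100)    # -> 0.81, 0.25, 0.  (stays in 0-1)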

In all cases, it will need some testing and adaptation. More than the range itself, what matters is the distribution within that range: different normalisations can produce very different results, closer to or further from what you want.

But that's research 😉.

MaxPolak97 commented 1 year ago

Hi Alexander,

Could you tell me some more about the Channel-wise Increase of Confidence?

The original PCAM- gives a black image, but changing the deletion part to use the metric as discussed makes it good :). However, it is then exactly the same as PCAM+, so I wonder how PCAM+ already works desirably well without that change. What is your opinion?