Open AlexisTabin opened 8 months ago
After further investigation, I noticed that in the original implementation the MAD value is computed over the whole X_train dataset. If I understand correctly, the MAD factor should have no impact on univariate data, which is the type of data I am working with. Additionally, using an L1 norm should guarantee that the resulting changes are sparse. However, I am confused by the results I obtained with the Wachter CF technique and am not sure what could be going wrong. Do you have any ideas?
Hi @AlexisTabin,
we based our implementation of the W-CF approach on https://github.com/carla-recourse/CARLA/blob/main/carla/recourse_methods/catalog/wachter/library/wachter.py. Note that neither of these implementations is the original implementation belonging to the paper, so either of them might not be 100% correct.
Regarding the loss calculation, we use the MSE (for the change in prediction) in combination with the L1 norm; both are combined in line 94. We indeed do not weight the L1 norm with the Median Absolute Deviation. Introducing sparsity by optimizing the L1 norm should be sufficient given the iterative optimization of the lambda (weighting) factor. Unfortunately, the difference between L1 with and without MAD is not investigated in the Wachter paper, so I would not rule out that weighting with the MAD makes a difference. Building an additive optimization function is quite tricky with respect to the weights of each term, so predicting the optimization outcome without experiments is hard. Further, the MAD could make a difference because W-CF assumes that every time step is an "independent feature".
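To make the two variants concrete, here is a minimal sketch of the combined loss described above. This is not the repository's exact code; `wachter_loss`, `lam`, and `mad` are illustrative names, and the structure follows the Wachter et al. formulation (lambda-weighted prediction term plus a distance term).

```python
import numpy as np

def wachter_loss(x_cf, x_orig, pred, target, lam, mad=None):
    """Wachter-style loss: prediction term plus distance term.

    pred/target are model outputs; lam is the iteratively increased
    trade-off weight. If `mad` (per-feature median absolute deviation)
    is given, the L1 distance is weighted feature-wise by 1/MAD as in
    the original paper; otherwise a plain L1 norm is used.
    """
    prediction_term = (pred - target) ** 2      # squared error on the output
    diff = np.abs(x_cf - x_orig)
    if mad is not None:
        diff = diff / mad                       # MAD-weighted Manhattan distance
    distance_term = diff.sum()
    return lam * prediction_term + distance_term
```

The iterative lambda schedule then increases `lam` until the counterfactual crosses the desired prediction, so the relative weight of the distance term is tuned automatically rather than fixed in advance.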
Why is omitting the MAD a thing? Not using the MAD has the really nice advantage that the approach is applicable without access to the training data. :) For that reason, many other approaches based on Wachter do not include the MAD (e.g., Dandel et al., or the ALIBI implementation). As we built the optimization of TSEvo on the functions of Dandel et al., we decided to use the same sparsity calculation for comparison reasons. And yes, including both versions of W-CF (with and without MAD) would be even more meaningful.
As for the results, I am not sure whether anything actually went wrong. Because the gradient descent treats every time step as an independent feature, it can easily spread many minor changes across the series that together flip the classification (i.e., the result is rather an adversarial sample than a counterfactual). You can also see the high sparsity in the benchmarking in, e.g., fig. 3 of this paper.
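A quick way to check whether this is what is happening is to measure how many time steps the counterfactual actually touches. This is a hypothetical helper, not part of either library; `tol` is an assumed tolerance for treating a step as "unchanged":

```python
import numpy as np

def sparsity(x_orig, x_cf, tol=1e-6):
    """Fraction of time steps left unchanged by the counterfactual.

    A value near 1.0 means only a few steps were modified (sparse);
    gradient descent over all steps tends to spread tiny changes
    everywhere, which pushes this value toward 0.
    """
    changed = np.abs(x_cf - x_orig) > tol
    return 1.0 - changed.mean()
```

If the score is low but every individual change is tiny, the result looks more like an adversarial perturbation than a sparse counterfactual.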
Hello there!
I'm currently adapting CF techniques from classification to regression, and I have a small problem with the Wachter et al. technique producing CFs that are not sparse at all (see attached fig.).
From what I read in the Wachter et al. paper, they use a Manhattan distance weighted feature-wise by the inverse median absolute deviation (MAD) as the cost, to ensure the sparsity of the CF.
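For reference, the feature-wise weights can be computed from the training set like this. This is a sketch under the paper's definition, not code from either repository; `mad_weights` and `eps` are illustrative names:

```python
import numpy as np

def mad_weights(X_train, eps=1e-8):
    """Inverse median absolute deviation per feature, as in Wachter et al.

    The Manhattan distance is weighted feature-wise by 1/MAD, so moving
    a feature with high natural spread is cheap and moving a stable
    feature is expensive. For univariate data there is a single weight,
    which only rescales the distance and cannot change which feature
    is cheapest to move.
    """
    med = np.median(X_train, axis=0)
    mad = np.median(np.abs(X_train - med), axis=0)
    return 1.0 / (mad + eps)   # eps guards against constant features
```

Note that this requires access to X_train, which is exactly the dependency the reply above points out as a reason some implementations drop the MAD.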
Here, it seems that the cost function only includes the Manhattan distance (when the norm is set to 1).
Is it possible that the cost function is wrong and thus the method fails to provide sparse CF results? Or is there something else I don't understand?