amparore / leaf

A Python framework for the quantitative evaluation of eXplainable AI methods

About Drug Consumption Dataset #4

Closed singsinghai closed 1 year ago

singsinghai commented 1 year ago

Hi @amparore,

You mentioned the use of the Drug Consumption dataset in your experiments. I wonder how you decided what the label for that dataset is, and how it becomes a binary classification problem.

I have tried to run LEAF on regression and multi-class problems and it didn't seem to work, so I assumed every experiment you carried out was binary classification. Please correct me if I have the wrong take on this.

I would also really appreciate it if you could share the preprocessing steps for the data, or the code to carry out the experiments. If they cannot be shared, my apologies for over-asking.

Thanks

amparore commented 1 year ago

For drug consumption, we decided to use one of the various targets (caffeine) as the target for binary classification, since drug consumption is not a pure classification dataset. To work with regression, leaf will probably require some changes in the code. I am not sure if I still have the code that was not uploaded, as I have changed jobs and no longer have access to the machine where the full runs were made. In principle, the preprocessing was similar to that of heartrisk, but some datasets required a few adjustments to be usable for binary classification. The paper briefly summarizes the choices we made for the datasets and the classifiers used in the tests: https://peerj.com/articles/cs-479/
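
For illustration, a minimal sketch of how such a binarization could look, assuming the UCI drug consumption CSV with a `Caff` column holding the consumption classes CL0..CL6 (the column name and the CL0/CL1 vs CL2+ split are illustrative assumptions, not necessarily the exact choices made for the paper):

    import pandas as pd

    # Illustrative binarization of the caffeine target (assumed column 'Caff',
    # classes CL0..CL6); the split used in the paper may differ.
    df = pd.read_csv('drug_consumption.csv')
    y = (~df['Caff'].isin(['CL0', 'CL1'])).astype(int)  # 1 = user, 0 = non-user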

singsinghai commented 1 year ago

Thanks for the information. I also want to ask about the approximate running time for each dataset, because I have tried to reproduce your experiment on the "arrhythmia" dataset and it is very time consuming (more than 6 hours), whereas breast-cancer and heartrisk only take about 40 minutes. Maybe the large number of features makes the explanation process take more time (due to the forward feature selection method when k=4). Did you consider any settings to speed up the process?

amparore commented 1 year ago

Yes, it is very time consuming, particularly when computing all the explanations on large datasets. I was running these experiments on a good server machine. I did not consider speeding up the process, as I was interested in testing LIME/SHAP as they are supposed to be used. A setup to speed up the process could be based on simplifying the dataset, e.g. by first applying a feature reduction method.
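
As a rough sketch of that speed-up idea (not code from the paper; the variable names and k=30 are placeholder assumptions), one could reduce the feature space before training and explaining:

    from sklearn.feature_selection import SelectKBest, f_classif

    # Illustrative feature reduction before running the explainers;
    # k=30 and X_train/y_train/X_test are placeholders.
    selector = SelectKBest(f_classif, k=30).fit(X_train, y_train)
    X_train_red = selector.transform(X_train)
    X_test_red = selector.transform(X_test)
    # then train the black-box and run LIME/SHAP/LEAF on the reduced features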

singsinghai commented 1 year ago

Thanks Amparore. I have tried to replicate your method for testing the stability of different dataset/model combinations with LIME, and this is my result:

[image]

It looks like there is a huge difference compared to the result you obtained with the same method:

[image]

Do you have any comment on what might lead to the difference in the results? Is it normal that I get a result like this? Update with more info on the processing steps: I fill NAs with the mean and use the whole dataset to train the model, then explain 100 instances, taking one instance every len(dataset)/100 rows.

amparore commented 1 year ago

Is your image about a single selected sample, or about all the samples of the dataset? If you are using the methods of leaf, you will get the result for a single sample. The image in the paper is over the whole dataset (it shows the distributions of the reiteration similarity, where each dot is the score for an individual sample).
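
In other words, one score per explained instance, aggregated into a distribution, roughly like this sketch (`per_sample_stability` is a hypothetical helper standing in for LEAF's per-instance stability computation; `X_test` is a placeholder):

    import matplotlib.pyplot as plt

    # Hypothetical sketch: one reiteration-similarity score per instance,
    # then the distribution over the whole dataset, as in the paper's figure.
    scores = [per_sample_stability(x) for x in X_test]  # per_sample_stability is assumed
    plt.boxplot(scores)
    plt.ylabel('reiteration similarity (stability)')
    plt.show()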

singsinghai commented 1 year ago

I tried LEAF on 100 different instances of each dataset; for each instance I run 50 explanations and compute one Jaccard score over the selected features. The boxplots are the distributions of the 100 Jaccard scores from the 100 LEAF calls.

singsinghai commented 1 year ago

I think the breast cancer case is due to poor black-box model performance; I will review the data for a more appropriate preprocessing step. I'm not sure about the other two cases.

amparore commented 1 year ago

The methodology should be the same as for the plot, so I wonder if the problem is in the trained models. Can you also try with the linear model (which should be the one giving the fewest surprises)?
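
A quick sanity check along those lines could be as simple as the following, e.g. with scikit-learn's RidgeClassifier (the variable names are assumptions from the discussion, not code from the repository):

    from sklearn.linear_model import RidgeClassifier

    # Swap the black-box for a plain linear model as a sanity check;
    # X_train/y_train/X_test/y_test are placeholders.
    linear_bb = RidgeClassifier().fit(X_train, y_train)
    print('test accuracy:', linear_bb.score(X_test, y_test))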

singsinghai commented 1 year ago

Yes, I will try a RidgeClassifier tomorrow to see the result. I was thinking that logistic regression is a better classifier, so I didn't include the linear model.

singsinghai commented 1 year ago

Hi @amparore, I tried to get the linear model result with SGDClassifier (as described, it is a simple linear model with SGD optimization). I also reduced the number of loops to test this. The stability for the linear model is surprisingly low, on the other hand. Update: I realized that I had left the ID column in breast-cancer, which is why the stability was so bad.

[image]

I'm not sure if my stability calculation is correct, but it should be, since stability decreases as the number of features K increases.

        # inside the loop over the repeated explanations of the same instance:
        exp = explainer.explain_instance(**params)
        res.append(np.array(exp.local_exp[1])[:, 0])  # indices of the selected features

    # mean pairwise (1 - Jaccard distance) across the repeated runs
    # (pdist is scipy.spatial.distance.pdist)
    return (1 - pdist(np.stack(res, axis=0), 'jaccard')).mean()

amparore commented 1 year ago

I am not sure it is correct. The definition we used for reiteration similarity is that you look at which features end up in the top K, not at their weights. So I would stick to that definition (mostly because it is much more difficult and fuzzy to also incorporate the weights). I would stick to the get_lime_stability() method to be sure that the computation is on the feature indices, not on the LIME weights.

Is this the problem?

Edit: reiteration similarity is called stability in the code, due to a late renaming in the paper.
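
For concreteness, a tiny worked example of that definition, computing the Jaccard similarity over the feature *indices* selected in two reruns (the indices and the feature count are made up for illustration):

    import numpy as np
    from scipy.spatial.distance import pdist

    # Two reruns with K=4 encoded as boolean membership masks over 30 features.
    runs = np.zeros((2, 30), dtype=bool)
    runs[0, [3, 7, 12, 20]] = True   # features selected in rerun 1
    runs[1, [3, 7, 12, 25]] = True   # features selected in rerun 2
    print(1 - pdist(runs, 'jaccard'))  # [0.6] = |{3,7,12}| / |{3,7,12,20,25}|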

singsinghai commented 1 year ago

[image]

np.array(exp.local_exp[1])[:, 0] actually contains the indices of the chosen features, not the weights, so I think my stability calculation is correct. I didn't use get_lime_stability from leaf since it would have to rerun the process with extra steps for the other metrics; I focus on stability only in this experiment.

amparore commented 1 year ago

Can you use get_lime_stability() just to check whether it is consistent with what you are computing?

singsinghai commented 1 year ago

Hi @amparore, Fortunately, applying LEAF to compute stability returns results much closer to those of your methodology. I think the problem was on my end. Thanks for suggesting that I retry your method to test the experiment.

[image]