TeamHG-Memex / eli5

A library for debugging/inspecting machine learning classifiers and explaining their predictions
http://eli5.readthedocs.io
MIT License
2.75k stars 331 forks

LIME for multiclass/multilabel explanations? #337

Open Hellisotherpeople opened 5 years ago

Hellisotherpeople commented 5 years ago

I implemented a multilabel prediction algorithm for NLP text classification.

Basically: use MultiLabelBinarizer, a binary_crossentropy loss, and a final sigmoid activation. Since I'm not using a softmax on my output neurons, the model is allowed to predict any combination of classes.

This means that the probabilities in my "predict_proba" function do not sum to 1, and this causes issues with LIME.

Is there a way around this? Can LIME work for Multilabel classification models?
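To make the mismatch concrete, here is a minimal sketch (with made-up numbers) of why independent sigmoid outputs break LIME's assumption:

```python
import numpy as np

# Hypothetical sigmoid outputs for 2 documents over 3 labels: each value
# is an independent per-label probability, so rows need not sum to 1.
probas = np.array([[0.90, 0.75, 0.05],
                   [0.10, 0.60, 0.55]])
row_sums = probas.sum(axis=1)
print(row_sums)  # neither row sums to 1, which LIME expects of predict_proba
```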


I also tried a simpler multilabel version of my previous problem, where I guarantee the same number of labels for each instance. My predict_proba function gives a list of arrays, where each array holds the probability values for picking a particular label (and these do sum to 1). Can ELI5 handle this kind of data? Shouldn't it be easy to write a wrapper to handle this?

Hellisotherpeople commented 5 years ago

I AM A GOD (not really but this crazy idea of mine just worked!!!!!)

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class MultiLabelProbClassifier(BaseEstimator, ClassifierMixin):
    """Wrap a multilabel classifier so each predict_proba row sums to 1."""

    def __init__(self, clf):
        self.clf = clf

    def fit(self, X, y):
        self.clf.fit(X, y)
        return self  # scikit-learn convention: fit returns self

    def predict(self, X):
        return self.clf.predict(X)

    def predict_proba(self, X):
        # Sigmoid outputs are independent per label, so a row can sum to
        # more (or less) than 1. Renormalise each row so LIME sees a
        # proper probability distribution.
        probas = np.asarray(self.clf.predict_proba(X))
        return probas / probas.sum(axis=1, keepdims=True)

from sklearn.pipeline import Pipeline
from eli5.lime import TextExplainer
from eli5.formatters import format_as_text, format_as_html

the_model = MultiLabelProbClassifier(model)
pipe = Pipeline([('text2vec', Text2Vec()), ('model', the_model)])  # Text2Vec is my own transformer
pipe.fit(X_train, Y_train)

pred = pipe.predict(X_val)

te = TextExplainer(random_state=42, n_samples=300, position_dependent=True)

def explain_pred(sentence):
    te.fit(sentence, pipe.predict_proba)
    t_pred = te.explain_prediction()
    #t_pred = te.explain_prediction(top = 20, target_names=["ANB", "CAP", "ECON", "EDU", "ENV", "EX", "FED", "HEG", "NAT", "POL", "TOP", "ORI", "QER","COL","MIL", "ARMS", "THE", "INTHEG", "ABL", "FEM", "POST", "PHIL", "ANAR", "OTHR"])
    txt = format_as_text(t_pred)
    html = format_as_html(t_pred)
    with open("latest_prediction.html", "a") as html_file:  # append each explanation
        html_file.write(html)
    print(te.metrics_)

The basic idea is to take a set of probabilities that don't sum to 1 and force them to sum to 1 anyway. For example, if my probabilities sum to 1.985, I divide each item in the probability list by 1.985.

Now, ELI5 / TextExplainer / LIME give me explanations for each label, EVEN IN MULTILABEL output problems. All a user has to do is multiply the LIME predicted output by sum_probabilities to get back the real probabilities.
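The renormalise-then-undo round trip can be sketched like this (illustrative numbers only; the values and the name sum_probabilities are just for demonstration):

```python
import numpy as np

# Made-up sigmoid outputs for one document; they sum to 1.985, not 1.
probas = np.array([0.90, 0.75, 0.335])
sum_probabilities = probas.sum()

# Force the values to behave like a distribution before handing them to LIME...
lime_input = probas / sum_probabilities  # now sums to exactly 1

# ...then multiply LIME's per-label outputs back to recover real probabilities.
recovered = lime_input * sum_probabilities
```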

Maybe someone should add this as a tutorial or a PR into ELI5.

Hellisotherpeople commented 5 years ago

One thing left to do is to tell the user what the "sum_probabilities" number is somewhere within the HTML / text explain_pred output (or just do that multiplication for them afterwards).

deweihu96 commented 3 years ago

Hi @Hellisotherpeople, I have a very similar issue to yours. I have medical documents, each of which corresponds to several ICD codes (labels). There are more than 8000 ICD codes in total, and each document corresponds to about 5 to 10 of them, so I use multi-hot labels and the sigmoid function. How's it going with your project now? Could you give me some suggestions on my work? Thanks : )