MarcelRobeer / explabox

Explore/examine/explain/expose your model with the explabox!
https://explabox.readthedocs.io
GNU Lesser General Public License v3.0

filter_words with token_frequency method #11

Closed GS756 closed 1 year ago

GS756 commented 1 year ago

Summary of bug

When using box.explain.token_frequency, the filter_words option has no effect: even when a list of stop words is passed, those words still appear at the top of the ranking.

Environment information

Reproducing the bug

Steps to reproduce the behavior:

from explabox import Explabox
from explabox import import_data
from explabox import import_model

data = import_data({'train': df_train, 'test': df_test}, data_cols='text', label_cols='label', 
                   label_map=labels_dict)
model = import_model(pipe, data, label_map=labels_dict)

box = Explabox(data=data,
               model=model,
               splits={'train': 'train', 'test': 'test'})

import string
punctuations = string.punctuation

from spacy.lang.en.stop_words import STOP_WORDS

# Combine English stop words and punctuation characters into one filter list
filter_list = list(STOP_WORDS) + list(punctuations)

box.explain.token_frequency(splits='test', explain_model=False, labelwise=True, filter_words=filter_list)

Solutions Attempted

I tried several different word lists, but none of the listed words were ever filtered out.

MarcelRobeer commented 1 year ago

I have traced the issue to https://git.science.uu.nl/m.j.robeer/text_explainability/-/blob/main/text_explainability/global_explanation/__init__.py#L136 where

[(w, counts[v]) for w, v in cv.vocabulary_.items() if k not in filter_words]

should be

[(token, counts[v]) for token, v in cv.vocabulary_.items() if token not in filter_words]

I will fix it tonight, add proper tests in text_explainability==0.6.6, and ensure the Explabox requires the updated version.
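For anyone who wants to see the effect of the corrected condition in isolation, here is a minimal sketch. It is not the text_explainability code: the helper name token_counts, the whitespace tokenizer, and the plain dict standing in for a fitted vectorizer's vocabulary_ are all assumptions made for illustration. The only part taken from the comment above is the final comprehension, which tests the loop variable token against filter_words.

```python
from collections import Counter


def token_counts(texts, filter_words=()):
    """Count tokens across texts, skipping tokens listed in filter_words.

    Hypothetical helper for illustration only; not part of any library.
    """
    # Naive lowercase whitespace tokenizer (an assumption for this sketch)
    tokens = [tok for text in texts for tok in text.lower().split()]

    # Stand-in for a fitted vectorizer's vocabulary_: token -> column index
    vocabulary = {tok: i for i, tok in enumerate(sorted(set(tokens)))}

    # Per-column totals, indexed the same way as the vocabulary
    totals = Counter(tokens)
    counts = [0] * len(vocabulary)
    for tok, idx in vocabulary.items():
        counts[idx] = totals[tok]

    blocked = set(filter_words)  # a set gives constant-time membership tests

    # Corrected condition: compare the token itself with the filter list
    pairs = [
        (token, counts[v])
        for token, v in vocabulary.items()
        if token not in blocked
    ]
    # Most frequent first, ties broken alphabetically
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))


result = token_counts(
    ["the cat sat on the mat", "the dog sat"],
    filter_words=["the", "on"],
)
print(result)
```

With the corrected condition, "the" and "on" no longer appear in the ranking, which is the behavior the filter_words option is expected to have.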