dccuchile / wefe

WEFE: The Word Embeddings Fairness Evaluation Framework. WEFE is a framework that standardizes the bias measurement and mitigation in Word Embeddings models. Please feel welcome to open an issue in case you have any questions or a pull request if you want to contribute to the project!
https://wefe.readthedocs.io/
MIT License

WEAT returns nothing #21

Closed raffaem closed 3 years ago

raffaem commented 3 years ago

I use this code to compute a WEAT:

wefemodel = WordEmbeddingModel(wv, model_name)
query = Query(target_sets, attribute_sets, target_sets_names, attribute_sets_names)
result_weat = weat.run_query(
    query, wefemodel, calculate_p_value=True, return_effect_size=True
)

But sometimes the returned result_weat does not include a p_value key:

KeyError: 'p_value'

I think it depends on the model: for some of my models the returned dictionary does not include this key. Is that possible?

raffaem commented 3 years ago

Yeah, WEAT returns the following:

{'query_name':  [MY QUERY NAME], 'result': nan, 'weat': nan, 'effect_size': nan}
raffaem commented 3 years ago

Is it possible to know why it is not returning a result?

pbadillatorrealba commented 3 years ago

Hello

Based on what you describe (the query returns values with some models and not with others), I would guess that when the query word sets are transformed into embedding sets, at least one word set is losing more than 20% of its words. In that case, WEFE by default invalidates the query and makes it return None. This can happen because the model you are using does not have capitalized words, does not have accented words, or the words simply do not exist in its vocabulary.

The behavior of queries that are invalidated because too many words are missing is detailed in the warning in this subsection: https://wefe.readthedocs.io/en/latest/user_guide.html#word-preprocessors
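
If you want to control that cutoff yourself, run_query also accepts a lost_vocabulary_threshold parameter (0.2 by default, i.e. each set may lose up to 20% of its words). A minimal sketch, reusing your variables and the standard WEFE imports:

from wefe.metrics import WEAT
from wefe.query import Query
from wefe.word_embedding_model import WordEmbeddingModel

weat = WEAT()
wefemodel = WordEmbeddingModel(wv, model_name)
query = Query(target_sets, attribute_sets, target_sets_names, attribute_sets_names)

# Allow each set to lose up to 30% of its words before the query is invalidated.
result_weat = weat.run_query(
    query,
    wefemodel,
    calculate_p_value=True,
    lost_vocabulary_threshold=0.3,
    warn_not_found_words=True,
)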

You can use the parameter warn_not_found_words=True to see which words are being lost when converting the query to embeddings.

wefemodel = WordEmbeddingModel(wv, model_name)
query = Query(target_sets, attribute_sets, target_sets_names, attribute_sets_names)
result_weat = weat.run_query(
    query, wefemodel, calculate_p_value=True, warn_not_found_words=True,
)

A possible solution would be to use a word preprocessor (specified through the run_query parameters preprocessor_args or secondary_preprocessor_args).


wefemodel = WordEmbeddingModel(wv, model_name)
query = Query(target_sets, attribute_sets, target_sets_names, attribute_sets_names)
result_weat = weat.run_query(
    query,
    wefemodel,
    calculate_p_value=True,
    secondary_preprocessor_args={"lowercase": True, "strip_accents": True},
    warn_not_found_words=True,
)

In practical terms, this parameter tells run_query that, for each word of each set, it should first look for the word's original form in the model vocabulary and, if it is not found, preprocess the word (lowercase, accents stripped) and retry the search.
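
Roughly speaking, the lookup behaves like this sketch (illustrative Python only, not WEFE's actual code; strip_accents here is a simplified stand-in for the real preprocessing):

import unicodedata

def strip_accents(word):
    # Drop combining accent marks: 'árbol' -> 'arbol'
    return "".join(
        c for c in unicodedata.normalize("NFD", word)
        if unicodedata.category(c) != "Mn"
    )

def lookup(word, vocabulary):
    if word in vocabulary:                  # 1) try the original form
        return word
    fallback = strip_accents(word).lower()  # 2) then the preprocessed form
    return fallback if fallback in vocabulary else None

print(lookup("Árbol", {"arbol", "casa"}))   # -> 'arbol'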

Pablo.

raffaem commented 3 years ago

Hello,

Thank you for your support and your prompt and detailed answer.

I'm making sure that all the words of the word sets are present in the embedding before running the query. So I don't think that's the problem.
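
For context, a minimal sketch of such a check, assuming wv is a gensim KeyedVectors and target_sets / attribute_sets are lists of word lists:

# Collect every word from the query sets that the model does not know.
missing = [
    word
    for word_set in target_sets + attribute_sets
    for word in word_set
    if word not in wv
]
print("Words missing from the model vocabulary:", missing)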

Anyway, I think WEFE should throw an exception by default instead of returning nothing.

I will try again next week.

Thank you again