Fairnsss Scoring Keywords & Max Tokens

danielz02 commented 11 months ago

Describe the bug (Initially discovered and reported by UT Austin's VITA group. ) For models with verbose outputs, the max_tokens=20 setting in the fairness perspective is too small. This leads to truncated prediction and incorrect result parsing. In addition, fairness scoring metrics are lacking certain keywords. This might also be a problem when the model is not instruction-following.

To Reproduce

 dt-run +model_config=hf ++model_config.model=hf/meta-llama/Llama-2-13b-chat-hf +fairness=few_shot_tr_br_1.0
 cat results/fairness/results/hf/meta-llama/Llama-2-13b-chat-hf/few_shot_tr_br_1.0.json | jq

Traceback (most recent call last):
  File "/home/danielz/PycharmProjects/DecodingTrust/src/dt/perspectives/fairness/score_calculation_script.py", line 193, in <module>
    aggregate_current_scores(fairness_args)
  File "/home/danielz/PycharmProjects/DecodingTrust/src/dt/perspectives/fairness/score_calculation_script.py", line 142, in aggregate_current_scores
    m_dpd_ = demographic_parity_difference(label, prediction, sensitive_features=sensitive_attr)
  File "/home/danielz/anaconda3/lib/python3.9/site-packages/fairlearn/metrics/_fairness_metrics.py", line 45, in demographic_parity_difference
    sel_rate = MetricFrame(
  File "/home/danielz/anaconda3/lib/python3.9/site-packages/fairlearn/metrics/_metric_frame.py", line 103, in compatible_metric_frame_init
    new_metric_frame_init(self, **metric_arg_dict, **positional_dict, **kwargs)
  File "/home/danielz/anaconda3/lib/python3.9/site-packages/fairlearn/metrics/_metric_frame.py", line 309, in __init__
    sf_list = self._process_features("sensitive_feature_", sensitive_features, y_t)
  File "/home/danielz/anaconda3/lib/python3.9/site-packages/fairlearn/metrics/_metric_frame.py", line 856, in _process_features
    raise ValueError(_TOO_MANY_FEATURE_DIMS)
ValueError: Feature array has too many dimensions

Example Outputs

Based on the attributes you provided, here is a prediction of whether the person's annual income
Based on the attributes you provided, here is a prediction of the person's annual income:
Based on the attributes you provided, it is likely that the person's annual income is less
Check https://gist.github.com/danielz02/48f5672e1047dd694a66a9db1a0bb257 for more outputs.

Expected behavior

The users should be able to change max tokens whenever needed.
The result parsing scripts should include most of the keywords. For example, less is not included as the keyword in fairness.
For results with a high rate of non-matching predictions, the parsing script should end with a warning or throw an error with a meaningful error message.

Proposed Fix

Temporally increase the max_tokens settings for Llama-2-13b-chat-hf
Add more keywords in src/dt/perspectives/fairness/score_calculation_script.py.
Double-check previous results to see whether we had a similar problem

Environment:

Conda Environment + Release Branch

danielz02 commented 11 months ago

I temporarily added an option for max_tokens in the fairness configuration (c9e51f93a7029bc5568d7ac0da0414dfd75b9883). In the future, we should further refactor this argument into GenerationConfig.

danielz02 commented 11 months ago

We also can't parse the results from the crime task properly from time to time. The examples below are from crime_br_0.0 with a similar error above.

41614e22f10df4f30e11156ecea1130

To reproduce, use:

 dt-run +model_config=hf ++model_config.model=hf/meta-llama/Llama-2-13b-chat-hf +fairness=crime_br_0.0

kangmintong commented 11 months ago

Added more key words detection for judgements
Deal the case with a small number of expected predictions in some scenarios which possibly induce problems of fairness score calculation

AI-secure / DecodingTrust

Fairnsss Scoring Keywords & Max Tokens #18