fairlearn / fairlearn-proposals

Proposal Documents for Fairlearn

Metrics - Alternative API #3

Closed: riedgar-ms closed this issue 4 years ago

riedgar-ms commented 4 years ago

We should consider changing the metrics API. Rather than having individual metrics be functions, we should consider creating a GroupedData object which would accept y_true, y_pred and group_membership as constructor arguments (and optionally sample_weights). Individual metrics would then become methods on that (still returning a GroupMetricResult object). Users would specify their own metrics via an apply_function(my_metric_function) method (where my_metric_function took arguments of y_true, y_pred and sample_weights).
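A minimal sketch of the pattern being described, purely for illustration (GroupedData, apply_function and the argument names follow the description above; the returned dict stands in for GroupMetricResult, and none of this is the actual fairlearn API):

import numpy as np

class GroupedData:
    """Illustrative sketch only, not a concrete implementation proposal."""

    def __init__(self, y_true, y_pred, group_membership, sample_weights=None):
        self.y_true = np.asarray(y_true)
        self.y_pred = np.asarray(y_pred)
        self.group_membership = np.asarray(group_membership)
        self.sample_weights = None if sample_weights is None else np.asarray(sample_weights)

    def apply_function(self, my_metric_function):
        # Evaluate a user-supplied metric(y_true, y_pred, sample_weights)
        # overall and within each group.
        def evaluate(mask):
            sw = None if self.sample_weights is None else self.sample_weights[mask]
            return my_metric_function(self.y_true[mask], self.y_pred[mask], sw)

        overall = evaluate(np.ones_like(self.y_true, dtype=bool))
        by_group = {g: evaluate(self.group_membership == g)
                    for g in np.unique(self.group_membership)}
        return {'overall': overall, 'by_group': by_group}  # stand-in for GroupMetricResult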

adrinjalali commented 4 years ago

Could you maybe post some pseudocode showing what you'd like the API to look like?

I'm not sure how that would integrate with sklearn's grid search, though.

MiroDudik commented 4 years ago

Interesting point. That's why we should discuss this.

What I was thinking about was a pattern similar to pandas.DataFrame.groupby(). Specifically something like:

# Initialize from a data frame df
gd = GroupedData(y_true=df['label'], y_pred=df['prediction'],
                 by=df['sex'], sample_weight=df['sample_weight'])

# Use one of the built-in metrics
f2_disparity_diff = gd.fbeta_score(beta=2).disparity_difference()
f2_disparity_ratio = gd.fbeta_score(beta=2).disparity_ratio()
f2_worst_case = gd.fbeta_score(beta=2).min()
f2_within_specific_group = gd.fbeta_score(beta=2)['male']
# or: f2_within_specific_group = gd.fbeta_score(beta=2).by_group['male']
f2_overall = gd.fbeta_score(beta=2).overall()

# User-supplied metric
custom_disparity_diff = gd.apply(custom_score).disparity_difference()

# An alternate pattern
f2_disparity_diff = gd.disparity_difference(fbeta_score, beta=2)
f2_worst_case = gd.min_metric(fbeta_score, beta=2)
f2_within_specific_group = gd.metric_by_group(fbeta_score, beta=2)['male']

# Or we can go back to a current pattern (renaming "group_metric" to "disparity_summary")
summary = disparity_summary(fbeta_score, beta=2,
                            y_true=df['label'], y_pred=df['prediction'],
                            by=df['sex'], sample_weight=df['sample_weight'])
f2_disparity_diff = summary.disparity_difference()
f2_worst_case = summary.min_metric()
f2_within_specific_group = summary.metric_by_group['male']

It would also be nice to enable easy evaluation of disparities across multiple metrics and/or multiple models.

MiroDudik commented 4 years ago

[Apologies for all the updates... I'm done!]

adrinjalali commented 4 years ago

I understand that from the perspective of designing a scoring API this may look good or make sense, but I find it harder to use than simple function calls. The learning curve for a new API is much steeper than for plain function signatures, and plain function calls can easily be used with sklearn. Ideally, fairlearn metrics should have a signature that can be passed to sklearn's make_scorer so they can be used there easily, IMHO.
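For concreteness, this is roughly the wiring being referred to (a sketch; accuracy_disparity is a made-up scalar metric, not an existing fairlearn function, and binding sensitive_features this way does not subset it per CV fold, which is the sample-props problem mentioned further down in the thread):

import numpy as np
from sklearn.metrics import accuracy_score, make_scorer

def accuracy_disparity(y_true, y_pred, sensitive_features=None):
    # Hypothetical scalar metric: gap between best and worst per-group accuracy.
    sf = np.asarray(sensitive_features)
    per_group = [accuracy_score(np.asarray(y_true)[sf == g], np.asarray(y_pred)[sf == g])
                 for g in np.unique(sf)]
    return max(per_group) - min(per_group)

# A plain (y_true, y_pred, **kwargs) signature drops straight into make_scorer;
# lower disparity is better, hence greater_is_better=False.
sensitive_features = np.array(['a', 'a', 'b', 'b'])
disparity_scorer = make_scorer(accuracy_disparity, greater_is_better=False,
                               sensitive_features=sensitive_features)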

Is there any of the above examples not possible with the function call API?

riedgar-ms commented 4 years ago

We discussed this yesterday.

We're planning on sticking with the current mechanism with metric_by_group() and make_group_metric(). However, there will be some tweaks and additions.

Firstly, update make_group_metric to accept kwargs and pass them on to the underlying metric function. This will get rid of _skm_wrappers.py (although we should leave the tests in place, to ensure that the argument passing is working).

Secondly, add functions make_disparity_metric() and make_disparity_ratio_metric(), which work similarly to make_group_metric() but only return the range or range_ratio property of the result.
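The first part is essentially kwargs forwarding; a toy stand-in to illustrate just that mechanism (the real make_group_metric also evaluates the metric per group, which is omitted here):

from functools import partial
from sklearn.metrics import fbeta_score

def make_group_metric(metric_function, **metric_kwargs):
    # Toy stand-in: only demonstrates passing factory kwargs to the underlying metric.
    bound = partial(metric_function, **metric_kwargs)
    def group_metric(y_true, y_pred, group_membership, sample_weight=None):
        # the real version would evaluate `bound` overall and per group
        return bound(y_true, y_pred, sample_weight=sample_weight)
    return group_metric

group_f2 = make_group_metric(fbeta_score, beta=2)  # beta=2 reaches fbeta_score at call time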

MiroDudik commented 4 years ago

Let me try to reconcile various downstream goals around metrics (and, related to that, plotting).

The downstream goals:

  1. metrics that return scalars: this is so they can be easily used by GridSearchCV etc.
  2. metrics that return simple structured objects (dictionaries / Bunch): this is for creating standard charts and exporting to any downstream consumer (dashboard etc.). It would also be desirable to make it easy to create tables like the ones I was using in the Jupyter notebooks.
  3. stand-alone plotting / charting functions that correspond to all the visualizations / charts supported in the dashboard
  4. functions that return the numerical objects (e.g., arrays representing curves etc.) that are being plotted in all of our visualizations (roughly corresponding to sklearn's roc_curve vs plot_roc_curve)

Let's hold off on 3 and 4 for the time being, and try to address 1 and 2, while keeping things simple.

We can keep the factory make_group_metric, extended with an optional field argument.

A metric created without field and one created with it can be invoked with the same signature, but the former would return a dict or Bunch, whereas the latter just returns a single field of that dict.

The addition of the field argument to the factory addresses issue 1. Replacing GroupMetricResult by a dict allows easy initialization of pandas data structures.
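A rough sketch of what such a factory could look like (field names, defaults and the implementation are guesses for illustration, not the actual proposal; the usage lines at the end are commented out):

import numpy as np
from sklearn.utils import Bunch

def make_group_metric(metric_function, field=None, **metric_kwargs):
    # Hypothetical sketch: field=None returns a Bunch, field='difference' etc. a scalar.
    def grouped_metric(y_true, y_pred, group_membership):
        y_t, y_p = np.asarray(y_true), np.asarray(y_pred)
        groups = np.asarray(group_membership)
        by_group = {g: metric_function(y_t[groups == g], y_p[groups == g], **metric_kwargs)
                    for g in np.unique(groups)}
        values = list(by_group.values())
        result = Bunch(overall=metric_function(y_t, y_p, **metric_kwargs),
                       by_group=by_group,
                       difference=max(values) - min(values),
                       group_min=min(values))
        return result if field is None else result[field]
    return grouped_metric

# Usage (illustrative; recall_score from sklearn.metrics, pd is pandas):
# group_recall = make_group_metric(recall_score)
# summary = group_recall(y_true, y_pred, group_membership)                 # Bunch
# recall_gap_metric = make_group_metric(recall_score, field='difference')  # scalar-returning metric
# df = pd.DataFrame({'recall': summary.by_group})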

Thoughts?

riedgar-ms commented 4 years ago

I'm not a huge fan of turning datatypes into dictionaries; I created the GroupMetricResult for precisely that reason. In a similar vein, I'm not keen on changing the return type of a function based on the values of the arguments.

Would those be considered more 'pythonic' ways of doing things?

adrinjalali commented 4 years ago

TBH I've heard contradictory things regarding what constitutes "pythonic" and what doesn't. I personally subscribe to these principles:

That's why, even though you may be right in all these arguments from a design perspective, I believe we should also consider how the decisions we make here affect the adoption of the lib. Every new object the user needs to know about is yet another step in the learning curve.

All that said, I really don't want to intrude on your design choices, since I'm still familiarizing myself with your codebase.

One thing to mention is the hack I talked about here for using sample props in scoring functions in a grid search.

riedgar-ms commented 4 years ago

I'm going to experiment with returning a Bunch instead of my own datatypes. We did agree that if we are to provide our own types, then we should implement __str__ and __repr__ as well.
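For illustration, the kind of addition agreed on for a custom type would be along these lines (a sketch of a hypothetical result type, not the actual GroupMetricResult):

class GroupMetricResult:
    # Hypothetical stand-in, used only to illustrate adding __str__/__repr__.
    def __init__(self, overall, by_group):
        self.overall = overall
        self.by_group = by_group

    def __repr__(self):
        return f"GroupMetricResult(overall={self.overall!r}, by_group={self.by_group!r})"

    __str__ = __repr__  # so print(group_metric_result) shows something useful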

riedgar-ms commented 4 years ago

PR to convert the GroupMetricResult to a Bunch: https://github.com/fairlearn/fairlearn/pull/287. There are issues with the notebooks, and the docstrings haven't been updated yet.

riedgar-ms commented 4 years ago

Any thoughts on this PR @MiroDudik @romanlutz ? If we go with this option, I'll see about removing the GroupMetricSet as well, and turning its to_dict() method into a plain subroutine (and adding support for multiple models and grouping vectors).

riedgar-ms commented 4 years ago

Bringing back the discussion on this from the comments on fairlearn/fairlearn#287

In response to further requests, I flattened out the dictionary, so that 'overall' is one key and the results for each group go into other keys at the same level. Since users might have a group named 'overall', these keys cannot be the group names themselves; instead they are group_{0} strings.
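For concreteness, a flattened result would then look roughly like this (illustrative values, with keys following the group_{0} convention just described):

flattened_result = {
    'overall': 0.81,   # metric over the whole dataset
    'group_0': 0.85,   # first group, e.g. 'male'
    'group_1': 0.76,   # second group, e.g. 'female'
}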

This change is going to break compatibility with the existing dashboard, requiring more changes there, and I really do not see what it gets us. I am not convinced that learning a set of dictionary keys (not to mention the convention for how we turn group names into keys) is so much easier than learning a new type.

The concern about discoverability is valid, and to enable print(group_metric_result) we should certainly implement __str__ and __repr__. However, ripping out the type entirely is going to generate a lot of work for benefits which remain obscure to me.

One type I would like to get rid of is the GroupMetricSet, which I don't think serves a useful purpose. It is better replaced by a single routine which could generate the dictionary required by the Dashboard's caching layer.

riedgar-ms commented 4 years ago

We just had a meeting where we discussed this. For now, fairlearn/fairlearn#287 is not necessary. We might consider it in the future, but there are some more immediate concerns:

MiroDudik commented 4 years ago

I have just thought of another pattern that might help us resolve these frictions. Let me illustrate with group_accuracy.

The idea is that group_accuracy is a callable object that supports the following kinds of access:

group_accuracy(y_true, y_pred, sensitive_features=A, sample_weight=sw, [other kwargs])

group_accuracy.disparity_difference(y_true, y_pred, sensitive_features=A, sample_weight=sw, [other kwargs])

We could use a similar pattern to return a data frame or dictionary. Or we could even pick a different return type for group_accuracy (say, a dict by default), and then use group_accuracy.as_obj to get the behavior that returns a structured object.
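A minimal sketch of such a callable object (illustrative only; sample_weight and other kwargs are omitted for brevity, and the plain call returns a dict standing in for whichever structured type we pick):

import numpy as np
from sklearn.metrics import accuracy_score

class GroupMetricCallable:
    # Illustrative: a callable metric that also exposes derived scalars as methods.
    def __init__(self, metric_function):
        self._metric = metric_function

    def _by_group(self, y_true, y_pred, sensitive_features):
        sf = np.asarray(sensitive_features)
        return {g: self._metric(np.asarray(y_true)[sf == g], np.asarray(y_pred)[sf == g])
                for g in np.unique(sf)}

    def __call__(self, y_true, y_pred, sensitive_features):
        return {'overall': self._metric(y_true, y_pred),
                'by_group': self._by_group(y_true, y_pred, sensitive_features)}

    def disparity_difference(self, y_true, y_pred, sensitive_features):
        values = self._by_group(y_true, y_pred, sensitive_features).values()
        return max(values) - min(values)

group_accuracy = GroupMetricCallable(accuracy_score)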

Thoughts?

MiroDudik commented 4 years ago

After some more thought, I think I'd prefer to go back to a more pared-down solution based on Bunch.

Let me know what you think!

# For most sklearn metrics, we would have their group version that returns a Bunch with fields
# * overall: overall metric value
# * by_group: a dictionary with a metric value for each sensitive feature

summary = group_accuracy_score(y_true, y_pred, sensitive_features=sf, **other_kwargs)

# Exporting into a data frame is not too crazy

df = pd.DataFrame({'accuracy': {**summary.by_group, 'overall': summary.overall}})

# Several types of scalar metrics for group fairness can be obtained from `summary` via utility functions

acc_disparity = disparity_from_summary(summary)
acc_disparity_ratio = disparity_ratio_from_summary(summary)
min_accuracy = groupmin_from_summary(summary)

# Most common disparity metrics should be predefined

demo_disparity = demographic_disparity(y_true, y_pred, sensitive_features=sf, **other_kwargs)
demo_ratio = demographic_disparity_ratio(y_true, y_pred, sensitive_features=sf, **other_kwargs)
eq_odds_disparity = eq_odds_disparity(y_true, y_pred, sensitive_features=sf, **other_kwargs)

# For predefined disparities based on sklearn metrics, we should adopt a consistent naming convention

acc_disparity = accuracy_score_disparity(y_true, y_pred, sensitive_features=sf, **other_kwargs)
acc_ratio = accuracy_score_disparity_ratio(y_true, y_pred, sensitive_features=sf, **other_kwargs)
min_accuracy = accuracy_score_groupmin(y_true, y_pred, sensitive_features=sf, **other_kwargs)

adrinjalali commented 4 years ago

Those function names could probably be thought through a bit more (not that I have a better idea ATM), but otherwise this looks very familiar to me :)

romanlutz commented 4 years ago

@MiroDudik should we close this and link it to the proposal? Otherwise we have multiple channels of communication for this.

MiroDudik commented 4 years ago

Will close once the PR is merged.