fairlearn / fairlearn-proposals

Proposal Documents for Fairlearn

Revised metrics proposal #15

Open riedgar-ms opened 4 years ago

riedgar-ms commented 4 years ago

Create a fresh proposal for metrics, based on discussions in meetings, and focusing on 'what does the user type?' rather than the implementation.

riedgar-ms commented 3 years ago

Meeting 2020-09-08

Thanks to all for attending today. The main agreement is that we should think about a 'data exploration' interface (provided by the object) and a 'scorer' interface (provided by wrapper functions for the objects).

More minor points:

Next Steps

I will take @MiroDudik 's list of scenarios from above, and start to rough out how things might look in each API:

  1. Report one disaggregated metric in a data frame.
  2. Report several disaggregated metrics in a data frame.
  3. Report several performance and fairness metrics of several models in a data frame.
  4. (In abeyance) Report several performance and fairness metrics as well as some disaggregated metrics of several models in a data frame.
  5. Create a fairness-performance raster plot of several models.
  6. Run sklearn.model_selection.cross_validate with a demographic parity (as a fairness metric) and balanced error rate (as the performance metric). We should perhaps review this in the context of Adrin's PR in sklearn?
  7. Run GridSearchCV with demographic parity and balanced error, where the goal is to find the lowest-error model whose demographic parity is <= 0.05.

One thing which had been confusing me before: these are not all expected to be one-liners, and some may well be more complex. The idea is to identify which pieces are relevant to the metrics API, and whether we like how each of them looks.
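To make scenario 1 concrete (this is just plain pandas and scikit-learn, not a proposed Fairlearn API), 'one disaggregated metric in a data frame' might look something like:

import pandas as pd
import sklearn.metrics as skm

# Toy data: binary classification with one sensitive feature
df = pd.DataFrame({
    'y_true': [0, 1, 1, 0, 1, 1, 0, 1],
    'y_pred': [0, 1, 0, 0, 1, 1, 1, 1],
    'sf':     ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'A'],
})

# Scenario 1: a single metric, disaggregated by the sensitive feature
disagg = (df.groupby('sf')
            .apply(lambda g: skm.recall_score(g['y_true'], g['y_pred']))
            .to_frame(name='recall_score'))
print(disagg)

The question for the API is how much of that boilerplate a GroupedMetric-style object should absorb, particularly once several metrics and several sensitive features are involved.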

riedgar-ms commented 3 years ago

For future reference, some code to use MultiIndex for columns and rows:

import numpy as np
import pandas as pd

incomes = ['low income', 'high income']
sf1 = ['A', 'B', 'C']
sf2 = ['xs', 'ys']
metrics = ['precision_score', 'recall_score']

# Metrics vary over the columns, sensitive features over the rows
columns = pd.MultiIndex.from_product([metrics, incomes], names=['Metrics', 'Income Level'])
rows = pd.MultiIndex.from_product([sf1, sf2], names=["SF 1", "SF 2"])

all_metrics = pd.DataFrame(data=np.random.random([len(rows), len(columns)]),
                           index=rows,
                           columns=columns,
                           dtype=float)   # np.float is deprecated; plain float works
print(all_metrics)

which will give output along these lines (the values are random):

Metrics      precision_score             recall_score            
Income Level      low income high income   low income high income
SF 1 SF 2                                                        
A    xs             0.535197    0.928244     0.280099    0.002208
     ys             0.757752    0.217754     0.946087    0.576684
B    xs             0.326393    0.938239     0.551234    0.800435
     ys             0.768839    0.041547     0.670214    0.362275
C    xs             0.098495    0.761231     0.917003    0.275518
     ys             0.571958    0.135393     0.741593    0.179976

Indexing into the DataFrame is a little cumbersome:

# A single .loc call ensures the assignment modifies the original frame (not a copy)
all_metrics.loc[('A', 'xs'), ('precision_score', 'low income')] = 0.9
print(all_metrics)

which has output:

Metrics      precision_score             recall_score            
Income Level      low income high income   low income high income
SF 1 SF 2                                                        
A    xs             0.900000    0.928244     0.280099    0.002208
     ys             0.757752    0.217754     0.946087    0.576684
B    xs             0.326393    0.938239     0.551234    0.800435
     ys             0.768839    0.041547     0.670214    0.362275
C    xs             0.098495    0.761231     0.917003    0.275518
     ys             0.571958    0.135393     0.741593    0.179976

One can also get the maximum via all_metrics.max(), which gives the result:

Metrics          Income Level
precision_score  low income      0.900000
                 high income     0.938239
recall_score     low income      0.946087
                 high income     0.800435
dtype: float64

Apart from the apparent transposition of the DataFrame (the result is a Series indexed by the column MultiIndex), this does appear to have behaved as expected.
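If the orientation matters, the Series that .max() returns can be reshaped back with plain pandas, for example:

# all_metrics.max() is a Series indexed by the column MultiIndex;
# unstacking the 'Income Level' level gives a small DataFrame again
print(all_metrics.max().unstack(level='Income Level'))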

@MiroDudik @hildeweerts ... is this a more useful output format for you? I think it's what we discussed in the meeting.

riedgar-ms commented 3 years ago

Meeting 2020-09-14

Thank you @hildeweerts @adrinjalali @MiroDudik and @amueller for our meeting today.

We agreed that:

It also sounded like we should keep the current set of functions built with make_derived_metric() such as accuracy_score_group_min().

According to @adrinjalali, the PR for SLEP006 Option 4 will probably land in the scikit-learn release next April. We should aim to be compatible with that in our scorer objects.
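For reference, the kind of scorer we can build today has to close over the sensitive features, which is exactly what SLEP006-style routing should eventually replace. A rough sketch (demographic_parity_difference is the existing Fairlearn function; the factory itself is hypothetical):

from fairlearn.metrics import demographic_parity_difference

def make_dp_scorer(sensitive_features):
    # Hypothetical factory: the sensitive features are captured up front
    def dp_scorer(estimator, X, y):
        y_pred = estimator.predict(X)
        # Negated because scikit-learn scorers treat larger as better
        return -demographic_parity_difference(
            y, y_pred, sensitive_features=sensitive_features)
    return dp_scorer

# NB: this only works while X and sensitive_features stay aligned, which breaks
# down under cross-validation splits - the problem SLEP006 routing addresses.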

Next Steps

I will update this proposal based on the above points. I will also start a rough implementation of GroupedMetric so that we have actual code to play with.

riedgar-ms commented 3 years ago

Meeting 2020-09-21

Thanks to @adrinjalali @hildeweerts @amueller and @MiroDudik for our meeting today.

Points agreed:

Still TBD:

Yet to be discussed:

adrinjalali commented 3 years ago

I was under the impression we're okay with restricting the metrics' signature and asking users to use partial instead.

riedgar-ms commented 3 years ago

Meeting 2020-09-18

Thanks to @adrinjalali @hildeweerts and @MiroDudik for our meeting today.

We started by discussing the .overall property. Currently, when there are no conditional features, this is a DataFrame with a single row with a special name of 'overall':

>>> target.overall
        recall_score precision_score
overall     0.733333        0.717391

The motivation behind this decision was to ensure a consistent return type. However, it comes at the cost of introducing a 'magic' row name of 'overall'. It also causes trouble with operations such as target.by_group - target.overall, which produces results like:

        recall_score precision_score
overall          NaN             NaN
pp               NaN             NaN
q                NaN             NaN

The intended operation in this case is actually:

target.by_group - target.overall.iloc[0]
     recall_score precision_score
SF 0                             
pp      0.0057971     -0.00905797
q     -0.00606061      0.00988142

Continuing with the .difference() method, there are a number of behind-the-scenes machinations which are perhaps not necessary:

target.difference(method='to_overall')
        recall_score precision_score
overall   0.00606061      0.00988142
(target.by_group - target.overall.iloc[0]).abs().max()
recall_score       0.006061
precision_score    0.009881
dtype: float64

These are not a problem when there are conditional features, since those guarantee matching portions of the DataFrame index. Together, these points suggest that when there are no conditional features, the return type of .overall should be a Series, with entries corresponding to the metric names. @MiroDudik is going to experiment some more, to verify that the operations data scientists perform regularly remain simple.
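As a quick pandas illustration of why the Series return type helps (toy numbers, not real Fairlearn output): subtracting a Series from a DataFrame broadcasts across the rows, so both the 'magic' row name and the .iloc[0] workaround disappear:

import pandas as pd

by_group = pd.DataFrame(
    {'recall_score': [0.739, 0.727], 'precision_score': [0.708, 0.727]},
    index=pd.Index(['pp', 'q'], name='SF 0'))

# Proposed: .overall as a Series indexed by metric name
overall = pd.Series({'recall_score': 0.733, 'precision_score': 0.717})

# The Series aligns with the columns and broadcasts over the rows
print(by_group - overall)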

We agreed that for now, we should drop the params= argument from the GroupedMetric constructor, and require users to use functools.partial if they need to set other parameters for the metrics (such as the required beta= for fbeta_score()).

When specifying multiple functions in the GroupedMetric constructor, the current pattern is:

fns = [skm.recall_score, skm.precision_score, skm.fbeta_score]
sample_params = [{'sample_weight': wgts},
                 {'sample_weight': wgts},
                 {'sample_weight': wgts}]

result = metrics.GroupedMetric(fns, Y_test, Y_pred,
                               sensitive_features=...,
                               conditional_features=...,
                               sample_params=sample_params)

We agreed that using dictionaries instead of arrays would make more sense. Something like:

fns = {
    'recall': skm.recall_score,
    'prec': skm.precision_score,
    'fbeta_0.5': functools.partial(skm.fbeta_score, beta=0.5) }
sample_params = {
   'recall' : {'sample_weight': wgts},
   'prec': {'sample_weight': wgts},
   'fbeta_0.5': {'sample_weight': wgts}
}

There is one question as to whether sample_params= should be a dictionary-of-dictionaries, or use double underscores (i.e. sample_params={ 'recall__sample_weight': wgts, 'prec__sample_weight': wgts, ... }) as happens in scikit-learn. There was a preference for the dictionary-of-dictionaries, though. We agreed that a special 'broadcast' key (to avoid having to repeat sample_weight multiple times) was probably not worth the extra complications and pitfalls.
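For comparison, the double-underscore form would need a little parsing to recover the per-metric structure; a hypothetical helper (not part of the proposal) shows what that would involve:

from collections import defaultdict

def nest_dunder_params(flat_params):
    # Turn {'recall__sample_weight': wgts} into {'recall': {'sample_weight': wgts}}
    nested = defaultdict(dict)
    for key, value in flat_params.items():
        metric_name, _, param_name = key.partition('__')
        nested[metric_name][param_name] = value
    return dict(nested)

Since the end result is the dictionary-of-dictionaries anyway, accepting that form directly seems the simpler option.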

Now that the API is pretty much finalised, I will open a new issue to discuss the naming.

riedgar-ms commented 3 years ago

Meeting 2020-10-06

Thanks to @adrinjalali @hildeweerts and @MiroDudik for our discussion today.

We had a long talk about make_derived_metric() and how it would be used. @MiroDudik has shared a hackmd with his concerns and a proposal, and we need to iterate on that. We did decide that the callable object returned from make_derived_metric() should support all the arguments of the base metric (even if we still require functools.partial() for GroupedMetric itself). We therefore need to tell make_derived_metric() which arguments need to be sliced up and which to just pass through. Since there are generally more of the latter, I will restore sample_param_names= to make_derived_metric().
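To illustrate the 'sliced versus passed through' split, here is a deliberately simplified sketch of what the callable returned by make_derived_metric() might do with sample_param_names= (this is not the real implementation, and the aggregation is reduced to a plain callable):

import numpy as np

def make_derived_metric_sketch(metric, aggregate, sample_param_names=('sample_weight',)):
    # Simplified sketch: slice the named sample parameters per group,
    # pass every other keyword argument through unchanged
    def derived(y_true, y_pred, *, sensitive_features, **kwargs):
        y_true = np.asarray(y_true)
        y_pred = np.asarray(y_pred)
        sensitive_features = np.asarray(sensitive_features)
        sliced_names = [n for n in sample_param_names if n in kwargs]
        per_group = []
        for group in np.unique(sensitive_features):
            mask = sensitive_features == group
            group_kwargs = dict(kwargs)
            for name in sliced_names:
                group_kwargs[name] = np.asarray(kwargs[name])[mask]
            per_group.append(metric(y_true[mask], y_pred[mask], **group_kwargs))
        return aggregate(per_group)
    return derived

# e.g. a 'group_min' style wrapper:
# accuracy_group_min = make_derived_metric_sketch(skm.accuracy_score, min)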

One question which did arise: how to specify the method= argument for the aggregation approaches supported by make_derived_metric(). Should it be specified in the arguments of make_derived_metric(), or in the returned callable? The consensus was leaning towards the latter, but I would highlight that this 'reserves' another argument name which then can't be used by the base metric.

Related to this, although not essential for the API design: do we provide all the accuracy_score_group_min() wrappers for users, or expect them to create their own (one line for each wrapper they want)?

We agreed to keep 'empty' combinations of features in the result tables as NaNs; users can easily drop those rows if they want. In future, we might add an optional argument to drop them when the objects are created.

When it comes to naming, we still have issue #17. I have the impression that no-one particularly likes GroupedMetric as a name, but there's a lack of better alternatives.

Next steps:

riedgar-ms commented 3 years ago

Meeting 2020-10-12

Thanks to @MiroDudik and @adrinjalali for our meeting today. We agreed on the following:

We also discussed make_derived_metric(), but have come to realise that it is a lot thornier than we'd appreciated. @adrinjalali highlighted some potential future problems we could be laying up for ourselves when it comes to passing arguments around. The simplest example is the method= argument on the returned callable, and how it can conflict with arguments to the base metric (it gets more complicated than that). This is an interaction point with the work @adrinjalali is currently doing in scikit-learn around routing arguments through pipelines, so we need to ensure that we're compatible and consistent with that.

Next steps

Personally, I would even suggest that make_derived_metric() does not need to be in the v0.5.0 release. Yes, we'd lose some functionality compared to now, but we're almost certain to break existing code anyway. So long as the MetricFrame is present, users can implement their own versions of make_derived_metric() which work in their particular cases (while we work on the general case). We can even provide a sample in the docs, with a note that there are pitfalls.
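For example, a doc sample along these lines (using the MetricFrame constructor as sketched elsewhere in this thread; the wrapper name is made up) would cover the common case:

import sklearn.metrics as skm
from fairlearn.metrics import MetricFrame

def accuracy_group_min(y_true, y_pred, *, sensitive_features, sample_weight=None):
    # User-written stand-in for accuracy_score_group_min(); sidesteps the
    # general argument-routing questions discussed above
    sample_params = None if sample_weight is None else {'sample_weight': sample_weight}
    mf = MetricFrame(skm.accuracy_score, y_true, y_pred,
                     sensitive_features=sensitive_features,
                     sample_params=sample_params)
    return mf.group_min()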

riedgar-ms commented 3 years ago

Meeting 2020-10-19

Thanks to @adrinjalali , @hildeweerts and @MiroDudik for our meeting today. We were reviewing the current active PR implementing the new metrics API.

For the PR itself the following issues were raised:

@hildeweerts raised other issues with the documentation format, pointing out that the entire fairlearn.metrics API page is a bit of a mess (for example, it would be better to group properties and methods separately within a class). @hildeweerts pointed to the scikit-learn documentation as a possible model. However, we agreed that these criticisms, while absolutely important, are broader than just this PR. In the current PR, we will just make sure that the text which is there is correct.

@MiroDudik and @riedgar-ms will work to resolve the issues raised above, and get the PR merged. @adrinjalali and @hildeweerts will also review, but the merge will not be delayed for that (we can make further tweaks next week).

riedgar-ms commented 3 years ago

Meeting 2020-10-26

Thanks to @hildeweerts @adrinjalali and @MiroDudik for another productive meeting. We discussed the following:

What to do with control_levels when there are none

If the user wants to do their own manipulations of the by_group DataFrame, they are going to be calling Pandas methods such as max(level=lvls), groupby(level=lvls) and unstack(level=lvls), and they are likely to want to set lvls to the value of the control_levels property. Unfortunately, Pandas isn't consistent in how these methods behave when lvls is None or []. This raises the question of what control_levels should return if none were specified by the user.

We decided that control_levels should be None if none were passed. @adrinjalali pointed out that this is safer due to the mutability of lists, and also conceptually consistent with the user not specifying a control_levels argument to the constructor.
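In user code this just means one small guard, e.g. (mf stands for a hypothetical GroupedMetric/MetricFrame instance):

def max_over_sensitive_features(mf):
    # Maximum over the sensitive features, keeping any control features as rows
    if mf.control_levels is None:
        return mf.by_group.max()
    return mf.by_group.groupby(level=mf.control_levels).max()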

Whether to allow strings for level names

This is related to the previous point. The level= argument in Pandas can usually be a list of level names or positional indices, but Pandas allows names to be integers, giving a potential for ambiguity. We may revisit this in future, but for now we shall require our levels to be strings.
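A small pandas example of the ambiguity we are avoiding (illustrative only):

import pandas as pd

# One level is *named* with the integer 0, the other with a string
idx = pd.MultiIndex.from_product([['a', 'b'], ['x', 'y']], names=[0, 'SF 2'])
s = pd.Series([1, 2, 3, 4], index=idx)

# Does level=0 mean the level *named* 0 or the level at *position* 0?
# Here the two happen to coincide; with names=[1, 'SF 2'] they would not,
# which is exactly the ambiguity that requiring string names avoids.
print(s.groupby(level=0).sum())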

How to handle single callables

@MiroDudik pointed out that if a user makes a call such as:

acc = MetricFrame(skm.accuracy_score, y_true, y_pred, sensitive_features=A)

then having to write things like

acc.overall['accuracy_score']
acc.by_group['accuracy_score']['group_0']
acc.group_max()['accuracy_score']

is tedious for people messing around with notebooks. It would be good to drop the extra indirection. However, @riedgar-ms was concerned that if a more general library attempts to consume Fairlearn in a generic way, then any logic we put in to handle the single-function case would have to be undone by that higher-level library (introducing maintenance overhead and places for bugs to hide).

We agreed that if the user passes in a dictionary of callables, then we keep the same behaviour, even if there's only one entry in the dictionary. However, if the user passes in a bare callable, then DataFrames become Series, and Series become scalars, so that users don't have to keep typing [metric_name] to get at their results. @adrinjalali pointed out that this behaves like an overloaded function in languages which support such things.
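In other words, a sketch of the agreed behaviour (reusing y_true, y_pred and A from the snippet above; exact types are still to be confirmed in the implementation):

import sklearn.metrics as skm
from fairlearn.metrics import MetricFrame

# Dictionary with a single entry: the metric-name dimension is kept
acc_dict = MetricFrame({'accuracy': skm.accuracy_score}, y_true, y_pred,
                       sensitive_features=A)
acc_dict.overall        # Series with one entry, named 'accuracy'
acc_dict.group_min()    # Series with one entry

# Bare callable: one dimension fewer everywhere
acc_bare = MetricFrame(skm.accuracy_score, y_true, y_pred,
                       sensitive_features=A)
acc_bare.overall        # scalar
acc_bare.by_group       # Series indexed by the sensitive feature values
acc_bare.group_min()    # scalar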

The initial implementation of this is likely to be messy (at this point in time, just set a flag and process all output before it's returned to the user). However, @riedgar-ms will provide a full set of tests, so that the internals can be improved in the future.

riedgar-ms commented 3 years ago

Meeting 2020-11-02

Thank you to @adrinjalali @hildeweerts and @MiroDudik for another productive discussion.

We agreed that the best way forward is to release what we have as v0.5.0 as soon as possible. @hildeweerts and @adrinjalali will take a look from a user perspective, but we won't hold the release except for showstopper bugs.

We also discussed make_derived_metric() now that it has reappeared. We won't hold v0.5.0 for this; it can go out with v0.5.1 a few weeks later. In the call, we adjusted some arguments. @riedgar-ms is going to make named arguments required, and add documentation. Ways of getting rid of the manipulations of globals() will also be investigated, as will reflecting on the user-provided function to handle the method= argument.