fairlearn / fairlearn-proposals

Proposal Documents for Fairlearn

Revised metrics proposal #15

Open riedgar-ms opened 4 years ago

riedgar-ms commented 4 years ago

Create a fresh proposal for metrics, based on discussions in meetings, and focusing on 'what does the user type?' rather than the implementation.

riedgar-ms commented 3 years ago

Meeting 2020-09-08

Thanks to all for attending today. The main agreement is that we should think about a 'data exploration' interface (provided by the object) and a 'scorer' interface (provided by wrapper functions for the objects).

More minor points:

Next Steps

I will take @MiroDudik 's list of scenarios from above, and start to rough out how things might look in each API:

  1. Report one disaggregated metric in a data frame.
  2. Report several disaggregated metrics in a data frame.
  3. Report several performance and fairness metrics of several models in a data frame.
  4. (In abeyance) Report several performance and fairness metrics as well as some disaggregated metrics of several models in a data frame.
  5. Create a fairness-performance raster plot of several models.
  6. Run sklearn.model_selection.cross_validate with a demographic parity (as a fairness metric) and balanced error rate (as the performance metric). We should perhaps review this in the context of Adrin's PR in sklearn?
  7. Run GridSearchCV with demographic parity and balanced error, where the goal is to find the lowest-error model whose demographic parity is <= 0.05.

One thing which had been confusing me before: these are not all expected to be one-liners, and some may well be more complex. The idea is to identify which pieces are relevant to the metrics API, and whether we like how each of them looks.
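To make scenario 1 concrete (this is just plain pandas and scikit-learn, not a proposed Fairlearn API), 'one disaggregated metric in a data frame' might look something like:

import pandas as pd
import sklearn.metrics as skm

# Toy data: binary classification with one sensitive feature
df = pd.DataFrame({
    'y_true': [0, 1, 1, 0, 1, 1, 0, 1],
    'y_pred': [0, 1, 0, 0, 1, 1, 1, 1],
    'sf':     ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'A'],
})

# Scenario 1: a single metric, disaggregated by the sensitive feature
disagg = (df.groupby('sf')
            .apply(lambda g: skm.recall_score(g['y_true'], g['y_pred']))
            .to_frame(name='recall_score'))
print(disagg)

The question for the API is how much of that boilerplate a GroupedMetric-style object should absorb, particularly once several metrics and several sensitive features are involved.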

riedgar-ms commented 3 years ago

For future reference, some code to use MultiIndex for columns and rows:

import numpy as np
import pandas as pd

incomes = ['low income', 'high income']
sf1 = ['A', 'B', 'C']
sf2 = ['xs', 'ys']
metrics = ['precision_score', 'recall_score']

# Metrics vary over the columns, sensitive features over the rows
columns = pd.MultiIndex.from_product([metrics, incomes], names=['Metrics', 'Income Level'])
rows = pd.MultiIndex.from_product([sf1, sf2], names=["SF 1", "SF 2"])

all_metrics = pd.DataFrame(data=np.random.random([len(rows), len(columns)]),
                           index=rows,
                           columns=columns,
                           dtype=float)   # np.float is deprecated; plain float works
print(all_metrics)

which will give output along these lines (the values are random):

Metrics      precision_score             recall_score            
Income Level      low income high income   low income high income
SF 1 SF 2                                                        
A    xs             0.535197    0.928244     0.280099    0.002208
     ys             0.757752    0.217754     0.946087    0.576684
B    xs             0.326393    0.938239     0.551234    0.800435
     ys             0.768839    0.041547     0.670214    0.362275
C    xs             0.098495    0.761231     0.917003    0.275518
     ys             0.571958    0.135393     0.741593    0.179976

Indexing into the DataFrame is a little cumbersome:

# A single .loc call ensures the assignment modifies the original frame (not a copy)
all_metrics.loc[('A', 'xs'), ('precision_score', 'low income')] = 0.9
print(all_metrics)

which has output:

Metrics      precision_score             recall_score            
Income Level      low income high income   low income high income
SF 1 SF 2                                                        
A    xs             0.900000    0.928244     0.280099    0.002208
     ys             0.757752    0.217754     0.946087    0.576684
B    xs             0.326393    0.938239     0.551234    0.800435
     ys             0.768839    0.041547     0.670214    0.362275
C    xs             0.098495    0.761231     0.917003    0.275518
     ys             0.571958    0.135393     0.741593    0.179976

One can also get the maximum via all_metrics.max(), which gives the result:

Metrics          Income Level
precision_score  low income      0.900000
                 high income     0.938239
recall_score     low income      0.946087
                 high income     0.800435
dtype: float64

Apart from the apparent transposition of the DataFrame (the result is a Series indexed by the column MultiIndex), this does appear to have behaved as expected.
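If the orientation matters, the Series that .max() returns can be reshaped back with plain pandas, for example:

# all_metrics.max() is a Series indexed by the column MultiIndex;
# unstacking the 'Income Level' level gives a small DataFrame again
print(all_metrics.max().unstack(level='Income Level'))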

@MiroDudik @hildeweerts ... is this a more useful output format for you? I think it's what we discussed in the meeting.

riedgar-ms commented 3 years ago

Meeting 2020-09-14

Thank you @hildeweerts @adrinjalali @MiroDudik and @amueller for our meeting today.

We agreed that:

It also sounded like we should keep the current set of functions built with make_derived_metric() such as accuracy_score_group_min().

According to @adrinjalali, the PR for SLEP006 Option 4 will probably land in the scikit-learn release next April. We should aim to be compatible with that in our scorer objects.
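For reference, the kind of scorer we can build today has to close over the sensitive features, which is exactly what SLEP006-style routing should eventually replace. A rough sketch (demographic_parity_difference is the existing Fairlearn function; the factory itself is hypothetical):

from fairlearn.metrics import demographic_parity_difference

def make_dp_scorer(sensitive_features):
    # Hypothetical factory: the sensitive features are captured up front
    def dp_scorer(estimator, X, y):
        y_pred = estimator.predict(X)
        # Negated because scikit-learn scorers treat larger as better
        return -demographic_parity_difference(
            y, y_pred, sensitive_features=sensitive_features)
    return dp_scorer

# NB: this only works while X and sensitive_features stay aligned, which breaks
# down under cross-validation splits - the problem SLEP006 routing addresses.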

Next Steps

I will update this proposal based on the above points. I will also start a rough implementation of GroupedMetric so that we have actual code to play with.

riedgar-ms commented 3 years ago

Meeting 2020-09-21

Thanks to @adrinjalali @hildeweerts @amueller and @MiroDudik for our meeting today.

Points agreed:

Still TBD:

Yet to be discussed:

adrinjalali commented 3 years ago

I was under the impression we're okay with restricting the metrics' signature and asking users to use partial instead.

riedgar-ms commented 3 years ago

Meeting 2020-09-18

Thanks to @adrinjalali @hildeweerts and @MiroDudik for our meeting today.

We started by discussing the .overall property. Currently, when there are no conditional features, this is a DataFrame with a single row with a special name of 'overall':

>>> target.overall
        recall_score precision_score
overall     0.733333        0.717391

The motivation behind this decision was to ensure a consistent return type. However, it comes at the cost of introducing a 'magic' row name of 'overall'. It also causes trouble with operations such as target.by_group - target.overall, which produces results like:

        recall_score precision_score
overall          NaN             NaN
pp               NaN             NaN
q                NaN             NaN

The intended operation in this case is actually:

target.by_group - target.overall.iloc[0]
     recall_score precision_score
SF 0                             
pp      0.0057971     -0.00905797
q     -0.00606061      0.00988142

Continuing with the .difference() method, there are a number of behind-the-scenes machinations which are perhaps not necessary:

target.difference(method='to_overall')
        recall_score precision_score
overall   0.00606061      0.00988142
(target.by_group - target.overall.iloc[0]).abs().max()
recall_score       0.006061
precision_score    0.009881
dtype: float64

These are not a problem when there are conditional features, since those guarantee matching portions of the DataFrame index. Together, these points suggest that when there are no conditional features, the return type of .overall should be a Series, with entries corresponding to the metric names. @MiroDudik is going to experiment some more, to verify that the operations data scientists perform regularly remain simple.
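As a quick pandas illustration of why the Series return type helps (toy numbers, not real Fairlearn output): subtracting a Series from a DataFrame broadcasts across the rows, so both the 'magic' row name and the .iloc[0] workaround disappear:

import pandas as pd

by_group = pd.DataFrame(
    {'recall_score': [0.739, 0.727], 'precision_score': [0.708, 0.727]},
    index=pd.Index(['pp', 'q'], name='SF 0'))

# Proposed: .overall as a Series indexed by metric name
overall = pd.Series({'recall_score': 0.733, 'precision_score': 0.717})

# The Series aligns with the columns and broadcasts over the rows
print(by_group - overall)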

We agreed that for now, we should drop the params= argument from the GroupedMetric constructor, and require users to use functools.partial if they need to set other parameters for the metrics (such as the required beta= for fbeta_score()).

When specifying multiple functions in the GroupedMetric constructor, the current pattern is:

fns = [skm.recall_score, skm.precision_score, skm.fbeta_score]
sample_params = [{'sample_weight': wgts},
                 {'sample_weight': wgts},
                 {'sample_weight': wgts}]

result = metrics.GroupedMetric(fns, Y_test, Y_pred,
                               sensitive_features=...,
                               conditional_features=...,
                               sample_params=sample_params)

We agreed that using dictionaries instead of arrays would make more sense. Something like:

fns = {
    'recall': skm.recall_score,
    'prec': skm.precision_score,
    'fbeta_0.5': functools.partial(skm.fbeta_score, beta=0.5) }
sample_params = {
   'recall' : {'sample_weight': wgts},
   'prec': {'sample_weight': wgts},
   'fbeta_0.5': {'sample_weight': wgts}
}

There is one question as to whether sample_params= should be a dictionary-of-dictionaries, or use double underscores (i.e. sample_params={ 'recall__sample_weight': wgts, 'prec__sample_weight': wgts, ... }) as happens in scikit-learn. There was a preference for the dictionary-of-dictionaries, though. We agreed that a special 'broadcast' key (to avoid having to repeat sample_weight multiple times) was probably not worth the extra complications and pitfalls.
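For comparison, the double-underscore form would need a little parsing to recover the per-metric structure; a hypothetical helper (not part of the proposal) shows what that would involve:

from collections import defaultdict

def nest_dunder_params(flat_params):
    # Turn {'recall__sample_weight': wgts} into {'recall': {'sample_weight': wgts}}
    nested = defaultdict(dict)
    for key, value in flat_params.items():
        metric_name, _, param_name = key.partition('__')
        nested[metric_name][param_name] = value
    return dict(nested)

Since the end result is the dictionary-of-dictionaries anyway, accepting that form directly seems the simpler option.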

Now that the API is pretty much finalised, I will open a new issue to discuss the naming.

riedgar-ms commented 3 years ago

Meeting 2020-10-06

Thanks to @adrinjalali @hildeweerts and @MiroDudik for our discussion today.

We had a long talk about make_derived_metric() and how it would be used. @MiroDudik has shared a hackmd with his concerns and a proposal, and we need to iterate on that. We did decide that the callable object returned from make_derived_metric() should support all the arguments of the base metric (even if we still require functools.partial() for GroupedMetric itself). We therefore need to tell make_derived_metric() which arguments need to be sliced up and which to just pass through. Since there are generally more of the latter, I will restore sample_param_names= to make_derived_metric().
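To illustrate the 'sliced versus passed through' split, here is a deliberately simplified sketch of what the callable returned by make_derived_metric() might do with sample_param_names= (this is not the real implementation, and the aggregation is reduced to a plain callable):

import numpy as np

def make_derived_metric_sketch(metric, aggregate, sample_param_names=('sample_weight',)):
    # Simplified sketch: slice the named sample parameters per group,
    # pass every other keyword argument through unchanged
    def derived(y_true, y_pred, *, sensitive_features, **kwargs):
        y_true = np.asarray(y_true)
        y_pred = np.asarray(y_pred)
        sensitive_features = np.asarray(sensitive_features)
        sliced_names = [n for n in sample_param_names if n in kwargs]
        per_group = []
        for group in np.unique(sensitive_features):
            mask = sensitive_features == group
            group_kwargs = dict(kwargs)
            for name in sliced_names:
                group_kwargs[name] = np.asarray(kwargs[name])[mask]
            per_group.append(metric(y_true[mask], y_pred[mask], **group_kwargs))
        return aggregate(per_group)
    return derived

# e.g. a 'group_min' style wrapper:
# accuracy_group_min = make_derived_metric_sketch(skm.accuracy_score, min)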

One question which did arise: how to specify the method= argument for the aggregation approaches supported by make_derived_metric(). Should it be specified in the arguments of make_derived_metric(), or in the returned callable? The consensus was leaning towards the latter, but I would highlight that this 'reserves' another argument name which then can't be used by the base metric.

Related to this, although not essential for the API design: do we provide all the accuracy_score_group_min() wrappers for users, or expect them to create their own (one line for each wrapper they want)?

We agreed to keep 'empty' combinations of features in the result tables as NaNs; users can easily drop those rows if they want. In future, we might add an optional argument to drop them when the objects are created.

When it comes to naming, we still have issue #17. I have the impression that no-one particularly likes GroupedMetric as a name, but there's a lack of better alternatives.

Next steps:

riedgar-ms commented 3 years ago

Meeting 2020-10-12

Thanks to @MiroDudik and @adrinjalali for our meeting today. We agreed on the following:

We also discussed make_derived_metric(), but have come to realise that it is a lot thornier than we'd appreciated. @adrinjalali highlighted some potential future problems we could be laying up for ourselves when it comes to passing arguments around. The simplest example is the method= argument on the returned callable, and how it can conflict with arguments to the base metric (it gets more complicated than that). This is an interaction point with the work @adrinjalali is currently doing in scikit-learn around routing arguments through pipelines, so we need to ensure that we're compatible and consistent with that.

Next steps

Personally, I would even suggest that make_derived_metric() does not need to be in the v0.5.0 release. Yes, we'd lose some functionality compared to now, but we're almost certain to break existing code anyway. So long as the MetricFrame is present, users can implement their own versions of make_derived_metric() which work in their particular cases (while we work on the general case). We can even provide a sample in the docs, with a note that there are pitfalls.
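For example, a doc sample along these lines (using the MetricFrame constructor as sketched elsewhere in this thread; the wrapper name is made up) would cover the common case:

import sklearn.metrics as skm
from fairlearn.metrics import MetricFrame

def accuracy_group_min(y_true, y_pred, *, sensitive_features, sample_weight=None):
    # User-written stand-in for accuracy_score_group_min(); sidesteps the
    # general argument-routing questions discussed above
    sample_params = None if sample_weight is None else {'sample_weight': sample_weight}
    mf = MetricFrame(skm.accuracy_score, y_true, y_pred,
                     sensitive_features=sensitive_features,
                     sample_params=sample_params)
    return mf.group_min()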

riedgar-ms commented 3 years ago

Meeting 2020-10-19

Thanks to @adrinjalali , @hildeweerts and @MiroDudik for our meeting today. We were reviewing the current active PR implementing the new metrics API.

For the PR itself the following issues were raised:

@hildeweerts raised other issues with the documentation format, pointing out that the entire fairlearn.metrics API page is a bit of a mess (for example, it would be better to group properties and methods separately within a class). @hildeweerts pointed to the scikit-learn documentation as a possible model. However, we agreed that these criticisms, while absolutely important, are broader than just this PR. In the current PR, we will just make sure that the text which is there is correct.

@MiroDudik and @riedgar-ms will work to resolve the issues raised above, and get the PR merged. @adrinjalali and @hildeweerts will also review, but the merge will not be delayed for that (we can make further tweaks next week).

riedgar-ms commented 3 years ago

Meeting 2020-10-26

Thanks to @hildeweerts @adrinjalali and @MiroDudik for another productive meeting. We discussed the following:

What to do with control_levels when there are none

If the user wants to do their own manipulations of the by_group DataFrame, they are going to be calling Pandas methods such as max(level=lvls), groupby(level=lvls) and unstack(level=lvls), and they are likely to want to set lvls to the value of the control_levels property. Unfortunately, Pandas isn't consistent in how these methods behave when lvls is None or []. This raises the question of what control_levels should return if none were specified by the user.

We decided that control_levels should be None if none were passed. @adrinjalali pointed out that this is safer due to the mutability of lists, and also conceptually consistent with the user not specifying a control_levels argument to the constructor.
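In user code this just means one small guard, e.g. (mf stands for a hypothetical GroupedMetric/MetricFrame instance):

def max_over_sensitive_features(mf):
    # Maximum over the sensitive features, keeping any control features as rows
    if mf.control_levels is None:
        return mf.by_group.max()
    return mf.by_group.groupby(level=mf.control_levels).max()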

Whether to allow strings for level names

This is related to the previous point. The level= argument in Pandas can usually be a list of level names or positional indices, but Pandas allows names to be integers, giving a potential for ambiguity. We may revisit this in future, but for now we shall require our levels to be strings.
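A small pandas example of the ambiguity we are avoiding (illustrative only):

import pandas as pd

# One level is *named* with the integer 0, the other with a string
idx = pd.MultiIndex.from_product([['a', 'b'], ['x', 'y']], names=[0, 'SF 2'])
s = pd.Series([1, 2, 3, 4], index=idx)

# Does level=0 mean the level *named* 0 or the level at *position* 0?
# Here the two happen to coincide; with names=[1, 'SF 2'] they would not,
# which is exactly the ambiguity that requiring string names avoids.
print(s.groupby(level=0).sum())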

How to handle single callables

@MiroDudik pointed out that if a user makes a call such as:

acc = MetricFrame(skm.accuracy_score, y_true, y_pred, sensitive_features=A)

then having to write things like

acc.overall['accuracy_score']
acc.by_group['accuracy_score']['group_0']
acc.group_max()['accuracy_score']

is tedious for people messing around with notebooks. It would be good to drop the extra indirection. However, @riedgar-ms was concerned that if a more general library attempts to consume Fairlearn in a generic way, then any logic we put in to handle the single-function case would have to be undone by that higher-level library (introducing maintenance overhead and places for bugs to hide).

We agreed that if the user passes in a dictionary of callables, then we keep the same behaviour, even if there's only one entry in the dictionary. However, if the user passes in a bare callable, then DataFrames become Series, and Series become scalars, so that users don't have to keep typing [metric_name] to get at their results. @adrinjalali pointed out that this behaves like an overloaded function in languages which support such things.
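In other words, a sketch of the agreed behaviour (reusing y_true, y_pred and A from the snippet above; exact types are still to be confirmed in the implementation):

import sklearn.metrics as skm
from fairlearn.metrics import MetricFrame

# Dictionary with a single entry: the metric-name dimension is kept
acc_dict = MetricFrame({'accuracy': skm.accuracy_score}, y_true, y_pred,
                       sensitive_features=A)
acc_dict.overall        # Series with one entry, named 'accuracy'
acc_dict.group_min()    # Series with one entry

# Bare callable: one dimension fewer everywhere
acc_bare = MetricFrame(skm.accuracy_score, y_true, y_pred,
                       sensitive_features=A)
acc_bare.overall        # scalar
acc_bare.by_group       # Series indexed by the sensitive feature values
acc_bare.group_min()    # scalar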

The initial implementation of this is likely to be messy (at this point in time, just set a flag and process all output before it's returned to the user). However, @riedgar-ms will provide a full set of tests, so that the internals can be improved in the future.

riedgar-ms commented 3 years ago

Meeting 2020-11-02

Thank you to @adrinjalali @hildeweerts and @MiroDudik for another productive discussion.

We agreed that the best way forward is to release what we have as v0.5.0 as soon as possible. @hildeweerts and @adrinjalali will take a look from a user perspective, but we won't hold the release except for showstopper bugs.

We also discussed make_derived_metric() now that it has reappeared. We won't hold v0.5.0 for this; it can go out with v0.5.1 a few weeks later. In the call, we adjusted some arguments. @riedgar-ms is going to make named arguments required, and add documentation. Ways of getting rid of the manipulations of globals() will also be investigated, as will reflecting on the user-provided function to handle the method= argument.