fairlearn / fairlearn-proposals

Proposal Documents for Fairlearn

What should the metrics API look like? #12

Open hildeweerts opened 4 years ago

hildeweerts commented 4 years ago

To get the conversation going on what the metrics API should look like and better explain my own view, I created a quick mock-up: metrics.txt. Please feel free to comment!

An issue @MiroDudik brought up is that this design does not provide a way to retrieve both "group" scores and "overall" scores, something that was possible with the previous Summary object. I am not sure what would be a nice way to incorporate this pattern, so if people have thoughts: please share!

Tagging @romanlutz @riedgar-ms @adrinjalali

romanlutz commented 4 years ago

Thanks for getting this started @hildeweerts !

I have some concerns about the naming but I'll focus on the functionality since we can always adjust the naming. I'll walk through an example based on your file to figure out whether I understand it right.

Let's say I care about accuracy. Then I can instantiate score_groups as

score_groups(y_true, y_pred, sensitive_features, metric='accuracy')

which would mean we need to map those metric strings to actual metric functions. Maybe it's easier to pass the function itself?

score_groups(y_true, y_pred, sensitive_features, metric=sklearn.metrics.accuracy_score)

Regardless of that detail, I'd get a Scores object back. I'm not 100% sure I understand what that one would do, but I'll try:

scores = ... # my Scores object
scores.disparity("difference")  # returns diff between min and max?
scores.difference()  # identical to disparity("difference")
scores.min()
scores.max()
scores.to_dict()  # would that be similar to what the group summary is right now? or would this just be the individual ones for all groups?
scores.to_dataframe()  # that's probably just the scores per group, which makes me think that to_dict is also just per group (?)

aggregate seems to be a helper function that can give you difference, ratio, max, min etc., right?

This reminds me very much of our existing group_summary by the way! That one looks like this:

group_summary(metric_function, y_true, y_pred, *, sensitive_features, indexed_params=None, **metric_params)

and there are individual ones per metric, e.g. accuracy_score_group_summary(y_true, y_pred,....). The only odd part there is that you need to pass this group_summary object to functions like difference_from_summary(...) to get the values out of there, whereas the scores.difference() way of writing it is more natural. Perhaps even scores.difference ?
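Side by side, the two styles under discussion would look something like this (the scores-based form is of course hypothetical at this point):

# current style
summary = accuracy_score_group_summary(y_true, y_pred, sensitive_features=sensitive_features)
acc_difference = difference_from_summary(summary)

# proposed style
scores = score_groups(y_true, y_pred, sensitive_features, metric='accuracy')
acc_difference = scores.difference()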

For each individual metric there's the <metric>_grouped way of using it.

false_positive_rate_grouped(y_true, y_pred, sensitive_features, disparity="difference", aggregation="max")

What I'm wondering is how that's different from

score_groups(y_true, y_pred, sensitive_features, metric='false_positive_rate').disparity('difference')

I remember that the complex object (here called Scores) is not necessarily what we want in every situation, so perhaps redundancy is intended (?).

hildeweerts commented 4 years ago

which would mean we need to map those metric strings to actual metric functions. Maybe it's easier to pass the function itself?

Perhaps both should be possible? I don't think we'll be adding a lot of "base" metrics (we should probably think of a good name for those) on a regular basis, so we can just use scikit-learn's predefined values in addition to others that are not directly accessible from scikit-learn (e.g. false positive rate). For first time users, I can imagine having strings to choose from is a bit less intimidating than having to find (or even define) the appropriate (scikit-learn) function yourself? But this is just speculation from my side :)
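Just to illustrate what I mean, the string-or-callable handling could be a small lookup table along these lines (the registry contents and the helper name are only illustrative):

import sklearn.metrics as skm

_BASE_METRICS = {
    'accuracy': skm.accuracy_score,
    'precision': skm.precision_score,
    'recall': skm.recall_score,
    # metrics not exposed by scikit-learn (e.g. false positive rate) would be added here
}

def _resolve_metric(metric):
    # accept either a callable or one of the predefined strings
    if callable(metric):
        return metric
    if metric in _BASE_METRICS:
        return _BASE_METRICS[metric]
    raise ValueError("Unknown metric: {}".format(metric))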

aggregate seems to be a helper function that can give you difference, ratio, max, min etc., right?

This is a good question - I think my naming is not super clear. I was actually thinking that there should be one type of function for getting the "disparity" between all groups (i.e. ratio or difference); i.e. scores.disparity("difference") would return all differences between groups. In the case of two sensitive groups, this is just a single number, but in the case of multiple sensitive groups, you still need to aggregate it somehow. At the moment, Fairlearn only allows for one possible aggregation, which is the difference between min and max. IMO, given that fairness can have so many different meanings in different contexts, it is important to allow for some flexibility in defining a fairness metric. However, I'm not sure whether having a separate argument/function does more harm than good - i.e. is the increase in flexibility worth the increase in complexity?
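A rough sketch of what I mean by the two steps, assuming the per-group scores have already been computed (names are just illustrative):

import itertools

def pairwise_disparities(group_scores, disparity=lambda a, b: a - b):
    # group_scores: dict mapping group name -> metric value
    return {(g1, g2): disparity(group_scores[g1], group_scores[g2])
            for g1, g2 in itertools.combinations(group_scores, 2)}

def aggregate(disparities, how=max):
    # the current Fairlearn behavior corresponds to the max of the absolute
    # pairwise differences, i.e. the gap between the min and max group scores
    return how(abs(d) for d in disparities.values())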

This reminds me very much of our existing group_summary by the way! [...] and there are individual ones per metric, e.g. accuracy_score_group_summary(y_true, y_pred,....). The only odd part there is that you need to pass this group_summary object to functions like difference_from_summary(...) to get the values out of there, whereas the scores.difference() way of writing it is more natural. Perhaps even scores.difference ?

From what you describe I think it's very similar indeed! I agree that writing something like scores.difference() is more natural (pandas style), especially because this pattern allows you to easily access all possible methods in most IDEs. If you use an attribute rather than a method, does that mean you have to precompute all of them? In that case, I'd be in favor of a method.

I remember that the complex object (here called Scores) is not necessarily what we want in every situation, so perhaps redundancy is intended (?).

Yes, this redundancy is intended! Mostly because you can use something like false_positive_rate_grouped() directly as a scorer function in things like scikit-learn's grid search.

riedgar-ms commented 4 years ago

I would also be in favour of sticking to just passing around functions, rather than allowing for 'string or callable' all over the place. This may reflect the fact that I'm far more used to typed languages, and I'm seeing a lot of tests being required to make sure the string->fn mappings are set up correctly. I agree that score_groups() looks rather like the existing group_summary(), minus the support for passing extra arguments down into the underlying metric.

Can you clarify what the result of calling scores.to_dict() or scores.to_dataframe() would look like?

Supporting different aggregation methods (beyond simple max-min) is something I'm sure would be useful. I think that the aggregation method might look something like:

class Scores:
    ...
    def aggregate(self, f):
        # f can be any callable that maps a Scores object to a scalar
        return f(self)

We could provide things such as difference() and ratio() for f(), and if users needed some other aggregation, they'd be able to write it themselves fairly easily.

A note on attributes in Python... they're actually methods, and we used to use them in the GroupMetricResult class (removed in favour of Bunches a few months ago): https://github.com/fairlearn/fairlearn/blob/29448633c776fbcb4f697d11c3354ef36b4cfd89/fairlearn/metrics/_group_metric_result.py#L45 In use, they look like fields, but they are actually method calls.
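As a tiny illustration of that point (an example only, not the removed class itself):

class GroupMetricResult:
    def __init__(self, by_group):
        self._by_group = by_group

    @property
    def difference(self):
        # computed on access: result.difference, no parentheses needed
        return max(self._by_group.values()) - min(self._by_group.values())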

hildeweerts commented 4 years ago

I would also be in favour of sticking to just passing around functions, rather than allowing for 'string or callable' all over the place. This may reflect the fact that I'm far more used to typed languages, and I'm seeing a lot of tests being required to make sure the string->fn mappings are set up correctly.

From a developer perspective, I agree. But from a user perspective, I do think strings can be much easier to use. Particularly if you're familiar with scikit-learn.

Can you clarify what the result of calling scores.to_dict() or scores.to_dataframe() would look like?

It depends a bit on what we choose as an underlying data structure of the scores object. But it would allow (particularly novice) users to easily convert the current object to a data structure they are comfortable working with. E.g. if a user has called scores.difference() they can convert the result to a pandas dataframe and visualize it using e.g. matplotlib.

To clarify further: in my head 'disparity' and 'aggregation' are two different steps; in the disparity step you define what a disparity between two groups is (i.e. either 'ratio' or 'difference') and the other how you aggregate those disparities between more than two groups (e.g. the min-max difference, average difference, max difference with one particular group of interest etc.). Would it be clearer if we would have something like scores.difference(aggregate="maxmin") or would that be even more confusing? I am also not sure how to convert that nicely to single scorer functions such as false_positive_rate_grouped(). And it would probably result in even more 'string or callable' patterns ;)

I didn't know that about attributes - interesting!

riedgar-ms commented 4 years ago

I would also be in favour of sticking to just passing around functions, rather than allowing for 'string or callable' all over the place. This may reflect the fact that I'm far more used to typed languages, and I'm seeing a lot of tests being required to make sure the string->fn mappings are set up correctly.

From a developer perspective, I agree. But from a user perspective, I do think strings can be much easier to use. Particularly if you're familiar with scikit-learn.

You're probably right.... it just feels dirty and uncomfortable to me.

Can you clarify what the result of calling scores.to_dict() or scores.to_dataframe() would look like?

It depends a bit on what we choose as an underlying data structure of the scores object. But it would allow (particularly novice) users to easily convert the current object to a data structure they are comfortable working with. E.g. if a user has called scores.difference() they can convert the result to a pandas dataframe and visualize it using e.g. matplotlib.

Could you propose a specific implementation? Right now, the returned object has an overall field, and a by_groups dictionary for the return (this was a bit more obvious in the days of the GroupMetricResult). I did look into flattening that to a dataframe (since one of the notebooks did that), but I always got hung up on how to do it. Specifically, what to call the rows when there's a danger that one of the subgroups could be called 'overall'.

To clarify further: in my head 'disparity' and 'aggregation' are two different steps; in the disparity step you define what a disparity between two groups is (i.e. either 'ratio' or 'difference') and the other how you aggregate those disparities between more than two groups (e.g. the min-max difference, average difference, max difference with one particular group of interest etc.). Would it be clearer if we would have something like scores.difference(aggregate="maxmin") or would that be even more confusing? I am also not sure how to convert that nicely to single scorer functions such as false_positive_rate_grouped(). And it would probably result in even more 'string or callable' patterns ;)

Perhaps something like scores.aggregated_disparity(disparity='difference', aggregate='max') where both arguments can also be functions? The disparity function should probably just take two floats. The aggregate function would need something more elaborate, in case users wanted all their measurements to be relative to a particular subgroup (which they would smuggle into their callable).
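As a sketch of that signature with callables (assuming the by-group scores live in a dict; names are made up):

import itertools

def aggregated_disparity(by_group, disparity=lambda a, b: abs(a - b), aggregate=max):
    # ordered pairs, so asymmetric disparity functions (e.g. plain ratios) are covered
    pairs = itertools.permutations(by_group.values(), 2)
    return aggregate(disparity(a, b) for a, b in pairs)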

hildeweerts commented 4 years ago

Could you propose a specific implementation? Right now, the returned object has an overall field, and a by_groups dictionary for the return (this was a bit more obvious in the days of the GroupMetricResult). I did look into flattening that to a dataframe (since one of the notebooks did that), but I always got hung up on how to do it. Specifically, what to call the rows when there's a danger that one of the subgroups could be called 'overall'.

Ah yes, I forgot about the overall/by_group pattern. Off the top of my head, I can see two possibilities. Either perform the call on the groups attribute only, i.e. you'd have something like scores.by_groups.to_dataframe(). If we only allow for one metric at a time, scores.overall will only contain one number, so I'd say it makes more sense to print it separately anyways.

The second option would be to use a multi-index dataframe. I think this would make the most sense if we allow for multiple metrics in the same summary. Which, if I think about it, would actually be quite useful. Is passing multiple metrics possible in the current API? The resulting dataframe could look something like this:

              fpr   fnr
overall NaN  0.10  0.20
group   A    0.05  0.10
        B    0.20  0.30
        C    0.01  0.05
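For reference, one way to build that layout with a pandas MultiIndex (the numbers are just the illustrative values from the table above):

import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('overall', None), ('group', 'A'), ('group', 'B'), ('group', 'C')])
summary = pd.DataFrame({'fpr': [0.10, 0.05, 0.20, 0.01],
                        'fnr': [0.20, 0.10, 0.30, 0.05]}, index=index)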

Perhaps something like scores.aggregated_disparity(disparity='difference', aggregate='max') where both arguments can also be functions? The disparity function should probably just take two floats. The aggregate function would need something more elaborate, in case users wanted all their measurements to be relative to a particular subgroup (which they would smuggle into their callable).

I like this idea! 👍

riedgar-ms commented 4 years ago

I have seen MultiIndexes a bit (they appear in the Moment code), but I'm not fully familiar with their details. If that sort of layout would be useful to data scientists, it seems reasonable, though.

One thing which has just occurred to me: are disparities always symmetric (i.e. would you always define the function such that disparity(a, b) == disparity(b, a)) or are there times when that's not the case? There would be some significant effects on the required aggregate() function if not.

romanlutz commented 4 years ago

@riedgar-ms disparities are not always symmetric. Think of the FPR ratio of two groups, for example.

@adrinjalali and @MiroDudik should probably chime in here as well.

riedgar-ms commented 4 years ago

disparities are not always symmetric. Think of the FPR ratio of two groups, for example

Agreed. For that particular case, one could adjust the specification of the ratio, much like we specify differences to be absolute values. However, I'm sure that there will be other cases not amenable to something like that (especially if someone tries defining disparities on confusion matrices).

Per a separate discussion which @MiroDudik and I had about a presentation I was putting together, rather than calling it disparity(a, b) we might want to call it fairness_comparator(a, b). Thoughts on that?

The aggregate() function is more complicated. In addition to the matrix of fairness_comparator() results (which may not be symmetric), we also need to supply comparisons to the 'overall' metric (evaluated on the whole dataset) - I only understood some of the kinks on the GridSearch() pareto fronts when @MiroDudik pointed out that the 'dominant model' function worked relative to the overall metric, while the plot was for difference between subgroups.

We could include these in the matrix of fairness_comparator() results, but we'd have to be rather careful about indexing. There's always a danger that any name we might pick for 'overall' could also be a subgroup name (I've alluded to this issue above, I believe).

hildeweerts commented 4 years ago

I'm a bit confused where the fairness_comparator(a,b) would come into play exactly. Would it be a replacement for the aggregated_disparity() you suggested, for the function argument disparity, or something else?

If it is the former, I agree something like scores.fairness_comparator(disparity='difference', aggregate='max') would be nice, although I'd probably go for a more active verb like compare. But it depends a bit on what we're going to call the summary/scores object.

I think we should probably raise an error if 'overall' is used as a subgroup name. I think most issues can be dealt with by proper indexing, but there's always the issue of comparing the actual overall to subgroup 'overall'; i.e. in a table with all comparisons you'll have (overall, overall) twice and it's impossible to tell which 'direction' the disparity is going without looking up the original scores.

MiroDudik commented 4 years ago

Let me comment on a couple of issues:

Base metrics

I don't mind allowing both strings as well as callable objects. This is a common pattern around metrics in sklearn, where you can pass a string or a scorer object to cross-validation routines. In fact our metrics engine already has some dictionaries of that kind, so it would be just a matter of adding another dict.

Aggregators

I am worried about over-engineering things. I don't think we should try to come up with generic aggregators (yet). I was thinking about using the following pattern for the Scores object

# the current behavior
scores.difference()

# some alternatives could be enabled with keyword arguments
scores.difference(relative_to='overall', aggregate='max_flip')
scores.difference(relative_to='fixed', fixed_group='male', aggregate='max')
scores.difference(relative_to='min', aggregate='max_flip') # default

# similarly for ratio:
scores.ratio(relative_to='max', aggregate='min_flip') # default
scores.ratio(relative_to='fixed', fixed_group='male', aggregate='min')
scores.ratio(relative_to='overall', aggregate='min_flip') 

The flip above means that you consider both a-b and b-a when taking the difference and a/b and b/a when taking the ratio. So the difference is the "absolute" difference. If it's confusing we could have a separate scores.difference() and a separate scores.signed_difference() and maybe similarly for the ratio.
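To make the flip behavior concrete on a single pair of values (helper names are just for illustration):

def difference_max_flip(a, b):
    # consider both a - b and b - a, i.e. take the absolute difference
    return max(a - b, b - a)

def ratio_min_flip(a, b):
    # consider both a / b and b / a and take the smaller, so the result is <= 1
    return min(a / b, b / a)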

It's not clear to me that we need to have a mechanism for custom aggregators. My sense is that for any custom behavior, it should be enough for the users to do one of the following:

  1. define their own functions that operate on Scores object (kinda like our "difference_from_summary")
  2. extend the Scores class
  3. export Scores into a pandas DataFrame and take advantage of pandas aggregators

Exporting

I'm not yet sure here. There are a few other patterns that might be handy, e.g.,

by_group, overall = scores.to_frames()
difference = (by_group-overall).abs().max()
difference_to_male = (by_group-by_group['male']).min()

This is the one I was thinking about to avoid custom aggregators.

hildeweerts commented 4 years ago

I agree. It's probably best to keep things as simple as possible and allow for flexibility by easy exports to pandas.

I like your suggested pattern to avoid generic aggregators. My hard-wired instinct to make things complicated says we might want to add None options. That allows you to better understand/debug weird results (e.g. due to accidentally flipping the binary target outcome).

I'm wondering what would be a good way to translate this to the fairness metrics, e.g. false_positive_rate_grouped().

riedgar-ms commented 4 years ago

Fair enough on avoiding the custom aggregators - especially if the DataFrame export can reasonably allow users to define things to their hearts' content.

My main concern is the complexity of the 'dispatch' logic at the top of the difference() and ratio() methods. That's going to be an interesting set of tests to write, to make sure we catch all the invalid combinations (unless I'm just thinking too much like a typed programmer again....).

I'm not a huge fan of the flip suffix, but I can't think of anything better.

kevinrobinson commented 4 years ago

hi yall! I just wanted to say that I love the way @hildeweerts framed the rationale for these decisions in the use cases of:

I can imagine that you use this for exploratory analysis. E.g. because you don't know what disparities are likely to occur or because you want to do a thorough analysis before you put your model into production.

and

I can imagine you use this for optimization or tracking during deployment; i.e. when you are aware of the types of disparities that are likely to occur.

I didn't notice that earlier since it's in the attached metrics.txt so I thought I'd call it out here. The first one is the primary need highlighted in Holstein et al. (2019) so that's one good source of validation. I don't know much about the second scenario, but my understanding is that using fairlearn for monitoring or tracking of deployed models is out of scope. Perhaps it might be helpful to understand if folks have different ideas about the user needs here?

It'd be amazing to see that wherever this discussion goes that it ultimately lands on "because of user need A, we are making this change, and this is how the design we chose helps meet that need" in the same way 👍

hildeweerts commented 4 years ago

To expand on the second use case, I was mostly thinking about the 'optimization' part. I imagine a scenario where you first do some exploratory analysis on your initial data and build a simple model. If you find an issue you'd like to keep track of, you can use fairness metrics like false_positive_rate_grouped() during model selection/hyperparameter optimization. Once you've picked a model you can again do a more thorough evaluation using the first approach.

riedgar-ms commented 4 years ago

Circling back to the original API: def score_groups(y_true, y_pred, sensitive_features, metric='accuracy'):

How should we handle extra arguments to the metric function (especially when metric is a callable rather than a string)? There are actually two types of these arguments - those that behave like the y_true etc. arrays (sample weights being a common example, but I'm sure people have come up with other names and uses), and those which should just be passed to the metric function itself. The latter will likely often be strings or scalars, but we can't assume that.

We can add a **kwargs to the end of this signature.... and perhaps another argument_handling dictionary, which would map the string naming each argument to something indicating the handling (split and pass_through perhaps?). For the cases where metric was a string, we'd pre-define the necessary dictionaries. Or we could just have a split_arguments argument which would be a list of names (and the remainder assumed to always be passed through untouched). The former has an advantage of verbosity, which would enable us to make sure that users had declared their intent for every argument. However, it also has the disadvantage of verbosity, and users might not like typing. I'm sure there are other possible ways of handling this as well....
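As a sketch of the split/pass-through idea (the argument_handling dictionary and the handling labels are hypothetical): array-like arguments get sliced per group, everything else is forwarded unchanged.

def score_groups(y_true, y_pred, sensitive_features, metric,
                 argument_handling=None, **kwargs):
    argument_handling = argument_handling or {}
    results = {}
    for group in set(sensitive_features):
        idx = [i for i, s in enumerate(sensitive_features) if s == group]
        metric_kwargs = {}
        for name, value in kwargs.items():
            if argument_handling.get(name) == 'split':
                # behaves like y_true/y_pred, e.g. sample_weight
                metric_kwargs[name] = [value[i] for i in idx]
            else:
                # 'pass_through': scalars, strings, etc.
                metric_kwargs[name] = value
        results[group] = metric([y_true[i] for i in idx],
                                [y_pred[i] for i in idx],
                                **metric_kwargs)
    return results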

One further suggestion for the Scores object: add a property string which records the full name (e.g. sklearn.metrics.accuracy_score) of the underlying metric. That will likely make serialisation easier in future.

hildeweerts commented 4 years ago

I've been thinking a bit more about this and I think the pattern suggested by @MiroDudik is a very good candidate.

As an alternative to the flip suffix, we could also add keyword arguments. For difference() these could be absolute (either True or False) and subtrahend (either 'subgroup' or 'relative_to'). And for ratio() you could have denominator (either 'subgroup' or 'relative_to').

I do think it would make sense to have a None option for aggregate because it allows for easy debugging.

Regarding arguments that need to be passed to a metric function, I'm not sure what the best approach would be. You could define two dicts, one for 'grouped_kwargs' and one for 'regular_kwargs', but I don't really like that solution either. If I recall correctly, @adrinjalali mentioned that the functionality to pass arguments such as sample_weights in scikit-learn pipelines will be added in the near future, perhaps we can get some inspiration there. I can't seem to find the related issue though.

riedgar-ms commented 4 years ago

I do think it would make sense to have a None option for aggregate because it allows for easy debugging.

What would be returned in that case? The matrix of differences (or ratios) between all pairs of the subgroups (plus overall)?

hildeweerts commented 4 years ago

What would be returned in that case? The matrix of differences (or ratios) between all pairs of the subgroups (plus overall)?

If we only allow for aggregate = None (so not relative_to = None), we can return either a data frame or dictionary with the differences for the pairs that are specified by relative_to.

MiroDudik commented 4 years ago

Re. aggregate=None

I'm worried about the function not always returning a scalar. One alternative is something like:

scores.differences(relative_to='fixed', fixed_group='male')        # returns pandas.Series
scores.differences(relative_to='fixed', fixed_group='male').max()  # aggregate as convenient
scores.differences(relative_to='overall').abs().max()     

I think that this would be actually a really good pattern to support "explore different metrics" scenario, but less well suited for mitigation / hyperparameter tuning scenarios, so those would still need to be of the form:

scores.difference(relative_to='fixed', fixed_group='male', aggregate='max') 
scores.difference(relative_to='overall', absolute=True, aggregate='max') 

Is that better or worse than scores.difference(...,aggregate=None)?

Re. exporting

@hildeweerts (and others), do you have a preference between the following options:

by_group, overall = scores.to_frame()             # overall is returned as a scalar
by_group = scores.to_frame()                      # overall is only accessed as scores.overall
combined = scores.to_frame(overall_key='overall') # overall is included in the frame with a flat index
combined = scores.to_frame()                      # overall is included in the frame with MultiIndex

hildeweerts commented 4 years ago

Re. aggregate=None

The scores.differences(relative_to='overall').abs().max() pattern looks convenient! I suspect most data scientists that program in python will be quite familiar with pandas, so at least they won't be bothered too much with new syntax.

Wouldn't the scores object only be used 'under the hood' for the fairness metrics in the mitigation/tuning scenario? I.e. you'd call something like false_positive_rate_grouped(y_true, y_pred, sensitive_features, disparity='difference', relative_to='overall', aggregation='max', fixed_group=None). If we also add additional arguments to avoid a potentially confusing flip suffix that would be a whole lot of arguments though. This doesn't have to be too much of a problem if we pick sensible defaults and document it well. Nevertheless, we need to think carefully about the trade-off between simplicity and flexibility. I'm a bit worried that if we don't add enough flexibility, the added convenience of Fairlearn compared to plain pandas would be marginal (but as stated before I am predisposed to adding complexity so I'd love to hear other people's views on this).

Re. exporting

I like by_group, overall = scores.to_frame() and by_group = scores.to_frame() best. Multi-indexing is nice for displaying tables but can be a bit annoying to process further if you're not super familiar with pandas.

Support multiple metrics

Do we want to support multiple metrics in a single scores object? In all industry projects that I've worked on, there's never been a case where only a single metric was relevant. I can imagine it'd be a bit annoying if you'd have to manually merge the scores to be able to easily compare them in a table.

riedgar-ms commented 4 years ago

My concern with the scores.differences(relative_to='overall').abs().max() pattern is: what do the intermediate calls return? They must be returning objects, due to the chain of method calls.

They might be mutating the internal state of the scores object, and then doing return self. In that case, there could be some interesting edge cases as the internal state gets mutated.

Alternatively, perhaps they return a different object type each time. However, that's really taking us off in a different direction. The GroupMetricResult object I initially created was rather similar to the scores object we have here. We removed that in favour of the current Bunch in order to minimise class creation. If we took this approach, we'd not only (effectively) restore the GroupMetricResult but also add a bunch of other classes.

Have more thoughts on the other issues; need to go right now.

riedgar-ms commented 4 years ago

For 'multiple metrics in a single object' do you mean something like:

metrics = MetricsObject(y_true, y_pred, sensitive_feature)
accuracy_scores = metrics.accuracy(sample_weights=sample_weights, extra_arg=something)
tpr = metrics.true_positive_rate()

Basically, the first call to a separate metric would calculate the confusion matrix, and it would be cached to generate the others (I believe AIF360 does this)? One immediate problem: how to handle the extra arguments (such as sample_weights) since they could make the confusion matrix non-cacheable.
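A rough sketch of that caching idea, for the binary case (names are illustrative, and as noted the cache breaks down once arguments like sample_weights come into play):

import numpy as np
from sklearn.metrics import confusion_matrix

class MetricsObject:
    def __init__(self, y_true, y_pred, sensitive_features):
        self._y_true = np.asarray(y_true)
        self._y_pred = np.asarray(y_pred)
        self._sf = np.asarray(sensitive_features)
        self._cm_by_group = None

    def _confusion_matrices(self):
        # computed once on first use, then reused by every derived metric
        if self._cm_by_group is None:
            self._cm_by_group = {
                g: confusion_matrix(self._y_true[self._sf == g],
                                    self._y_pred[self._sf == g], labels=[0, 1])
                for g in np.unique(self._sf)}
        return self._cm_by_group

    def true_positive_rate(self):
        rates = {}
        for g, cm in self._confusion_matrices().items():
            tn, fp, fn, tp = cm.ravel()
            rates[g] = tp / (tp + fn) if (tp + fn) else float('nan')
        return rates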

hildeweerts commented 4 years ago

Re. scores.differences().abs() pattern

If scores.differences() returns a pandas dataframe you get things like abs() and max() for free. Perhaps adding something like scores.differences(as_dataframe=True) or scores.differences().to_dataframe() could help to make this transition from scores object to a dataframe more explicit. Would something like that alleviate some of your concerns?

Re. Support multiple metrics

For multiple metrics I actually meant a pattern like this:

scores = score_groups(y_true, y_pred, sensitive_features, metrics=['accuracy', 'precision', 'recall'])

So you'd still have to pass the base metrics from the start, but the resulting scores are conveniently stored in the same scores object.

Perhaps we can get some inspiration from scikit-learn for handling arguments of metric functions. For 'regular' arguments that need to be passed to a metric, there is sklearn.metrics.make_scorer. AFAIK it currently doesn't support passing things like sample_weights but they are working on this: https://github.com/scikit-learn/scikit-learn/pull/16079.

riedgar-ms commented 4 years ago

For the scores.differences() pattern, I've just spotted @MiroDudik 's comment that this could always be a pandas Series. I've no objection to that, so long as it would cover all cases.

For the multiple metrics, once you have called:

scores = score_groups(y_true, y_pred, sensitive_features, metrics=['accuracy', 'precision', 'recall'])

what does calling scores.differences() return? Would it be a DataFrame, rather than a Series, with the column names corresponding to the selected metrics? That would make some sort of sense, although I'm not a fan of varying return types based on argument values (at least not without some abstract base classes and a factory in the mix).

What about scores.overall ? My guess would be a DataFrame with a column for each of the named metrics? Or a Series with one row per metric? Then similarly for scores.by_group[group_name] ? If we do that, then should scores.overall always return a DataFrame or Series (even if it's a single element because only one metric was specified in score_groups())?

hildeweerts commented 4 years ago

If we are going to support multiple metrics, I think I would be in favor of always returning a dataframe for the sake of consistency. This will be a bit weird in some cases, e.g. when you access 'overall' if there's only one metric, but it is probably better than returning different types all over the place.

MiroDudik commented 4 years ago

Emerging consensus

Let me try to summarize what I think we are currently converging on! (But please correct me if I'm wrong here.) The actual names of classes / optional arguments are still to be decided, but I'll try to make the API format clear.

Re. grouped metric object constructor and properties

For single metric, we consider something like the following:

class GroupedMetric():
    """
    Object with the overall metric value and metric values in each group.
    """

    def __init__(self, base_metric, *, indexed_params=None, **kwargs):
        """
        Parameters
        ----------
        base_metric : str or callable
            the metric being grouped; if callable this needs to be sklearn style metric
        indexed_params : str[] or None
            names of arguments that take form of arrays and need to be split
            when grouping (e.g., sample_weight)
        **kwargs :
            additional parameters to be passed to base_metric (e.g., beta=0.5 or pos_label=2
            for fbeta_score)
        """

    def eval(self, y_true, y_pred, *, sensitive_features, **kwargs):
        """
        Calculate the metric. After that, the metric values can be accessed in
        the fields `by_group` (dict) and `overall`.

        Returns
        -------
        result : GroupedMetric
            eval method returns `self` to allow chaining
        """

    def difference(self, relative_to='min', group=None, abs=True, agg='max'):
    """
    Difference style disparity (scalar).

        Parameters
        ----------
        relative_to : 'min', 'overall', 'max', 'rest', 'group', or str/int/float
            if none of the string constants are used then the value is interpreted as
            the name/id of the group relative to which the difference is measured, which
            is equivalent to choosing 'group' and specifying the id as `group` parameter
        group: str/int/float
            if relative_to='group' then this parameter provides the relevant group id
        abs: bool
            if True, the absolute value of the differences is taken
        agg : 'max', 'min'
            how group-level differences are aggregated
        """

    def differences(self, relative_to='min', group=None):
        """
        Differences of individual groups rel. to some baseline.

        Parameters
        ----------
        relative_to, group :
            the same as in the method `difference`

        Returns
        -------
        result : pandas.Series
        """

    def ratio(self, relative_to='max', group=None, numerator_choice='smaller', agg='max'):
    """
    Ratio style disparity (scalar). Parameters used to refine the type.

        Parameters
        ----------
        relative_to, group, agg :
            the same as in the method `difference`
        numerator_choice : 'smaller' or 'greater' or None
            if 'smaller' or 'greater' is selected, the ratio is taken with the specified numerator
        """

    def ratios(self, relative_to='max', group=None):
        """
        Ratios of individual groups rel. to some baseline.

        Parameters
        ----------
        relative_to, group :
            the same as in the method `difference`

        Returns
        -------
        result : pandas.Series
        """

    def max(self):
        """
        The maximum group-level metric.
        """

    def min(self):
        """
        The minimum group-level metric.
        """

    def to_frame(self):
        """
        Convert to pd.DataFrame (grouped metrics) and pd.Series (overall metric)

        Returns
        -------
        by_group, overall : pd.DataFrame, pd.Series
        """

Remaining questions

  1. Does it make sense to separate out eval from __init__?
  2. Do we want to exclude abs and numerator_choice from differences and ratios?
  3. Do we need to_dict? Note that the evaluated GroupedMetric has the fields by_group and overall.
  4. Any other issues with this?

Discussion in progress

Re. multiple metrics

I see a lot of utility, but I don't think we should enable multiple metrics for now (but definitely in future). There are two complications that need solving:

  1. Optional metric arguments: Some arguments only apply to one metric (think fbeta_score)
  2. Additional columns: Things get more complicated if we want to also measure things like error bars for each group (and now also for each metric?).

Re. predefined fairness metrics

[I'll comment on this one later on.]

hildeweerts commented 4 years ago

Re. grouped metric object constructor and properties

Thank you for this! I think it summarizes the conversation so far very well.

  1. Does it make sense to separate out eval from __init__? I think this fits well with the sklearn API in separating the settings from the data. I don't think sklearn has something similar we can copy the naming from, or does it?

  2. Do we want to exclude abs and numerator_choice from differences and ratios? I'd say it makes sense to include them in differences and ratios for the sake of consistency. Btw, it took me some time to understand the meaning of the numerator_choice argument. I suppose this is just a bit less natural than abs so we should make this very clear in the docs. I also want to reiterate that we should be very clear about which group is the subtrahend/denominator (i.e. either the subgroup or the relative_to group).

  3. Do we need to_dict? Note that the evaluated GroupedMetric has the fields by_group and overall. I assume you mean to_dict with a similar API as to_frame, right? I don't have a strong opinion on this. It might be nice for consistency, but also slightly redundant.

Re. multiple metrics

Is there a specific reason you want to handle 'static' metric arguments (like beta in fbeta_score) within GroupedMetric? Of course for 'dynamic' data-dependent arguments (a la sample_weight) this is required, but for static ones, you could also define a wrapper (e.g. similar in spirit to make_scorer)? In that case we could save **kwargs for the dynamic arguments. I am a bit worried about backward compatibility. A single-metric GroupedMetric would have different input (str/callable VS list of str/callable) and output (pandas series VS pandas dataframe) than the multiple-metric version. So switching will definitely break people's code, whereas adding support for handling metric arguments would not necessarily have that effect.

I have to say that error bars would be on my 'nice-to-have' list rather than 'must-have' list. In fact, if we allow for multiple metrics, you could define a callable that computes e.g. the lower error bar and pass it as a 'metric'. But perhaps I just don't understand exactly what you mean? What part of cross_validate are you referring to exactly?

riedgar-ms commented 4 years ago

That is a pretty big move away from what we've had to date - basically a metric_fn_factory() which generates the function which users would actually call. Looking again, I realise that it is consistent with @hildeweerts ' original score_groups() function; it's just that my attention had been caught by the 'return' of GroupMetricResult. I am a little concerned that this isn't the pattern used by scikit-learn.

As for the proposal itself.....

  1. The matter of __init__() and eval() comes down to the expected use pattern. Do we envisage users calling eval() more than once with different data? Related to this is where the **kwargs should go. If we want to support multiple calls, then I don't think that __init__() should have **kwargs at all. The only arguments to the constructor should be the underlying function and the list of indexed_params. If we don't allow repeated calls, then the only arguments to eval() should be y_true, y_pred, sensitive_features and the arrays listed in indexed_params. Anything else should be an error, and the **kwargs on __init__() should be made into properties (with a trailing underscore) similar to how SciKit-Learn does it for Estimators.

  2. I agree that abs and numerator_choice make sense for differences() and ratios()

  3. No opinion for to_dict()

  4. Other issues:

    1. Do we also want argmin and argmax properties on the object?

    2. I think that allowing relative_to to be anything other than the defined list of string constants is a bad idea. At some point, someone is going to have a subgroup called "rest" or "group" and then we have an ambiguous situation

    3. I think that group= should only accept an int or string. Accepting a float is asking for trouble

    4. I would like an extra property to give the name (e.g. sklearn.metrics.recall_score) of the underlying metric. I expect that would be useful for dashboard serialisation

hildeweerts commented 4 years ago

That is a pretty big move away from what we've had to date - basically a metric_fn_factory() which generates the function which users would actually call. Looking again, I realise that it is consistent with @hildeweerts ' original score_groups() function; it's just that my attention had been caught by the 'return' of GroupMetricResult. I am a little concerned that this isn't the pattern used by scikit-learn.

Could you expand a bit more on this? I'm not sure if I understand what you mean but it seems an important comment.

Related to this is where the **kwargs should go. If we want to support multiple calls, then I don't think that __init__() should have **kwargs at all.

In the current proposal the **kwargs in __init__() refer to 'static' arguments that are passed to the metric, no? These are not data-dependent, so I think it actually makes sense to have them in __init__ as it defines the properties of the underlying metric.

I agree with your concerns regarding relative_to & adding the property of the underlying metric! I don't have a strong opinion on group=, apart from the fact that as a user I'd be annoyed that I'd have to convert my accidental floats to ints, but that's just because I'm lazy with types ;)

riedgar-ms commented 4 years ago

To the first point, right now our group metrics are ultimately generated by make_metric_group_summary(). This takes in a metric function in the SciKit-Learn pattern, and returns a callable object (so it looks like a function to the user). So this is what lets us do:

>>> from fairlearn.metrics import recall_score_group_summary
>>> type(recall_score_group_summary)
<class 'fairlearn.metrics._metrics_engine._MetricGroupSummaryCallable'>
>>> y_t = [0,1,1,0,1,1,0]
>>> y_p = [0,1,0,1,1,0,0]
>>> s_f = [4,5,4,4,4,5,5]
>>> recall_score_group_summary(y_t, y_p, sensitive_features=s_f)
{'overall': 0.5, 'by_group': {4: 0.5, 5: 0.5}}
>>> recall_score_group_summary(y_t, y_p, sensitive_features=s_f, pos_label=0)
{'overall': 0.6666666666666666, 'by_group': {4: 0.5, 5: 1.0}}
>>> recall_score_group_summary(y_t, y_p, sensitive_features=s_f, pos_label=1)
{'overall': 0.5, 'by_group': {4: 0.5, 5: 0.5}}

We generate the functions like recall_score_group_summary() in _metrics_engine.py.

This is then related to the **kwargs in __init__() - with the current pattern, the user thinks they have a function, so they expect to pass those arguments every time. Does that explanation make sense @hildeweerts ?

I suppose that we could use __call__() rather than eval() but then having the 'static' arguments fixed in the constructor would still be a change from SciKit-Learn semantics. Not necessarily wrong but we should be clear about making such a change.

On group= accepting a float... I've just done some quick experimenting, and a[1.5] fails when a is a list or an ndarray. I would not be keen on defining our own semantics for the type conversion. Of course, this might just be due to my own quixotic tendencies around getting Python to respect types. And there is also the painful "When is an integer not an int? When it's a numpy.int64!" issue if we were to do the type check ourselves.

amueller commented 4 years ago

I know I'm very late to the game and I haven't caught up on everything, but I wanted to point out some of the issues with the sklearn metric function interface that you might face. For example, the meaning of pos_label is not entirely obvious and there's confusion about it even among the core devs. I was a strong advocate of the scorer interface, though it also has issues.

You might decide that it's ok to keep the established sklearn-like interface and live with the shortcomings, I just wanted to mention them briefly:

  1. When using the metric function interface (like recall_score), the user needs to know whether the metric takes a score or a prediction, and users often pass a prediction when they should have passed a score (i.e. calling roc_auc_score with y_true, y_pred). Many published results are wrong because of this. The "scorer" interface (that cross_val_score etc use internally) was supposed to fix that.

  2. The actual classes are unknown. For example, if you do macro-averaging, the result will depend drastically on whether all classes are present in the test set. The metric has no way to verify that (in many cases).

  3. The correspondence between class labels and scores is implicit. So if you have classes 'a', 'b', 'c', and a probability vector of shape (n_samples, 3) which column corresponds to which class? Sklearn defaults to lexical ordering, but some third party estimators might not adhere to that.

  4. For asymmetric binary metrics, there's no way to specify the semantically positive class. Which class is positive in recall_score is not really determined by pos_label but by a heuristic internal to sklearn. If the classes are True and False, True will be positive; if they are -1 and 1, then 1 is positive; if they are 0 and 1, 1 is positive. In all other cases, the lexically larger class is positive (IIRC). What pos_label does, for a metric that takes a score (like hinge_loss), is determine which of the class labels corresponds to a high value in the score.

At least 2 and 3 are relatively easily fixed by passing around all known classes, as computed on the full dataset. You might also want to do that for the protected attribute, as I assume you'll otherwise run into similar issues there. This is particularly an issue if you do cross-validation and not all values of a protected attribute are present in all folds. Then the metrics between the folds and between training and test set will be incomparable (and the user won't know).

In some cases, sklearn will currently produce wrong / misleading results based on the issues above (for example when using a non-stratified split for classification and using macro-averaging; though I'm not even sure what the correct result should be in that case).

amueller commented 4 years ago

Actually, I stand corrected, what pos_label means is dependent on the metric: https://github.com/scikit-learn/scikit-learn/pull/17704 and I guess only one of the core devs understands it lol.

riedgar-ms commented 4 years ago

@amueller I can see that we need to warn users that the exact meaning of the extra arguments they pass in (such as pos_label) is going to depend on the meaning given by their underlying metric function. I'm not quite seeing how that could blow up the driver method we're discussing here.... what am I missing?

amueller commented 4 years ago

@riedgar-ms I guess 4) is not as much of an issue, but I think the other items might be.

riedgar-ms commented 4 years ago

I was speaking to a colleague (who I can't @ mention, for some reason), and he raised the point in (2) - and also wondered what would happen if some of the groups lacked all the classes. He also raised a point about whether we want to handle multiclass explicitly.

gregorybchris commented 4 years ago

I'll chime in and hopefully make myself @-able. While I don't think I have anything truly novel to add to this conversation, I would like to attempt to lay out some of the design questions raised more explicitly.

A. Wrapping sklearn vs. injecting sklearn:

Fairlearn has already taken the stance that it will be unopinionated about the implementations (or even interfaces) of metrics that it wraps. The points @amueller raises all get at this question of whether Fairlearn should try to handle the validation and edge cases that sklearn fails to handle. There will be a tradeoff here between customer experience, Fairlearn codebase maintenance, and compatibility with user-defined metrics.

B. Results as an internal representation vs. open source representation:

This is about what is returned by Fairlearn when fairness metrics are computed, the intermediate representation of metrics disaggregated by groups. The tension here seems to be between some Fairlearn-specific representation of results/scores and an open source representation, specifically some combination of Pandas dataframes and Python dictionaries. While an internal representation allows for method chaining of Fairlearn-specific aggregators, an existing representation gives the user more freedom to aggregate and inspect the results with their own tools.

I should point back to the concerns of @hildeweerts and others that we want to make it easy for users to hook into this representation if they choose to.

C. Aggregation in one step vs. two steps:

Another question is about how aggregation should be implemented. This is somewhat dependent on design question B. There will be Fairlearn-specific aggregators that Pandas doesn't support out of the box. In one design, aggregators are built into the metric computation itself so that only the aggregation is returned from metric computation and there is only one fairness metric interface. On the other hand, to separate the concerns of metric computation and aggregation and to simplify the computation interface, there could be separate aggregators that have an implicit/explicit contract through the intermediate disaggregated representation.

The proposal from @MiroDudik around method chaining vs an aggregate argument is the main discussion here.

D. Metric functions vs metric names: (convenience)

Fairlearn can either remain unopinionated about metric implementations or it can accept metric names as a convenience to the majority of users who would default to sklearn metrics anyway.

E. Single metrics vs. multiple metrics: (convenience)

Users will often want results for more than one metric at the same time on the same data. That loop over metrics can either be left to the user, simplifying the Fairlearn interface, or provided as a convenience function that accepts a list/set of metric functions/names. The results format for multiple metrics is another matter.

I won't pretend that this is a full summary of the nuanced conversation above, but I did want to start to organize these different threads. Is this list missing any important decisions?

gregorybchris commented 4 years ago

Maybe one question I missed is how multiple sensitive features should be handled. If each sensitive feature has a different number of groups, that might not be best represented as a pandas dataframe, but potentially a dictionary of feature name -> dataframe?

riedgar-ms commented 4 years ago

I think that some of these discussions are mixing up the various levels of the API, and that's adding to the confusion.

Right now, there are basically three layers to the metrics API:

  1. The underlying group_summary() subroutine, which takes a user-supplied metric function and applies it to its y_true and y_pred arguments according to the groups specified by the sensitive_features argument.

  2. Our convenience wrappers, such as recall_score_group_summary(), which are wrapper functions (well, callable objects) which bind the named underlying metric into a call to group_summary() - the returned object is simply that from group_summary() (and currently a Bunch).

  3. Our convenience aggregators, such as recall_score_group_min(), which take the same arguments as the corresponding _group_summary() call, but perform some operation on the Bunch to return a scalar (thereby making them usable as SciKit-Learn scorers, for use in pipelines)

I think (please correct me if I'm wrong), that our main concern is how to handle the aggregators, which in turn depends on the exact type returned by group_summary(). Everything above is for sensitive_features being a single column. I'm actually starting to think a bit more about how to handle multiple sensitive features, but I'm not quite ready to share all those thoughts.

The group_summary() function

I would like to keep this function more-or-less as-is. The main question is the return type. I would like to propose that, rather than a Bunch, it should return a pandas Series, with the name taken from the underlying metric function, and the rows indexed by the unique values of the sensitive feature (use a MultiIndex for multiple sensitive features). I raised the issue of overall before, but @gregorybchris pointed out that it would be Pythonic to have a compute_overall argument to group_summary() which, if True, would change the result type to be a tuple, with the overall metric result as the first value, and the pandas Series the second.
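A minimal sketch of that return type for a single sensitive feature (compute_overall is the argument proposed above; the implementation details are mine):

import pandas as pd

def group_summary(metric, y_true, y_pred, *, sensitive_features, compute_overall=False):
    sf = pd.Series(sensitive_features).reset_index(drop=True)
    by_group = pd.Series(
        {g: metric([y_true[i] for i in idx], [y_pred[i] for i in idx])
         for g, idx in sf.groupby(sf).groups.items()},
        name=getattr(metric, '__name__', 'metric'))
    if compute_overall:
        return metric(y_true, y_pred), by_group
    return by_group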

For the most part, group_summary() should be completely agnostic about what the underlying metric does, and the type returned (not to mention the actual contents of y_true and y_pred - they may not be simple lists of classification results). If there are errors from the underlying metric, then group_summary() should just propagate them, and not attempt to 'fix' the issue.

However, as @amueller and @gregorybchris have both pointed out, things can go wrong if the subgroups identified by the sensitive features are missing classes in their y arrays (this could become particularly acute in the multiple sensitive feature case). For example, if the underlying metric is a confusion matrix, then missing classes could result in the individual matrices in each row of the Series having different sizes. Or worse, if (for example) females always get predictions of 'A' or 'B' but males always get 'C' or 'D' then the confusion matrices will be the same size.... and completely non-comparable.

The Convenience Wrappers

I think the main issue here is whether these should exist or not. We could allow users to call group_summary() with a string in place of a metric function (where we have an internal mapping of strings to actual functions), and not have these at all.

The Convenience Aggregators

If group_summary() returns a Series, I think that a lot of the aggregations would be available to data scientists through familiar operations. For any extras we wanted to add - for example, result.differences(relative_to='group', group='A', abs=True) - we could subclass Series, and only add the methods we need (the result of this call would be another Series). If we preferred not to subclass, we would instead provide a standalone differences(series, relative_to, ....) function which returned another Series.
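The standalone-function version could be as simple as the following sketch, operating on a plain pandas Series of per-group values (only a subset of the proposed relative_to options is shown):

def differences(by_group, relative_to='min', group=None, abs=True):
    # by_group: pandas Series of the metric value for each group
    if relative_to == 'min':
        baseline = by_group.min()
    elif relative_to == 'max':
        baseline = by_group.max()
    elif relative_to == 'group':
        baseline = by_group[group]
    else:
        raise ValueError("Unsupported relative_to: {}".format(relative_to))
    result = by_group - baseline
    return result.abs() if abs else result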

One downside to this is that to have relative_to='overall' the user would have to provide the value of the metric for the entire dataset (they could get the value by setting compute_overall=True when calling group_summary() but they wouldn't then be able to chain the entire call).

amueller commented 4 years ago

Sorry, I'm still catching up, but do you allow metrics that require continuous scores, such as average_precision_score? One of the reasons we have (at least) two different interfaces for metrics (the functions and the scorers) is that in some cases, like gridsearchcv, we need to automagically find out whether to call predict, decision_function or predict_proba.

It looks like you're mostly building on the function interface right now, and I was wondering if this issue came up at all.

MiroDudik commented 4 years ago

@amueller all of our current fairness metrics act on (y_true, y_pred, sensitive_features). They do not take an estimator / predictor itself as an argument. So it's right now up to the user to call the relevant function.

amueller commented 4 years ago

@MiroDudik this will not work when used inside cross-validation, unfortunately. Though passing the sensitive feature during cross-validation might be a bigger issue.

MiroDudik commented 4 years ago

The idea was to use them in combination with make_scorer. When it comes to sensitive_features, the idea was to use Adrin's hack that handles sample_weight until sklearn is updated.

amueller commented 4 years ago

That makes sense.

riedgar-ms commented 4 years ago

One small comment.... the meaning of y_true and y_pred is actually set by the metric function being passed in. So long as they are arrays of equal length (along with the sensitive feature), they just get sliced up and passed into the metric. But as @MiroDudik says, we expect them fully precomputed.

MiroDudik commented 4 years ago

Let me try to regurgitate items again, indicate where we stand on each item, and add a couple of new items. I'll be presenting this as a list of questions. Hopefully, it will make it easier to refer to, track, and add to it as needed. Please respond below!

Q1: Should we represent metrics as functions or objects?

Our status quo was to have only functions with the signature (y_true, y_pred, *, sensitive_features, **kwargs), which were returning simple types (scalars or Bunch). We also had factories to support common patterns of taking a "standard" metric of the form (y_true, y_pred, **kwargs) and turning it into a fairness metric (based on disaggregation into subgroups) and we had many predefined variants, mostly based on sklearn metrics. (The predefined metrics are actually not well documented, but you can see their names and the results that they return here and here.)

@hildeweerts pointed out that when tuning hyperparameters or monitoring fairness, it is useful to have functions returning scalars, but when working with various variants of disaggregated metrics, it is much more natural to have an object that feels more like a pandas.Series.

So the consensus seems to be that we should have both objects and functions.

Q2: What should the metric objects look like?

The purpose of metric objects is to evaluate and store the results of disaggregated metrics. Inspired by pandas.Series, we landed on the following near-consensus proposal of what this object should look like. The status of the follow-up questions is as follows (a rough sketch of such an object appears after the list):

  1. eval/__init__: (Unresolved)
  2. abs and numerator_choice should be included in differences and ratios for the sake of consistency. (Consensus)
  3. we don't need to_dict. (Consensus)
  4. other questions by @riedgar-ms :

    1. do we need argmin and argmax? (I don't think they add much value, and they are a pain to deal with, because they are often singletons but sometimes sets.)
    2. relative_to should only use string constants. (Consensus)
    3. group= should only accept an int or string. (Consensus)
    4. we should have fields with meta-information such as base_metric=sklearn.metrics.recall_score. (Consensus)
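
Purely to make the points above concrete, here is a rough, hypothetical sketch of such an object. The names follow the discussion (GroupedMetric, eval, by_group, overall, difference), but the defaults, the ratio() variant, and all implementation details are placeholders rather than a final design:

import pandas as pd

class GroupedMetric:
    # Hypothetical sketch assembled from the consensus points above; not a final API.

    def __init__(self, base_metric, **metric_params):
        self.base_metric = base_metric        # meta-information field (item 4.4)
        self._metric_params = metric_params

    def eval(self, y_true, y_pred, *, sensitive_features):
        # The exact eval/__init__ split is still unresolved (item 1).
        df = pd.DataFrame({'y_true': y_true, 'y_pred': y_pred, 'group': sensitive_features})
        self.overall = self.base_metric(df['y_true'], df['y_pred'], **self._metric_params)
        self.by_group = df.groupby('group').apply(
            lambda g: self.base_metric(g['y_true'], g['y_pred'], **self._metric_params))
        return self

    def difference(self, relative_to='min', group=None, abs=True, agg='max'):
        # relative_to uses string constants only (item 4.2); group is an int or string
        # (item 4.3); abs is kept for consistency (item 2). A ratio() method with
        # numerator_choice would mirror this structure.
        baselines = {'min': self.by_group.min(), 'overall': self.overall,
                     'group': self.by_group[group] if group is not None else None}
        diffs = self.by_group - baselines[relative_to]
        if abs:
            diffs = diffs.abs()
        return diffs.max() if agg == 'max' else diffs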

Q3: Should the metric objects subclass pandas.Series?

[Implicitly brought up by @riedgar-ms; this is an updated version of the comment. Another update: Richard does not think we need this, so let's resolve the answer as No.]

This seems interesting. I'm worried about the implementation complexity, the maintenance overhead (a deep dependence on another package), and the semantics of mutable Series operations like updates/additions/deletions of entries (how would these impact overall?); it is also not clear to me what the benefit is over just using to_frame().

I'm not sure what the exact proposal looks like, but one variant that I can imagine based on @riedgar-ms's description above is as follows. Consider calling

grouped = GroupedMetric(metric_function).eval(y_true, y_pred, sensitive_features=sf)

This call would return an immutable pandas.Series or pandas.DataFrame, extended with additional methods along the lines discussed above.

Ignoring implementation/maintenance complexity, I think the key is to make things immutable. Otherwise, I'm worried about the semantics of overall() and the associated functionality.
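
As a tiny illustration of the concern, assuming the result were a plain mutable pandas.Series plus an overall value computed at eval time (values made up):

import pandas as pd

by_group = pd.Series({'A': 0.70, 'B': 0.50, 'C': 0.60})
overall = 0.61   # computed from the full dataset when eval() ran

# Nothing stops a user from editing the mutable result...
by_group['B'] = 0.95
by_group = by_group.drop('C')

# ...but overall (and any cached difference/ratio) is now stale: it still
# reflects the original data, not the edited per-group values.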

Q4: Do we need group_summary()?

(Implicitly brought up by @riedgar-ms.)

I don't think so. I think it is now effectively replaced by the constructor of the metric object. This is because we could define group_summary() as follows (ignoring some details like indexed_params):

def group_summary(metric_function, y_true, y_pred, *, sensitive_features, **kwargs):
    return GroupedMetric(metric_function).eval(y_true, y_pred, sensitive_features=sensitive_features, **kwargs)

Q5: What should the predefined metric functions look like?

I think that all the predefined functions should keep the same signature as in the status quo:

<metric>_difference(y_true, y_pred, *, sensitive_features, **kwargs)
<metric>_ratio(y_true, y_pred, *, sensitive_features, **kwargs)
<metric>_group_min(y_true, y_pred, *, sensitive_features, **kwargs)
<metric>_group_max(y_true, y_pred, *, sensitive_features, **kwargs)

But they would all be defined via the grouped metric object. For example,

import sklearn.metrics as skm

def accuracy_score_difference(y_true, y_pred, *, sensitive_features,
                              relative_to='min', group=None, abs=True, agg='max', **kwargs):
    return GroupedMetric(skm.accuracy_score).eval(
        y_true, y_pred, sensitive_features=sensitive_features, **kwargs).difference(
        relative_to=relative_to, group=group, abs=abs, agg=agg)

Since I'm proposing to drop group_summary, I definitely don't think that we should have <metric>_group_summary. Instead, we could allow the names of the base metrics used in the predefined functions to be provided as arguments to GroupedMetric, e.g., GroupedMetric('accuracy_score').

Q6: Should we support strings in GroupedMetric?

I think we should allow strings for all the base metrics which appear in the predefined metrics and possibly some additional common cases.

Q7: What do we do about quirks and limitations of sklearn metrics?

As @gregorybchris said, we've decided to make all the arguments pass-through (modulo disaggregation according to sensitive_features). The assumption (possibly wrong!) is that users understand the quirks of the sklearn metrics and that, for the sake of consistency, we replicate those quirks. My personal take is that if we want to fix those issues we should either:

Q8: What should be the invocation pattern for evaluating multiple metrics?

As @gregorybchris said, evaluating multiple metrics is just a matter of convenience, but it's probably a common use case. I'd like to figure out the calling convention, not the return type.

Calling Convention 1

grouped = GroupedMetric([metric1, metric2]).eval(
    y_true, y_pred, sensitive_features=sf, args_dict={metric1: kwargs1, metric2: kwargs2})
# or
grouped = GroupedMetric([metric1, metric2], args_dict={metric1: kwargs1, metric2: kwargs2}).eval(
    y_true, y_pred, sensitive_features=sf)

Calling Convention 2

This was alluded to by @riedgar-ms above and is also similar to AIF360 pattern:

grouped_pred = GroupedPredictions(y_true, y_pred, sensitive_features=sf)
grouped1 = grouped_pred.eval(metric1, **kwargs1)  # a GroupedMetric object for single metric
grouped = grouped_pred.eval([metric1, metric2],
    args_dict={metric1: kwargs1, metric2: kwargs2})  # a GroupedMetric object for two metrics

Q9: Do we want to include any of the following extensions in this proposal?

There are several extensions worth considering that impact the format of an evaluated GroupedMetric object. This proposal is getting quite involved, so I suggest that we do not consider any of them right now (at least not in full detail). That said, I would still like to capture them, so we can keep them in mind and ensure that our proposal does not create friction for these extensions in the future.

  1. Additional column(s) defining "legitimate" risk factor(s). The use case here is conditional statistical parity that corresponds, e.g., to the requirement that "among defendants who have the same number of convictions, white and black defendants are detained at equal rates".

  2. Error bars and other information augmenting metrics. This is possibly a quite important use case that has been repeatedly requested. Besides measuring a metric, it would be nice to also provide error bars on its value.

  3. Multiple columns in sensitive_features. [Convenience.] There are two semantics here. Either each sensitive feature is considered separately (e.g., split separately into gender groups or race groups) or jointly (intersectional groups defined by gender + race). One of the two would need to be indicated by a separate boolean argument.

  4. Multiple metrics. [Convenience.] In this proposal, I'd like to figure out the invocation pattern (see Q8), but not necessarily the return type of GroupedMetric with multiple metrics.

I think that all of these can be handled by hierarchical indices (i.e., pandas.MultiIndex) along rows (for conditioning and sensitive features) and/or columns (error bars, multiple metrics). The main implication is that by_group and overall fields of the GroupedMetric object should be of type pandas.DataFrame rather than just a dictionary or a pandas.Series.
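
As an illustration (all numbers made up), here is a MultiIndex layout that covers all four extensions at once: rows are indexed by a conditioning feature and a sensitive feature, columns by metric and statistic:

import pandas as pd

index = pd.MultiIndex.from_product(
    [['0 priors', '1+ priors'], ['white', 'black']],
    names=['prior_convictions', 'race'])           # extensions 1 and 3 (conditioning, sensitive features)
columns = pd.MultiIndex.from_product(
    [['recall_score', 'selection_rate'], ['value', 'stderr']],
    names=['metric', 'statistic'])                  # extensions 4 and 2 (multiple metrics, error bars)

by_group = pd.DataFrame(
    [[0.81, 0.03, 0.40, 0.02],
     [0.78, 0.04, 0.38, 0.02],
     [0.72, 0.05, 0.55, 0.03],
     [0.69, 0.05, 0.57, 0.03]],
    index=index, columns=columns)

# e.g. all statistics for recall_score, disaggregated by conditioning value and race:
print(by_group['recall_score'])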

riedgar-ms commented 4 years ago

OK, that has sort of cleared some things up. Definitely ignore my comment about subtyping Series - that was based on a misunderstanding of what some things were doing (and even ignoring that, the more I thought about the idea, the less I liked it anyway).

If we're going to make metrics into objects, I think that question 8 is actually the big one. Looking at the single metric case, there are two basic options:

Option 1: GroupedMetric holds functions

In this scenario, a user would write something like:

accuracy = GroupedMetric(sklearn.metrics.accuracy_score, index_params=['sample_weights'], pos_label="Approved")
accuracy.eval(y_true, y_pred, sample_weights=sw)
print(accuracy.by_groups)
>>> { 'male':0.7, 'female':0.5 } # Probably a DataFrame or Series rather than a dict()
print(accuracy.overall)
>>> 0.6
print(accuracy.function_name)
>>> 'sklearn.metrics.accuracy_score'

and so on. Some open questions:

  1. Should we provide a pos_label_ attribute so users can remember what they set it to (and similarly for any other knobs and dials for the metric function which they set in the constructor)?

  2. If we do provide those attributes, should they be mutable?

Option 2: GroupedMetric holds data

This essentially reverses the constructor and eval() arguments:

my_data_for_metrics = GroupedMetric(y_true, y_pred, sample_weights=sw)
my_data_for_metrics.eval(sklearn.metrics.accuracy_score, index_params=['sample_weights'], pos_label='Approved')
print(my_data_for_metrics.by_groups)
>>> { 'male':0.7, 'female':0.5 } # Probably a DataFrame or Series rather than a dict()
print(my_data_for_metrics.overall)
>>> 0.6

And so on. Again we have the question of whether y_true etc. can be updated after they are set in the constructor.

If I'm still misunderstanding things, please let me know. I think that the above is a big decision, on which we need to be clear.

Other thoughts

For Question 7, I think the best option is to say that users need to know their tools. @gregorybchris can correct me if I'm wrong, but I believe that to 'save users from themselves' we would need to figure out the inner workings of the metric functions we're wrapping in great detail, and that would wind up tying us to particular versions of scikit-learn. Even then I'm sure we'll miss some extreme edge cases (and we wouldn't be able to do anything for users who brought their own metric functions - they might then be even more shocked that the magic we were doing for the scikit-learn metrics didn't work for theirs).

For Question 9(2), isn't providing an error estimate (or similar) the job of the underlying metric function? How would we provide that? Although we talk about y_true and y_pred being lists of scalars, we don't actually require that. They can be lists of anything, with the meaning of 'anything' imposed by the underlying metric function itself. Similarly, the metric function can return whatever it likes, and we stick it into the result structure. We already support confusion matrices in that way. If the metric wants to return an object containing both the value and the error bars, more power to it (although the aggregation methods we're planning on supplying are unlikely to work).
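
For instance, a hypothetical user-supplied metric could bundle the value with its error bar, and the disaggregation machinery would just store whatever it returns, exactly as it does for confusion matrices (the built-in aggregations would not know what to do with such objects):

from collections import namedtuple
import numpy as np

MetricWithError = namedtuple('MetricWithError', ['value', 'stderr'])

def accuracy_with_stderr(y_true, y_pred):
    # Hypothetical metric: accuracy plus a binomial standard error on it.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    acc = (y_true == y_pred).mean()
    return MetricWithError(acc, np.sqrt(acc * (1 - acc) / len(y_true)))

# Each per-group entry in by_group would then be a MetricWithError, not a scalar.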

Not to be settled now, but for Question 9(3 & 4), I suggest we start thinking in terms of DataFrames with the columns being the underlying metrics, and have a MultiIndex on the rows for the intersections of groups.