chaoss / wg-metrics-models

Working Group for Metrics Model

The Problem with Metrics is a Fundamental Problem #17

Closed bhack closed 1 year ago

bhack commented 2 years ago

It is partially connected to https://github.com/chaoss/wg-common/issues/164 and I think that it could also involve WG-risk.

This paper is somewhat AI-oriented, but it still gives a nice general overview of the mitigations we could adopt when we rely on metrics as targets:

https://arxiv.org/abs/2002.08512

bhack commented 2 years ago

Just to add a few other resources besides those in the original ticket:
https://www.holistics.io/blog/four-types-goodharts-law/
https://mpra.ub.uni-muenchen.de/90649/1/MPRA_paper_90649.pdf

bhack commented 2 years ago

https://www.ribbonfarm.com/2016/06/09/goodharts-law-and-why-measurement-is-hard/

bhack commented 2 years ago

https://openai.com/blog/measuring-goodharts-law/

bhack commented 2 years ago

Especially with enterprise open source software, org activities are influenced by defined KPIs/OKRs/targets.

So I want to give just two examples, randomly picked from our metrics, of what happens when they are transformed into targets/goals.

https://chaoss.community/metric-time-to-first-response/

Objectives: Identify cadence of first response across a variety of activities, including PRs, Issues, emails, IRC posts, etc. Time to first response is an important consideration for new and long-time contributors to a project, along with overall project health.

If we transform this metric into a KPI/target:

Risks:

Time to close: https://chaoss.community/metric-time-to-close/

Description: The time to close is the total amount of time that passes between the creation and closing of an operation such as an issue, change request, or support ticket. The operation needs to have an open and closed state, as is often the case in code review processes, question and answer forums, and ticketing systems.

If we transform this metric into a KPI/target, e.g. reduce the time to close by 10%.

Risks:

...and so on

sgoggins commented 2 years ago

@bhack : The concerns expressed in the paper you linked to in the issue are fairly central to our organizing values at CHAOSS. I think as a community we accept the fallibility of any given metric as a given. Our community does not try to produce dashboards. Instead, we are generating and refining consistent definitions for what we call discrete metrics, and building common representations of metrics collections using metrics models (i.e., this working group). So, I think you are in the right place. You will find many of the concerns expressed in your comments reflected in the meeting discussions of the risk working group, as you observed.

How can we work together to bring some of your ideas forward as an actionable metrics model?

bhack commented 2 years ago

How can we work together to bring some of your ideas forward as an actionable metrics model?

I'm not familiar with how the group works, as I'm pretty new. I think that on one side we should try to involve some experts on this subject, such as the authors of the related works. Is there any major conference on this topic?

In parallel, when we interact with the OSPOs, we should begin to understand how they direct resources, especially in "enterprise" open source projects, since resources are often allocated to achieve objectives that, for companies, must be measurable. What do the OSPOs do? Do they transform our metrics into measurable goals? Etc.

bhack commented 2 years ago

/cc @davidmanheim I think we have mentioned two of your works in this thread https://github.com/chaoss/wg-metrics-models/issues/17#issuecomment-1100697795. Are you still involved in this topic?

bhack commented 2 years ago

My proposal is to evaluate some sections of the model template using sections 4.1/5.1 of https://mpra.ub.uni-muenchen.de/90649/1/MPRA_paper_90649.pdf:

  1. Can we add, for every single included metric, a pre-gaming analysis of its risks?
  2. Coherence: what is the coherence of our metrics model?

In the model (template) we could add a model card section where we define:

  1. Diversification: how have we diversified our metrics portfolio to limit gaming?

    Because the different metrics typically require different behaviors, and they will be to some extent in tension with one another, they are likely to make gaming harder.

  2. Randomization: what kind of randomization strategy do we suggest?

    the weights on components or the relative rewards are uncertain, gaming the metric may become less worthwhile.

  3. What kind of soft metrics do we suggest integrating when an org adopts a specific model?

    Human judgment, peer evaluation, and other techniques may be able to reduce gaming specific to metrics. Metrics are often seen as a way to avoid subjectivity, but a combination of metrics and human judgment may be able to capture the best of both worlds.

  4. What limiting strategy do we suggest?

    Limiting Metrics. Failures are often the result of too much pressure on the optimization. By using metrics to set a standard or provide a limited incentive instead of presenting a value to maximize, the overoptimization pressure can sometimes be mitigated.

  5. Abandoning measurement: how are we going to review the model after it is applied in the field?

    Sometimes, the value of better incentivising participants and the potential for perverse incentive issues make it worthwhile to be wary of what Muller refers to as metric fixation. [Mul18] As he suggests, sometimes the best solution is to do nothing, or at least nothing involving measurement.

bhack commented 1 year ago

/cc @dynamicwebpaige (if interested)

germonprez commented 1 year ago

Closing as these issues are more corporately related and less about open source community health.

germonprez commented 1 year ago

Maybe create a blog post or a Discourse thread with respect to this issue.

bhack commented 1 year ago

In communities with a heavy presence of corporate members, developers, and repo gatekeepers, this is going to have a large impact, as all our metrics and our models are at risk of being deformed.