PhilanthropyDataCommons / service

A project for collecting and serving public information associated with grant applications

Formalize various scenarios of "conflict" in data #1176

Closed by slifty 1 month ago

slifty commented 2 months ago

The intention of this issue is to create a table of "scenarios" in which different pieces of data come from different sources, along with the "ideal" resolution for each conflict. It may be that our implementation already covers these cases, but this will make it easy for us to evaluate (and maybe eventually write tests against).

There are a few scenarios that were already written out in #1087.

bickelj commented 1 month ago

From the README:

Track the provenance and update history of all information, noticing and handling discrepancies. For example, if two different GMS tools connect to the PDC and provide conflicting information about an application or an applicant, the PDC may be able to pick the right answer automatically (based on an up-to-date date or on some other precedence rule), or it may flag the conflict and require a human to resolve it.

We are now focused on applicant/organization/changemaker/grantseeker data, which can come from proposal data, but we're not picking and choosing which proposal is the best version of a proposal as such. We're looking for conflicts, within proposals, about an applicant/organization/changemaker/grantseeker.

From issue #1087:

  1. Two proposals come in around the same time under the same EIN, but with different organization addresses.
  2. Separated by a year, two proposals come in under the same EIN, but with different organization addresses.
  3. A proposal and a third party data provider have provided conflicting organization addresses.
  4. A user has manually / directly provided a "corrected" organization address; a proposal comes in one year later with the previous / old organization address.
  5. A user has manually / directly provided a "corrected" organization address; a third party data provider has provided a new organization address that is significantly different from any previous value.
  6. A proposal indicates an annual operating budget of $4m; the next year a proposal indicates an annual operating budget of $5m; the final year a proposal indicates an annual operating budget of $4m.
  7. A proposal indicates an annual operating budget of $4m; the next year a direct manual entry updates the annual operating budget to $5m; the final year a proposal indicates an annual operating budget of $4m.
bickelj commented 1 month ago

Taking a step back and imagining how conflicts start, I can think of these categories and examples:

  1. Staleness. A changemaker’s office moved. The budget is from last year.
  2. Entry or processing error. A typo. Some of the data in a field is culled due to different length constraints.
  3. Missing or extra fields. The CEO name is missing from some sources. A contact phone number is included in some sources.
  4. Misrepresentation. A changemaker intentionally changed some facts when submitting to a funder to make an application more attractive to that funder.
  5. Differing field interpretation. A data platform provider mapped a field to PDC base field X while a GMS mapped a semantically differing field to PDC base field X.

I saw a summary online mentioning "syntactic" versus "semantic" differences. In the above, (2) would be closest to "syntactic" while (5) would be "semantic." Another source mentions three different ways that (5) could happen.
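
To make these categories concrete enough to attach to test scenarios later, here is a minimal TypeScript sketch. The type and field names are hypothetical, not part of the current PDC schema; later sketches in this thread reuse the same shape.

```typescript
// Hypothetical labels for the conflict-origin categories above (sketch only,
// not part of the current PDC schema).
type ConflictOrigin =
  | 'staleness'
  | 'entry_or_processing_error'
  | 'missing_or_extra_field'
  | 'misrepresentation'
  | 'differing_field_interpretation';

// A single reported value for a changemaker base field, with its provenance.
interface ReportedValue {
  changemakerId: number;
  baseField: string; // e.g. 'organization_address'
  value: string;
  source: string; // e.g. 'Funder GMS X', 'Data Platform Provider Y'
  reportedAt: Date;
  suspectedOrigin?: ConflictOrigin; // which of the categories above applies
}
```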

bickelj commented 1 month ago

What about the ways to resolve conflicts? I can think of these categories, descriptions, and examples:

  1. Temporal. The latest value wins. The address from proposal B wins because proposal B was posted later than proposal A.
  2. Spatial. The largest value wins. The changemaker mission statement from proposal B wins because it is longer than that of proposal A.
  3. Frequency. The most prevalent value wins. Proposals A and C have the same address and that one wins because proposal B is in the minority with a different one.
  4. Owner. The data “owner’s” value wins. The annual budget submitted by the changemaker trumps one submitted by a funder or data platform provider. Edit: collapsed into (5) Authority, below.
  5. Authority. The source closest to the authoritative value wins (if the authoritative source is a third-party system, that one wins). The annual budget submitted by the changemaker trumps one submitted by a funder or data platform provider. An automated call to an IRS web API matches one of the values found in a proposal and that one wins.
  6. Rater. Some person(s) decide. In a (future) PDC UI, people with sufficient authorization are able to mark the “correct” value.

Any of the first several, and perhaps combinations, have a chance at being automated.

I suppose any of these could be applied to any of the above conflict origins, except I expect only people (i.e. the Rater strategy) will be able to resolve the differing field interpretation issue.

The purpose of thinking of these resolution strategies is to help guide the creation of a small but diverse set of scenarios. I think we should have each of the above resolution strategies included in at least one of the scenarios.
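
As a rough sketch of how the automatable strategies might look (building on the hypothetical ReportedValue shape above, and assuming a nonempty set of conflicting candidates), each strategy can be a function that either picks a winner or declines:

```typescript
// Sketch only: a strategy inspects the conflicting candidates for one base
// field and returns a winner, or null if it cannot decide.
type ResolutionStrategy = (candidates: ReportedValue[]) => ReportedValue | null;

// Temporal: the latest value wins.
const temporal: ResolutionStrategy = (candidates) =>
  candidates.reduce((latest, c) => (c.reportedAt > latest.reportedAt ? c : latest));

// Spatial: the largest (here, longest) value wins.
const spatial: ResolutionStrategy = (candidates) =>
  candidates.reduce((longest, c) => (c.value.length > longest.value.length ? c : longest));

// Frequency: the most prevalent value wins.
const frequency: ResolutionStrategy = (candidates) => {
  const counts = new Map<string, number>();
  candidates.forEach((c) => counts.set(c.value, (counts.get(c.value) ?? 0) + 1));
  const top = [...counts.entries()].sort((a, b) => b[1] - a[1])[0];
  return top === undefined ? null : candidates.find((c) => c.value === top[0]) ?? null;
};

// Authority and Rater need inputs from outside the candidate set (an external
// lookup, a human decision), so they are left out of this sketch.
```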

bickelj commented 1 month ago

A data conflict arises in any number of ways but is revealed relatively simply with a database query that groups by changemaker, base field, and value. If there is more than a single row for a changemaker's base field, there is a conflict. So conflict detection doesn't need any more elaboration, I don't think, except perhaps for how to render it to a user in a UI (future).
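
As an illustration only (not the actual PDC schema or query), the same grouping idea over the ReportedValue sketch from earlier could look like this; the SQL equivalent would be roughly a GROUP BY on changemaker and base field with HAVING COUNT(DISTINCT value) > 1, using whatever the real table and column names are:

```typescript
// Sketch: group reported values per (changemaker, base field); more than one
// distinct value in a group means there is a conflict to resolve.
const detectConflicts = (values: ReportedValue[]): ReportedValue[][] => {
  const groups = new Map<string, ReportedValue[]>();
  for (const v of values) {
    const key = `${v.changemakerId}:${v.baseField}`;
    groups.set(key, [...(groups.get(key) ?? []), v]);
  }
  return [...groups.values()].filter(
    (group) => new Set(group.map((v) => v.value)).size > 1,
  );
};
```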

In order to have specific scenarios, we need to specify the base field in each scenario. In order to have different resolution strategies, we will have to apply them to different base fields. Otherwise we could have a single resolution strategy across all fields. For example, one single strategy could be: Rater, failing that, Authority, failing that, Owner, failing that, Frequency, failing that, Spatial, failing that, Temporal. The single universal strategy could apply uniformly to all fields, with differing results based on the presence or absence of raters, authorities, or owners, etc. But reading between the lines, the gist of this issue is that different fields might require different resolution strategies. In any case, we don't have to think as much about the resolution scenarios, but I think a diverse set of base fields across scenarios helps.
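
A sketch of what that single universal strategy could look like, expressed as a fallback chain over the hypothetical strategy functions sketched earlier (the ordering and the availability of rater/authority/owner inputs are left open):

```typescript
// Sketch: try each strategy in precedence order; the first one able to decide
// wins. If none can decide, the conflict is flagged for a human (Rater).
const resolveWithFallback = (
  strategies: ResolutionStrategy[],
  candidates: ReportedValue[],
): ReportedValue | null => {
  for (const strategy of strategies) {
    const winner = strategy(candidates);
    if (winner !== null) return winner;
  }
  return null; // unresolved; surface to a rater in a (future) UI
};

// Example using only the automatable strategies:
// resolveWithFallback([frequency, spatial, temporal], candidates)
```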

bickelj commented 1 month ago

Here is a compact list of some formal scenarios based on the above reasoning, with examples of chosen resolution strategies:

[attached table image: data_conflict_scenarios]

Same but more friendly to paste into your favorite spreadsheet software here:

"Base Field","Source A","Source B","Source C","Differences","Winner","Strategy"
"organization_name","Funder’s GMS X","Data Platform Provider Y","Applicant","A was posted later, B is longer, C is shorter and posted earlier","C","Owner"
"organization_phone","Data Platform Provider V","Data Platform Provider W",,"A was posted earlier, B was posted later","B","Temporal"
"organization_website","Funder’s GMS S","Funder’s GMS T","Funder’s GMS U","A and B agree, C differs","A","Frequency"
"organization_mission_statement","Data Platform Provider Z","Data Platform Provider R",,"The value of A is contained in the value from B, but B is longer.","B","Spatial"
"organization_tax_status_date","Applicant","Data Platform Provider Q","External IRS API","The values from A and B differ, a lookup to C matches B","B","Authority"

@slifty Is this anything close to what we require here?

kfogel commented 1 month ago

@bickelj I appreciate the clarity of thinking & of communication here!

What is the difference between "Owner" and "Authority"? And in the last row, about organization_tax_status_date, why does source B (the Data Platform Provider) win? I would have thought that either A (the Applicant) or C (the External IRS API) would be more authoritative on this question than the DPP. Indeed, it's an interesting question whether A or C should be considered more authoritative... but in any case source B would be less authoritative than either of them, I would think.

More generally, though, I'm not clear on the difference between ownership and authority here. One could say that the IRS is the authority on the question of some org's tax status date; one could also, with a little semantic stretching, say that the IRS "owns" that information, in the sense that they determined it and gave it to the org. Is there really a difference between these two concepts?

/CC @slifty

Actually, let me also CC @jmergy here. Jonathan, I highly recommend glancing over Jesse's comment above, which is more important than my reaction to it here. If you then have time and want to follow in detail, my question here might be interesting, but it's not crucial to understanding the general framework we're using; Jesse's original comment above is the right place to get that.

bickelj commented 1 month ago

@kfogel Good point about authority versus owner. There may not be a real difference here. The distinction was maybe more procedural than semantic. In other words, I was thinking of "we have the owner's data already" versus "we are looking up data separately from our system entirely." Or perhaps "changemaker owner" versus "external owner" would be the distinction. They can probably be collapsed into one.

As for the last example, where I have source B rather than C winning, you raise an interesting point. I had B win on the assumption that we would only choose a best value that has already been supplied to the PDC (from either A or B), with B winning because it matches C. But what if A differs from B, so we look up externally in C, and then C differs from both? Perhaps we would not have looked up in C had there been no difference. Or perhaps we should always check certain values in C every time. I don't know the answer. I lean toward doing the automated lookup every time, if we have an automated lookup capability.
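
One way to express "do the automated lookup every time we have the capability" is an Authority strategy built around an external lookup. Everything here is a placeholder: lookupAuthoritativeValue stands in for something like an IRS API client that PDC does not necessarily have today, and this sketch is async, unlike the synchronous strategies sketched earlier.

```typescript
// Sketch: an Authority strategy that consults an external source, if one
// exists for the base field, and picks the candidate that matches it.
type AuthorityLookup = (
  changemakerId: number,
  baseField: string,
) => Promise<string | null>; // null when no authoritative source is available

const makeAuthorityStrategy =
  (lookup: AuthorityLookup) =>
  async (candidates: ReportedValue[]): Promise<ReportedValue | null> => {
    if (candidates.length === 0) return null;
    const { changemakerId, baseField } = candidates[0];
    const authoritative = await lookup(changemakerId, baseField);
    if (authoritative === null) return null; // fall through to other strategies
    return candidates.find((c) => c.value === authoritative) ?? null;
  };
```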

bickelj commented 1 month ago

In my table I distinguished "Applicant" from "Funder," but it would be unusual for applicants/changemakers to send data directly to the PDC. In that scenario the funder's GMS acts as a proxy for the applicant/changemaker, so the comparison is really against a data platform provider rather than against another funder.

bickelj commented 1 month ago

Based on feedback and reflection, here is an updated version of similar scenarios: [attached table image: data_conflict_scenarios_2]. Here is the raw CSV to make it easier to paste into spreadsheet software:

"Base Field","Source A","Source B","Source C","Differences","Winner","Strategy","Reasoning"
"organization_name","Data Platform Provider X","Data Platform Provider Y","Funder’s GMS Z","A was posted later, B is longer, C is shorter and posted earlier","C","Authority","The applicant is the authority and posted the data to the GMS."
"organization_phone","Data Platform Provider V","Data Platform Provider W",,"A was posted earlier, B was posted later","B","Temporal","Neither platform provider is more authoritative than the other, so we resort to time."
"organization_website","Funder’s GMS S","Funder’s GMS T","Funder’s GMS U","A and B agree, C differs","A","Frequency","None of the GMSes is more authoritative than the other, so we resort to the most commonly found value."
"organization_mission_statement","Data Platform Provider Q","Data Platform Provider R",,"The value of A is contained in the value from B, but B is longer.","B","Spatial","Neither platform provider is more authoritative than the other, and there is a special case of a longer value in one but otherwise matching data, so take the longer one."
"organization_tax_status_date","Data Platform Provider P","External API O",,"The value from A differs from B","B","Authority","Given the ability to look up tax data via external API call, always check tax data against the authoritative source."
slifty commented 1 month ago

On a call today @kfogel confirmed that this is in a reasonable place (this isn't to say we won't come up with more scenarios in the future, or change these scenarios).

The next step is for @bickelj to create an issue to capture turning these scenarios into tests.