slifty closed this issue 1 month ago
From the README:
> Track the provenance and update history of all information, noticing and handling discrepancies. For example, if two different GMS tools connect to the PDC and provide conflicting information about an application or an applicant, the PDC may be able to pick the right answer automatically (based on an up-to-date date or on some other precedence rule), or it may flag the conflict and require a human to resolve it.
We are now focused on applicant/organization/changemaker/grantseeker data, which can come from proposal data, but we're not picking and choosing which proposal as such is the best version of a proposal. We're looking for conflicts within proposals that are about an applicant/organization/changemaker/grantseeker.
From issue #1087:
- Two proposals come in around the same time under the same EIN, but with different organization addresses.
- Separated by a year, two proposals come in under the same EIN, but with different organization addresses.
- A proposal and a third party data provider have provided conflicting organization addresses.
- A user has manually / directly provided a "corrected" organization address, a proposal comes in one year later with the previous / old organization address.
- A user has manually / directly provided a "corrected" organization address, a third party data provider has provided a new organization address that is significantly different from any previous value.
- A proposal has indicated an annual operating budget of $4m; the next year a proposal indicates an annual operating budget of $5m; the final year a proposal indicates an annual operating budget of $4m again.
- A proposal has indicated an annual operating budget of $4m; the next year a direct manual entry updates the annual operating budget to $5m; the final year a proposal indicates an annual operating budget of $4m again.
Taking a step back and imagining how conflicts start, I can think of these categories and examples:
I saw a summary online mentioning "syntactic" versus "semantic" differences. In the above, (2) would be closest to "syntactic" while (5) would be "semantic." Another source mentions three different ways that (5) could happen.
What about the ways to resolve conflicts? I can think of these categories, descriptions, and examples:
Any of the first several, and perhaps combinations, have a chance at being automated.
I suppose any of these could be applied to any of the above conflict origins, except I expect only people (e.g. 5) will be able to resolve the differing field interpretation issue.
The purpose of thinking of these resolution strategies is to help guide the creation of a small but diverse set of scenarios. I think we should have each of the above resolution strategies included in at least one of the scenarios.
A data conflict arises in any number of ways but is revealed relatively simply with a database query that groups by changemaker, base field, and value. If there is more than a single row for a changemaker's base field, there is a conflict. So conflict detection doesn't need any more elaboration, I don't think, except perhaps for how to render it to a user in a UI (future).
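To make the detection idea concrete, here is a minimal sketch of that query using an in-memory SQLite database. The table name `field_values` and its columns are illustrative assumptions, not the PDC's actual schema:

```python
import sqlite3

# Hypothetical schema: this table name and these columns are illustrative,
# not the PDC's actual schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE field_values (changemaker_id INTEGER, base_field TEXT, value TEXT)"
)
conn.executemany(
    "INSERT INTO field_values VALUES (?, ?, ?)",
    [
        (1, "organization_phone", "555-0100"),
        (1, "organization_phone", "555-0199"),  # conflicting value
        (1, "organization_website", "https://example.org"),
    ],
)

# A changemaker's base field is in conflict when it has more than one
# distinct value across sources.
conflicts = conn.execute(
    """
    SELECT changemaker_id, base_field, COUNT(DISTINCT value) AS n_values
    FROM field_values
    GROUP BY changemaker_id, base_field
    HAVING COUNT(DISTINCT value) > 1
    """
).fetchall()

print(conflicts)  # [(1, 'organization_phone', 2)]
```

The `GROUP BY` / `HAVING` pair is the whole detection mechanism: any group with more than one distinct value is a conflict to resolve or flag.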
In order to have specific scenarios, we need to specify the base field in each scenario. In order to have different resolution strategies, we will have to apply them to different base fields. Otherwise we could have a single resolution strategy across all fields. For example, one single strategy could be: Rater, failing that, Authority, failing that, Owner, failing that, Frequency, failing that, Spatial, failing that, Temporal. The single universal strategy could apply uniformly to all fields, with differing results based on the presence or absence of raters, authorities, or owners, etc. But reading between the lines, the gist of this issue is that different fields might require different resolution strategies. In any case, we don't have to think as much about the resolution scenarios, but I think a diverse set of base fields across scenarios helps.
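That single universal strategy could be sketched as a chain of resolvers, each returning a winner or deferring to the next. This is a hedged sketch under assumed metadata (the candidate fields, resolver names, and chain order here are illustrative, not an existing PDC API):

```python
from collections import Counter
from typing import Optional

# Each candidate carries illustrative metadata; these field names are assumptions.
# A "resolver" returns the winning value, or None to defer to the next strategy.

def by_authority(candidates):
    # Prefer a value from a source marked authoritative, if exactly one exists.
    winners = [c for c in candidates if c.get("authoritative")]
    return winners[0]["value"] if len(winners) == 1 else None

def by_frequency(candidates):
    # Prefer the most common value, if there is a unique most common one.
    counts = Counter(c["value"] for c in candidates).most_common(2)
    if len(counts) == 1 or counts[0][1] > counts[1][1]:
        return counts[0][0]
    return None

def by_temporal(candidates):
    # Fall back to the most recently posted value.
    return max(candidates, key=lambda c: c["posted_at"])["value"]

def resolve(candidates) -> Optional[str]:
    for strategy in (by_authority, by_frequency, by_temporal):
        winner = strategy(candidates)
        if winner is not None:
            return winner
    return None  # unresolved: flag for a human

candidates = [
    {"value": "555-0100", "posted_at": "2023-01-01", "authoritative": False},
    {"value": "555-0199", "posted_at": "2024-06-01", "authoritative": False},
]
print(resolve(candidates))  # 555-0199, via the Temporal fallback
```

Per-field strategies would then just be different orderings (or subsets) of the same chain, rather than a separate mechanism per field.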
Here is a compact list of some formal scenarios based on the above reasoning, with examples of chosen resolution strategies, as raw CSV to make it easy to paste into your favorite spreadsheet software:
"Base Field","Source A","Source B","Source C","Differences","Winner","Strategy"
"organization_name","Funder’s GMS X","Data Platform Provider Y","Applicant","A was posted later, B is longer, C is shorter and posted earlier","C","Owner"
"organization_phone","Data Platform Provider V","Data Platform Provider W",,"A was posted earlier, B was posted later","B","Temporal"
"organization_website","Funder’s GMS S","Funder’s GMS T","Funder’s GMS U","A and B agree, C differs","A","Frequency"
"organization_mission_statement","Data Platform Provider Z","Data Platform Provider R",,"The value of A is contained in the value from B, but B is longer.","B","Spatial"
"organization_tax_status_date","Applicant","Data Platform Provider Q","External IRS API","The values from A and B differ, a lookup to C matches B","B","Authority"
@slifty Is this anything close to what we require here?
@bickelj I appreciate the clarity of thinking & of communication here!
What is the difference between "Owner" and "Authority"? And in the last row, about organization_tax_status_date, why does source B (the Data Platform Provider) win? I would have thought that either A (the Applicant) or C (the External IRS API) would be more authoritative on this question than the DPP. Indeed, it's an interesting question whether A or C should be considered more authoritative... but in any case source B would be less authoritative than either of them, I would think.
More generally, though, I'm not clear on the difference between ownership and authority here. One could say that the IRS is the authority on the question of some org's tax status date; one could also, with a little semantic stretching, say that the IRS "owns" that information, in the sense that they determined it and they gave it the org. Is there really a difference between these two concepts?
/CC @slifty
Actually, let me also CC @jmergy here. Jonathan, I highly recommend glancing over Jesse's comment above, which is more important than my reaction to it here (then if you have time and want to follow in detail, my question here might be interesting, but it's not crucial to understanding the general framework we're using -- Jesse's original comment above is the right place to get that).
@kfogel Good point about authority versus owner. There may not be a real difference here. The distinction was maybe more procedural than semantic. In other words, I was thinking of "we have the owner's data already" versus "we are looking up data separately from our system entirely." Or perhaps "changemaker owner" versus "external owner" would be the distinction. They can probably be collapsed into one.
As far as the last example, where I have source B instead of C winning, you raise an interesting point. I had B win on the assumption that we would only choose a best value that has already been supplied to PDC (from either A or B) due to B matching C. But what if A differs from B, so we look up externally in C, and then C differs from both? Perhaps we would not have looked up in C had there been no difference. Or perhaps we should always check certain values in C every time. I don't know the answer. I lean toward doing automated lookup every time if we have an automated lookup capability.
In my table I distinguished "Applicant" from "Funder" but it would be unusual for applicants/changemakers to directly send data to PDC. That boils down to the funder being a proxy for the applicant/changemaker in that scenario and comparing to a data platform provider rather than another funder.
Based on feedback and reflection, here is an updated version of the scenarios, as raw CSV to make it easier to paste into spreadsheet software:
"Base Field","Source A","Source B","Source C","Differences","Winner","Strategy","Reasoning"
"organization_name","Data Platform Provider X","Data Platform Provider Y","Funder’s GMS Z","A was posted later, B is longer, C is shorter and posted earlier","C","Authority","The applicant is the authority and posted the data to the GMS."
"organization_phone","Data Platform Provider V","Data Platform Provider W",,"A was posted earlier, B was posted later","B","Temporal","Neither platform provider is more authoritative than the other, so we resort to time."
"organization_website","Funder’s GMS S","Funder’s GMS T","Funder’s GMS U","A and B agree, C differs","A","Frequency","None of the GMSes is more authoritative than the others, so we resort to the most commonly found value."
"organization_mission_statement","Data Platform Provider Q","Data Platform Provider R",,"The value of A is contained in the value from B, but B is longer.","B","Spatial","Neither platform provider is more authoritative than the other, and there is a special case of a longer value in one but otherwise matching data, so take the longer one."
"organization_tax_status_date","Data Platform Provider P","External API O",,"The value from A differs from B","B","Authority","Given the ability to look up tax data via external API call, always check tax data against the authoritative source."
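The table above implies a per-field mapping from base field to strategy. A minimal sketch of that mapping, plus the "Spatial" (containment) rule from the mission-statement row; all names here are hypothetical, mirroring the table rather than any existing implementation:

```python
# Hypothetical per-field strategy mapping, mirroring the table above.
FIELD_STRATEGIES = {
    "organization_name": "authority",
    "organization_phone": "temporal",
    "organization_website": "frequency",
    "organization_mission_statement": "spatial",
    "organization_tax_status_date": "authority",
}

def spatial(values):
    # "Spatial" per the mission-statement row: when one value contains the
    # other(s), take the longest (most complete) value.
    longest = max(values, key=len)
    if all(v in longest for v in values):
        return longest
    return None  # values are not nested; defer to another strategy or a human

print(spatial(["We feed people.", "We feed people. We also teach cooking."]))
# -> "We feed people. We also teach cooking."
```

A resolver would look up `FIELD_STRATEGIES[base_field]` to pick which rule to run first, which keeps "different fields might require different resolution strategies" as pure configuration.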
On call today @kfogel confirmed that this is in a reasonable place (this isn't to say we won't come up with more scenarios in future, or change these scenarios).
The next step is @bickelj will create an issue to capture turning these scenarios into tests.
The intention of this issue is to create a table of "scenarios" of different pieces of data coming from different sources, and "ideal" resolutions for that conflict. It may be that our implementation already covers the cases, but this will make it easy for us to evaluate (and maybe eventually write tests against).
There are a few scenarios that were already written out in #1087.