PhilanthropyDataCommons / service

A project for collecting and serving public information associated with grant applications

Think through how we want to identify and present "gold" data from the PDC #1087

Closed: slifty closed this issue 2 months ago

slifty commented 3 months ago

As of 2024-08-01, our best understanding is that "gold" data only refers to Organization/ChangeMaker data. In other words, we don't expect folks to query PDC for a "gold" version of a proposal; we expect them to query PDC to get the best data for a given Organization/ChangeMaker.

bickelj commented 2 months ago

I confess I haven't used bulk upload yet, so I don't know all the details of how intermediate data are stored. My views might change after trying that.

In some sense, we already present the "gold" data in the PDC, so the tricky part is how to present "non-gold" or "pre-gold" data, e.g. the CSV file that was uploaded or the individual field responses that need work. I think we already have a start: we present the proposal data that matches the expected form and flag the proposal data that does not.

Then again, the above sense could be thought of as mere validity, whereas we may want "gold" to mean more than valid, e.g. valid and sound. In that case, we'll need human help to flag gold data separately from mere validity.

bickelj commented 2 months ago

I suppose we are looking not just at proposal-specific data but organizational data as well. We may need to compose a view of an organization across several proposals and/or other posts of data. The simplest rule is probably "latest post/put to PDC wins," but that is obviously incorrect: old or stale data can be synced from a system, so "created in PDC" should be a last resort.

We probably should ask posters of data to include the last-updated timestamp from their system and keep that in PDC. Even the reported last-updated timestamp is crude, though, because time has nothing to do with the quality of the data. It depends more on who made the update. For example, if someone from organization B updated records for organization A, that could be considered worse than if someone from organization A updated records for organization A. Then again, someone from organization A might input malformed data.

This gets complicated fast, but it boils down to identifying some person or group of people as being at the top of a priority list for calling a record good. That, and formal validity of the data. Something like:

  1. Are the data valid? If so, consider it potential gold.
  2. Are the data recent? If so, consider it potential gold. (Or don't consider dates at all with regard to goldness)
  3. Were the data updated by a known and expected party? If so and the above are true, call it gold.

There may be some fields that we can corroborate with automated checks, such as searching an EIN in some system, which might be good, but that does not generalize to all fields. We should have some date on all fields and should have some validity check on all fields, however meager. And we should have some way to track the provenance of the data (see #1083) such that we can judge the third point above.
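
A minimal sketch of how those three checks might compose; all names here (FieldDatum, trustedParties, the recency threshold) are hypothetical illustrations, not anything in the PDC codebase:

```typescript
// Hypothetical representation of a single field datum and its provenance.
interface FieldDatum {
  value: string;
  isValid: boolean; // passed automated type/format validation
  updatedAt: Date; // last-updated timestamp reported by the posting system
  updatedBy: string; // person or system that made the update
}

const RECENCY_LIMIT_MS = 365 * 24 * 60 * 60 * 1000; // e.g. one year

const isPotentialGold = (
  datum: FieldDatum,
  trustedParties: Set<string>,
): boolean =>
  datum.isValid && // 1. are the data valid?
  Date.now() - datum.updatedAt.getTime() < RECENCY_LIMIT_MS && // 2. recent?
  trustedParties.has(datum.updatedBy); // 3. known and expected party?
```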

slifty commented 2 months ago

Two lines of thought that both warrant design and attention:

  1. How do we identify / determine gold data
  2. How do we represent / store / serve gold data

How to Identify

I think those three considerations @bickelj put out above are great. We might also want to consider an additional sub-category of "valid": "acceptable quality." (I could imagine certain types of fields where a value can be semantically valid from a literal field-type point of view but still be considered "low quality".)

I also think that we can potentially be more granular with regard to "known and expected party"; e.g., on the call I was starting to get at the idea of having defined data sources with a hierarchy of "authority" -- direct entry from a user via PDC might have higher authority than a GMS source or a third-party data platform.
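
As a sketch of what such a ranking could look like -- the SourceKind values and their ordering below are made up for illustration:

```typescript
type SourceKind = 'direct_user_entry' | 'gms' | 'third_party_platform';

// Lower rank = higher authority; this ordering is illustrative only.
const sourceAuthority: Record<SourceKind, number> = {
  direct_user_entry: 0,
  gms: 1,
  third_party_platform: 2,
};

// Given two conflicting values, prefer the more authoritative source.
const preferByAuthority = <T extends { sourceKind: SourceKind }>(
  a: T,
  b: T,
): T => (sourceAuthority[a.sourceKind] <= sourceAuthority[b.sourceKind] ? a : b);
```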

It might be helpful to think about some scenarios and what outcome our ideal "algorithm" would result in:

  1. Two proposals come in around the same time under the same EIN, but with different organization addresses.
  2. Separated by a year, two proposals come in under the same EIN, but with different organization addresses.
  3. A proposal and a third-party data provider have provided conflicting organization addresses.
  4. A user has manually / directly provided a "corrected" organization address; a proposal comes in one year later with the previous / old organization address.
  5. A user has manually / directly provided a "corrected" organization address; a third-party data provider has provided a new organization address that is significantly different from any previous value.
  6. A proposal has indicated an annual operating budget of $4m a year; the next year a proposal indicates an annual operating budget of $5m a year; the final year a proposal indicates an annual operating budget of $4m a year.
  7. A proposal has indicated an annual operating budget of $4m a year; the next year a direct manual entry updates the annual operating budget to $5m a year; the final year a proposal indicates an annual operating budget of $4m a year.

There may be an argument for attempting to capture the "longevity" of a given base field type -- for instance, annual budgets might have an intended longevity of 12 months, whereas organization address might be indicated as having indefinite longevity. (Beyond just collisions, this could also help us highlight stale gold data in our system.)
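
A sketch of what per-base-field longevity metadata might look like; the shortCode values and the rough month arithmetic are assumptions:

```typescript
interface BaseFieldLongevity {
  shortCode: string;
  longevityMonths: number | null; // null = indefinite longevity
}

const exampleLongevities: BaseFieldLongevity[] = [
  { shortCode: 'organization_budget', longevityMonths: 12 },
  { shortCode: 'organization_address', longevityMonths: null },
];

// A value older than its base field's longevity could be flagged as stale gold.
const isStale = (recordedAt: Date, longevityMonths: number | null): boolean =>
  longevityMonths !== null &&
  Date.now() - recordedAt.getTime() >
    longevityMonths * 30 * 24 * 60 * 60 * 1000; // rough month length
```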

How to Represent

If computation was infinite and instantaneous, the way to ensure the most "up to date" decision on gold data would be for the service to calculate the correct field values at the point of querying. We don't live in that world, so... moving on!

The next best thing, I think, would be to store gold data status somehow, and then update the statuses every time there is a new insert that would affect the data (e.g. every time new data are entered that touch a given organization, we update that organization's gold data representation).

This could be done as a "cache" -- but I think we would want that cache to be dynamic rather than a single table, since the columns of that table would need to update as base fields are added.

To that end, I think we could simulate that cache by directly adding a column like "isGold" to the fieldValues table -- and at that point we can just have a query that selects the most recent fieldValue for a given baseField for a given organization.
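
Assuming guessed table and column names (field_values, is_valid, etc.) rather than the actual schema, a Postgres query backing that selection could look something like:

```typescript
// Latest valid field value per base field for one organization;
// table/column names are hypothetical, not the real PDC schema.
const selectLatestValidFieldValues = `
  SELECT DISTINCT ON (fv.base_field_id)
    fv.*
  FROM field_values fv
  WHERE fv.organization_id = $1
    AND fv.is_valid
  ORDER BY fv.base_field_id, fv.created_at DESC;
`;
```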

(as an aside, I think we might want to normalize the concept of a fieldValue so that it can either be associated with a proposal version OR associated with whatever we call the entity that represents "organization data provided either directly or via a third party data platform". But let's talk about that later.)
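
One possible shape for that normalization, sketched as a tagged union; DirectDataAssociation is a placeholder name for the not-yet-named entity:

```typescript
interface ProposalVersionAssociation {
  kind: 'proposalVersion';
  proposalVersionId: number;
}

interface DirectDataAssociation {
  kind: 'directData'; // placeholder for "organization data provided directly or via a platform"
  sourceId: number;
}

interface NormalizedFieldValue {
  baseFieldId: number;
  value: string;
  association: ProposalVersionAssociation | DirectDataAssociation;
}
```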

bickelj commented 2 months ago

Just another thought among many: a user-flagged-this-datum table suggesting this value looks suspicious. And: if multiple sources/origins/persons upload the same data, shouldn't that also improve its goldness?

jmergy commented 2 months ago

Timing would also be a major factor in this. I could see having a user associated with an organization be able to bless a proposal as "gold" in some way (or some aspects of its data as gold), but after a year it may no longer be (budget, address, etc.), and said user will probably not come back and un-gold it. However, we could see other proposals come in more recently with exact or similar data that could either add validation/confirmation to the gold or raise a question about the previously denoted gold.

bickelj commented 2 months ago

When we first encountered the concept of "gold" data, I could easily visualize it like a deployment pipeline: all data are candidates, they pass through various trials, and a few successful data make it all the way to the end. In my original conception, this meant that presence in the PDC relational database meant "gold" and non-presence meant "non-gold." But that is not where we are going with "gold" in the PDC. We want data in the relational database, gold and non-gold. It is not a mechanical matter to discover which data are best. We can use a handful of "is this a valid value given that it is type X" checks to mark invalid data but that's small potatoes and misses the larger purpose.

There is an open question of how we approach this idea of "gold data." Some options come to mind:

For any of the above options, we still need mechanisms to get human input on whether a given person thinks a value is correct or incorrect. That feedback will have to be formalized for any of the above to work.

I like @hminsky2002's suggestion of crowdsourcing. The problem is we will not have a crowd from which to source; we might have a handful of folks. So my adaptation of that is to suggest "when two humans mark a value good, it's gold."

Regarding staleness, yes, I think in the feedback form we'd need some kind of timeout on human markings or allow humans themselves to say "my answer should be good for X duration."
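
A sketch of what such a marking with a self-declared expiry could look like, combined with the "two humans" consensus rule from the previous comment; all names are hypothetical:

```typescript
interface HumanMarking {
  fieldValueId: number;
  markedBy: string;
  markedAt: Date;
  goodForDays: number | null; // null = no self-declared expiry
}

const DEFAULT_GOOD_FOR_DAYS = 365; // fallback timeout on human markings

const markingIsCurrent = (m: HumanMarking): boolean =>
  Date.now() - m.markedAt.getTime() <
  (m.goodForDays ?? DEFAULT_GOOD_FOR_DAYS) * 24 * 60 * 60 * 1000;

// "When two humans mark a value good, it's gold."
const isGoldByConsensus = (markings: HumanMarking[]): boolean =>
  markings.filter(markingIsCurrent).length >= 2;
```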

bickelj commented 2 months ago

> If computation was infinite and instantaneous, the way to ensure the most "up to date" decision on gold data would be for the service to calculate the correct field values at the point of querying. We don't live in that world, so... moving on!

I know it doesn't change our goals or direction but I think even if we had infinite and instantaneous computation we would still not have everything we needed to mark things gold or not. The "right" answer may not be anywhere online or in any computer system at all. It is the people interacting with the PDC software who judge whether it's gold or not gold and that judgment is in someone's mind until recorded in PDC.

bickelj commented 2 months ago

> It might be helpful to think about some scenarios and what outcome our ideal "algorithm" would result in:

The above 7 scenarios (and more like them) are going to be key tests, yes, thanks for these!

bickelj commented 2 months ago

Just one more thought: until we design UI for marking gold/non-gold data, I think we need to focus on the first level, which is validity that can be automatically determined.

bickelj commented 2 months ago

As of this moment, my best thought is to leave "valid vs. invalid" dichotomous but use several quality indicators rather than a single "gold vs. not gold" value, i.e. the option to 'Use non-scale descriptive labels. Example: "Came from X", "considered good by Y on date Z".'

Each indicator individually should be dichotomous.

Example indicators/items:
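
As a sketch, each indicator could be a dichotomous flag along these lines (indicator names are illustrative, echoing the labels quoted above):

```typescript
interface QualityIndicators {
  isValid: boolean; // passes automated type/format checks
  cameFromChangemaker: boolean; // origin is the organization itself
  markedGoodByHuman: boolean; // some person reviewed and approved the value
  corroboratedByAnotherSource: boolean; // same value arrived from a second source
}
```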

The benefits:

The drawbacks (as opposed to deciding on "gold vs not gold" as the implementation):

Some of the difficulties, however, are inherent to the problem. How can one summarize the quality of an individual datum without some complex system of quality review? Either the details are hidden behind an overly simplistic dichotomous value like "gold vs not gold" or the real complexities are embraced. The nice thing about the many items approach is that with some calibration and refinement over time, it is technically possible to present a quantitative summary of the qualitative items.

bickelj commented 2 months ago

Notes from meeting today, 2024-07-23:

Reminder to self: this is about organization/changemaker data.

@jasonaowen suggests that for the time being we cannot assume any user review. @slifty says it is not scalable to integrate human review.

The general approach of many quality indicators (starting simply with isValid) seems unobjectionable to the team.

@slifty suggests that now, with provenance, we have concepts of origins; e.g., prefer the organization/changemaker source over funder sources, etc.

And isValid should be a first pass, not combined with other quality indicators.

bickelj commented 2 months ago

The approach I am taking now is to create a new endpoint, tentatively organizationDetail/{organizationId}. The response will tentatively include both the best (aka gold) data for each ProposalFieldValue and ExternalFieldValue linked to a base field having organization scope, and the list from which the best was drawn. It could potentially include a reason for the selection of a given value -- date, (future) Source, etc. -- and those reasons will be valid as a starting point.
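
A guess at what that tentative response shape might look like; the property names are placeholders, not the actual API:

```typescript
interface FieldValueWithReason {
  baseFieldShortCode: string;
  value: string;
  reason?: string; // e.g. 'most recent valid value', 'preferred source'
}

interface OrganizationDetailResponse {
  organizationId: number;
  bestFieldValues: FieldValueWithReason[]; // the "gold" pick per base field
  allFieldValues: FieldValueWithReason[]; // the list the best was drawn from
}
```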

At the moment, the provenance (Source, ExternalFieldValue) work is ongoing in #1116, no problem.

As of #777 there is no explicit link between proposals and organizations anymore. An applicant (now called an organization) used to need to be explicitly created in PDC, so an applicant was always linked to a proposal. This is a problem because I was hoping to return ProposalFieldValues as candidates to be refined into gold. Maybe I could look at the fields in the application to figure it out. But we also start with a kind of blank slate for base fields, so there is no guarantee that any of our seed base fields exist; otherwise we might be able to look up the organization via some anchor base field.

Oh. I now see organizations_proposals. It is there :blush: .

bickelj commented 2 months ago

The current tack of "get the best data by organization via some endpoint" was affirmed by @jasonaowen and @kfogel a few moments ago. @jasonaowen suggested folding it into GET /organizations (I had hinted that I was open to such folding). @kfogel might prefer a scored list in the resulting body rather than separate/copied "best" and "all" fields, but is OK with continuing with the current draft that separates them. Because we don't have a scored list at the moment, and because the use case for "all" is currently debugging, @jasonaowen and @kfogel seemed OK with the current separation.

Demonstrating the capability in action will require a bit of preparation, because we need to make clear, when posting, say, three separate proposals for one org/changemaker, that the "best" values for an organization can individually be drawn from any of these proposals, depending on validity and so forth.

This PR can get into a form where it can be merged prior to the Provenance PR #1116 and then be enhanced using the new provenance data. And some types can be refactored a bit to account for the new data types present.

@slifty One thing I think we still need is some of the base fields to be marked with ORGANIZATION scope, because currently none of them are. I assume that's an easy thing to fix but I wonder if you would expect anything bad to happen if one of us marked every org... field with scope ORGANIZATION.

bickelj commented 2 months ago

@jasonaowen @slifty One thing that pushed me toward separating /organizationDetail from /organization (which I forgot to mention to anyone) is that /organization is currently wide open for public reads, whereas some of these details may need to be hidden. Currently I have the new endpoint set to require authentication, but if folding this into /organizations makes sense, we'd have to either let these details be public or restrict what's currently public.

jasonaowen commented 2 months ago

Ah, authorization is an interesting question, @bickelj! I think that's a business decision we'll need @kfogel's input on.

That gets especially interesting when combined with the fine-grained authorization SOW item: if I ask for gold data of an organization, but I don't have permission to see all the relevant proposals, what data should I get back? (Could that situation actually happen with how we are planning fine-grained permissions?)

Sooner or later, I suspect we'll need to have this endpoint take into account who is making the request: an unauthenticated user, a user who can't see any of the relevant data, a user who can see some of it, a user who can see all of it, or an admin.

If we do need to keep it separate, I'd propose the route /organizations/:id/details or so.

bickelj commented 2 months ago

@jasonaowen Excellent thought about fine-grained permissions aka authorization. I assume we will filter these requests in accord with whatever authorization the user has. In other words, if the requester should not see proposal data from proposal X, then those data would not be selected for "gold" extraction in this endpoint. I pushed a commit that renames some of the keys because of this.
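
A sketch of that filtering step, where canSee and pickBest stand in for real permission checks and gold-selection logic (both hypothetical):

```typescript
interface CandidateValue {
  baseFieldId: number;
  value: string;
  proposalId: number | null; // null for non-proposal origins
}

// Filter candidates by the requester's authorization before selecting "gold",
// so data from proposals the requester cannot see never enters the selection.
const selectGoldForRequester = (
  candidates: CandidateValue[],
  canSee: (fv: CandidateValue) => boolean,
  pickBest: (fvs: CandidateValue[]) => CandidateValue | undefined,
): CandidateValue | undefined => pickBest(candidates.filter(canSee));
```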

slifty commented 2 months ago

FWIW: I would expect the Organization entity retrieved from /organizations/:id to be populated with appropriately authorized data -- not to have distinct object types depending on the authorization level.

This would mean organization would ALWAYS have a fieldValues attribute, but it would only ever be populated with data that the user is authorized to view. If you are not authorized, then fieldValues would be [].

I think we should avoid having two different types of Organization entity in our API, and should instead just augment the existing /organizations/:id endpoint to support the new feature of field values on organizations.
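
In other words, the response shape stays fixed and only the contents of fieldValues vary with authorization; a sketch (property names hypothetical):

```typescript
interface Organization {
  id: number;
  name: string;
  // Always present; simply [] when the requester is not authorized
  // to view any of the underlying data.
  fieldValues: { baseFieldShortCode: string; value: string }[];
}
```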

bickelj commented 2 months ago

@slifty I think I can live with that. I am no longer the data upload guy, so I don't have to suffer the (perhaps minor, perhaps major) performance penalties of such a design. My other concerns, I think, are addressed because the id and name would be distinct from fieldValues; thus fieldValues is the volatile field, so to speak, somewhat separated out.

bickelj commented 2 months ago

@slifty @jasonaowen I added 61baf0b to merge the endpoint and some of the code. There is still some separation, some of which I think may be desirable, some not. I didn't fully get to the point of figuring out the "if authorized, include more; else, do not" logic before needing to switch projects and work on other tasks.

bickelj commented 2 months ago

Enough thinking and exploration has happened; PR #1129 is now headed in an actual implementation direction.