Implement initial provenance support

slifty commented 2 months ago

Over in #1083 we spent some time talking about provenance for this system.

Ultimately we are going to want to track "who uploaded this data" and "where did it come from".

There were some decisions left to the implementer around specific naming for these fields, as well as whether we want to create a provenance entity which combines them (as opposed to simply having two separate fields on the appropriate data table).
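A minimal sketch of the two options, just to make the choice concrete. Every name here is hypothetical (naming is one of the open decisions), and accounts and sources are assumed to be referenced by numeric id:

```typescript
// Hypothetical names throughout; nothing here comes from the PDC schema.

// Option A: two separate fields directly on the data row.
interface ValueWithInlineProvenance {
  value: string;
  uploadedByAccountId: number; // "who uploaded this data"
  sourceId: number; // "where did it come from"
}

// Option B: a provenance entity that combines the two, referenced by id.
interface Provenance {
  id: number;
  uploadedByAccountId: number;
  sourceId: number;
}

interface ValueWithProvenanceEntity {
  value: string;
  provenanceId: number;
}

// Resolve Option B's reference into the same two answers Option A stores inline.
function resolveProvenance(
  row: ValueWithProvenanceEntity,
  provenances: Provenance[]
): Provenance {
  for (const provenance of provenances) {
    if (provenance.id === row.provenanceId) {
      return provenance;
    }
  }
  throw new Error("Unknown provenance id " + row.provenanceId);
}
```

Option B costs one extra lookup per row but lets many rows share a single provenance record.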

slifty commented 1 month ago

Thinking through types of source has helped shed a bit of light on the potential implementation details, so I'm going to carry that thought process here.

First, some concrete example sources / personas (not worrying about types yet):

The executive director of a nonprofit organization updates the organization address.
A grant manager updates a field associated with a proposal.
A system administrator updates any field.
A batch processing script loads organization data from a data platform provider (e.g. Candid) and puts it into the PDC.
A batch processing script loads proposal data from a grant management system (e.g. Submittable) and puts it into the PDC.
A funder uses batch upload to add a set of proposals to the PDC.

It seems clear to me that we want the following sources to be specific:

Instances of third party data platforms (e.g. Candid should be one row, Charity Navigator another)
Instances of grant management systems (e.g. MacArthur Foundation's GMS would be one row, Ford Foundation's another)
Instances of third party raw repositories (e.g. The IRS should be one row, the Pennsylvania business database another)

What is less clear to me is various types of direct entry -- when a user is manually editing or uploading data do they become the source? Should they get a row as an individual, or should there be one generic "direct entry" source?

If a user is entering via the PDC client on behalf of the PDC (e.g. a PDC admin) should the source be "PDC Frontend"?
If a user is entering via bulk upload should the source be "Jane Doe", "Foo LLC", "PDC Bulk Upload"?
If a user is entering via API directly should the source be "User", "PDC", or "PDC API"?
If a user is acting on behalf of a non-profit (e.g. the executive director) should the source be "PDC", "Foo Foundation", or "Jane Doe"?

(I have to pause, but will return)

slifty commented 1 month ago

I think I have a more concise list of scenarios which will help us nail down the use case (which, once understood, will allow us to home in on the implementation / design).

Here are the scenarios, and I put ??? in places where I'm going to want to get clarity from @kfogel / @jim-mcgowan / @jmergy

Third party data provider cases

Candid uploads data directly (provenance: Candid + Candid integration account)
PDC system downloads / syncs with Candid (provenance: Candid + system account)

Funder cases

MacArthur Fluxx uploads data directly via API integration (provenance: Fluxx / MacArthur / MacArthur Fluxx + Fluxx integration account)
PDC system downloads / syncs with MacArthur Fluxx (provenance: Fluxx / MacArthur / MacArthur Fluxx + system account)
MacArthur user uploads data that was exported from Fluxx via PDC Bulk Upload (provenance: ??? + user account)
MacArthur user uploads data that was collected directly / outside of Fluxx via PDC bulk upload (provenance: ??? + user account)
MacArthur user edits data directly via PDC interfaces (provenance: ??? + the user account)

Changemaker cases

The Human Fund uploads data directly via API integration (provenance: ??? + Human Fund integration account)
The Human Fund uploads data via bulk upload (provenance: ??? + the user account)
The Human Fund edits data directly via PDC interfaces (provenance: ??? + the user account)

PDC admin cases

A PDC administrator edits data directly via PDC interfaces (provenance: ??? + the user account)

External client cases

The Human Fund edits data directly via non-PDC third party interface (provenance: ??? + the user account)

I think that a quick voice chat might be helpful if any of the above are available for a quick call!

kfogel commented 1 month ago

Note: @slifty and I are on a call about this right now, as per above.

slifty commented 1 month ago

We just had a great conversation about this and here's where things landed (hopefully future dan will completely understand what I'm typing right now):

For the funder cases above, we will want provenance to be "MacArthur Fluxx" and "MacArthur Agent" respectively.
We will want changemakers to have their own sources.
We will want PDC agent to be a source.
Each account will be associated with the sources they are able to "post" as (superusers can post as any source)
We should think about the relationship between Organization and Source to see whether there is any normalization redundancy there. Could go either way.
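
A rough sketch of the "accounts can post as certain sources" rule, assuming a simple list of allowed source ids on each account. The `Source` and `Account` shapes, the `isSuperuser` flag, and `canPostAs` are all hypothetical names, not anything that exists in the PDC today:

```typescript
// Hypothetical shapes; none of these names come from the PDC codebase.
interface Source {
  id: number;
  label: string;
}

interface Account {
  id: number;
  isSuperuser: boolean;
  // Ids of the sources this account is allowed to "post" as.
  allowedSourceIds: number[];
}

// Superusers can post as any source; everyone else needs an explicit association.
function canPostAs(account: Account, source: Source): boolean {
  if (account.isSuperuser) {
    return true;
  }
  return account.allowedSourceIds.indexOf(source.id) !== -1;
}

const macArthurFluxx: Source = { id: 1, label: "MacArthur Fluxx" };
const macArthurAgent: Source = { id: 2, label: "MacArthur Agent" };
const pdcAgent: Source = { id: 3, label: "PDC agent" };

// A MacArthur user who may only post as "MacArthur Agent".
const grantManager: Account = { id: 10, isSuperuser: false, allowedSourceIds: [2] };
// A PDC administrator, treated here as a superuser.
const pdcAdmin: Account = { id: 11, isSuperuser: true, allowedSourceIds: [] };
```

With these values, `canPostAs(grantManager, macArthurAgent)` is allowed, `canPostAs(grantManager, macArthurFluxx)` is not, and `pdcAdmin` can post as anything.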

(stay tuned)

slifty commented 1 month ago

Regarding the question of normalization, I am leaning towards "source" being a polymorphic mapping entity. It would have a "source_type" and "source_id" field -- depending on the type, the source would point to either a funder, organization or data_provider entity.

Pros:

When looking at a source we will benefit from the complexity of those objects (e.g. all attributes of a "funder" would be available / returned along with the source)
We are normalized for things such as the name of the source (which would exist in organization.name)

Cons:

The source entity is somewhat more complex (depending on the source type it would have either an organization, funder, or dataProvider attribute).
New types of source would need new types of entity.

The alternative would be to have a sources entity with no foreign key relationship; just a type, a name, etc.

Ultimately I think having access to richer data related to a given source is worth the tradeoff.
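
To make the polymorphic idea concrete, here is a minimal sketch. The type and function names are hypothetical, and it assumes funders and data providers would carry a `name` the same way organizations do:

```typescript
// Hypothetical shapes illustrating the polymorphic mapping.
type SourceType = "funder" | "organization" | "data_provider";

interface PolymorphicSource {
  id: number;
  sourceType: SourceType;
  // Id of a row in whichever table sourceType names; not a DB-level foreign key.
  sourceId: number;
}

// Assumption: every target entity has at least an id and a name.
interface NamedEntity {
  id: number;
  name: string;
}

interface EntityTables {
  funder: NamedEntity[];
  organization: NamedEntity[];
  data_provider: NamedEntity[];
}

// The name lives on the referenced entity, so the source itself stores no name.
function resolveSourceName(
  source: PolymorphicSource,
  tables: EntityTables
): string | null {
  const rows = tables[source.sourceType];
  for (const row of rows) {
    if (row.id === source.sourceId) {
      return row.name;
    }
  }
  // No matching row in the table that sourceType points at.
  return null;
}
```

The lookup has to branch on `sourceType` to know which table to read, which is the extra complexity listed under the cons.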

slifty commented 1 month ago

Talked to @jasonaowen about this and we landed on......

NOT that!

Basically, the big downside of the polymorphic approach is that data integrity can fall apart over time (since it isn't a DB-enforced foreign key relationship, it becomes possible for records to get deleted without cascading the deletion).

So, we're gonna just have a source table that has one column for each potential source entity, with a table rule that only one of those columns can be non-null.

Jason pointed out that we don't really need the type any longer at a DB level at that point since you can infer the type based on which field is not null.

There may still be TypeScript benefit in having the type value ultimately map to a discriminated union implementation here (which would make it clear that only one value can be populated at the TS level).
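
Something like the following is what I have in mind, as a sketch only. The column names (`funderId`, `organizationId`, `dataProviderId`) and `toSourceRef` are hypothetical, and I'm reading "only one column can be non-null" as "exactly one":

```typescript
// Hypothetical row shape: one nullable column per potential source entity.
interface SourceRow {
  id: number;
  funderId: number | null;
  organizationId: number | null;
  dataProviderId: number | null;
}

// Discriminated union: the `type` tag tells TypeScript which single id is present.
type SourceRef =
  | { type: "funder"; funderId: number }
  | { type: "organization"; organizationId: number }
  | { type: "dataProvider"; dataProviderId: number };

// Derive the type from whichever column is non-null, mirroring the table rule.
function toSourceRef(row: SourceRow): SourceRef {
  const populated = [row.funderId, row.organizationId, row.dataProviderId].filter(
    (value) => value !== null
  );
  // Assumption: exactly one column must be populated.
  if (populated.length !== 1) {
    throw new Error("Source " + row.id + " must reference exactly one entity");
  }
  if (row.funderId !== null) {
    return { type: "funder", funderId: row.funderId };
  }
  if (row.organizationId !== null) {
    return { type: "organization", organizationId: row.organizationId };
  }
  if (row.dataProviderId !== null) {
    return { type: "dataProvider", dataProviderId: row.dataProviderId };
  }
  // Not reachable after the count check; present so every code path returns or throws.
  throw new Error("Source " + row.id + " has no populated column");
}
```

Code that receives a `SourceRef` can switch on `type` and only the matching id is available in each branch, so there is no way to read two ids at once.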

slifty commented 1 month ago

Almost almost almost done thinking about this (measure twice merge once amirite)

Right now we have Organization as an entity in our system. The only kind of organization that exists today is one that submits an application.

There are other types of organization (in a literal, real-world sense) we're imagining though:

Funder (organizations that sponsor opportunities)
Data Provider (organizations that aggregate data in some way)

Before I go and create those new entity types I wanted to take one step back and reflect on whether or not these are actually distinct entities from Organization.

For now I think that Organization was created with a very specific use case in mind (being an entity that is stored in the PDC and decorated by PDC data). We may some day want to associate a funder with an organization profile, but that can be done via a relationship.

Bottom line, three entities representing three distinct functions in the system is appropriate.
