PhilanthropyDataCommons / service

A project for collecting and serving public information associated with grant applications
GNU Affero General Public License v3.0

Implement initial provenance support #1094

Open slifty opened 2 months ago

slifty commented 2 months ago

Over in #1083 we spent some time talking about provenance for this system.

Ultimately we are going to want to track "who uploaded this data" and "where did it come from".

There were some decisions left to the implementer around specific naming for these fields, as well as whether we want to create a provenance entity which combines them (as opposed to simply having two separate fields on the appropriate data table).

slifty commented 1 month ago

Thinking through types of source has helped shed a bit of light on the potential implementation details, so I'm going to carry that thought process here.

First, some concrete example sources / personas (not worrying about types yet):

  1. The executive director of a nonprofit organization updates the organization address.
  2. A grant manager updates a field associated with a proposal.
  3. A system administrator updates any field.
  4. A batch processing script loads organization data from a data platform provider (e.g. Candid) and puts it into the PDC.
  5. A batch processing script loads proposal data from a grant management system (e.g. Submittable) and puts it into the PDC.
  6. A funder uses batch upload to add a set of proposals to the PDC.

It seems clear to me that we want the following sources to be specific:

What is less clear to me are the various types of direct entry -- when a user is manually editing or uploading data, do they become the source? Should they get a row as an individual, or should there be one generic "direct entry" source?

(I have to pause, but will return)

slifty commented 1 month ago

I think I have a more concise list of scenarios which will help us nail down the use case (which, once we understand it, will allow us to home in on the implementation / design).

Here are the scenarios, and I put ??? in places where I'm going to want to get clarity from @kfogel / @jim-mcgowan / @jmergy

Third party data provider cases

Funder cases

Changemaker cases

PDC admin cases

External client cases

I think that a quick voice chat might be helpful if any of the above are available for a quick call!

kfogel commented 1 month ago

Note: @slifty and I are on a call about this right now, as per above.

slifty commented 1 month ago

We just had a great conversation about this and here's where things landed (hopefully future dan will completely understand what I'm typing right now):

  1. We will want provenance to be MacArthur Fluxx and MacArthur Agent respectively.
  2. We will want changemakers to have their own sources.
  3. We will want PDC agent to be a source.
  4. Each account will be associated with the sources they are able to "post" as (superusers can post as any source).
  5. We should think about the relationship between Organization and Source just to think about whether there is any normalization redundancy there. Could go either way.
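
Point 4 could look roughly like the sketch below. This is only an illustration: `Account`, `isSuperuser`, `allowedSourceIds`, and `canPostAs` are hypothetical names rather than anything in the PDC codebase, and a real implementation would presumably read the account-to-source mapping from the database rather than an in-memory array.

```typescript
// Hypothetical shapes for illustration only; not the PDC schema.
interface Account {
  id: number;
  isSuperuser: boolean;
  // IDs of the sources this account is allowed to "post" as.
  allowedSourceIds: readonly number[];
}

// True when the account may attribute uploaded data to the given source.
// Superusers may post as any source.
function canPostAs(account: Account, sourceId: number): boolean {
  if (account.isSuperuser) {
    return true;
  }
  return account.allowedSourceIds.some((id) => id === sourceId);
}

// Example accounts (made-up IDs).
const grantManager: Account = { id: 1, isSuperuser: false, allowedSourceIds: [10] };
const pdcAdmin: Account = { id: 2, isSuperuser: true, allowedSourceIds: [] };
```

Here `canPostAs(grantManager, 10)` is allowed, `canPostAs(grantManager, 11)` is not, and `pdcAdmin` is allowed for any source ID.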

(stay tuned)

slifty commented 1 month ago

Regarding the question of normalization, I am leaning towards "source" being a polymorphic mapping entity. It would have a "source_type" and "source_id" field -- depending on the type, the source would point to either a funder, organization or data_provider entity.

Pros:

  1. When looking at a source we will benefit from the complexity of those objects (e.g. all attributes of a "funder" would be available / returned along with the source)
  2. We are normalized for things such as the name of the source (which would exist in organization.name)

Cons:

  1. The source entity is somewhat more complex (depending on the source type it would have either an organization, funder, or dataProvider attribute).
  2. New types of source would need new types of entity.

The alternative would be to have a sources entity with no foreign key relationship; just a type, a name, etc.

Ultimately I think the tradeoff is worth it for access to richer data related to a given source.

slifty commented 1 month ago

Talked to @jasonaowen about this and we landed on......

NOT that!

Basically, the big downside of the polymorphic approach is that data integrity can fall apart over time (since it isn't a DB-enforced foreign key relationship it becomes possible for records to get deleted without cascading the deletion).
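
To make the integrity concern concrete, here is a small in-memory sketch of the polymorphic design. Everything in it is assumed for illustration: the `tables` object stands in for real database tables, and `PolymorphicSource` and `resolveSource` are hypothetical names. The point is only that nothing ties the lifetime of the referenced row to the source row.

```typescript
// Hypothetical in-memory stand-ins for tables; names are illustrative only.
type SourceType = "funder" | "organization" | "data_provider";

interface PolymorphicSource {
  id: number;
  sourceType: SourceType;
  sourceId: number; // points into whichever table sourceType names
}

interface NamedEntity {
  id: number;
  name: string;
}

const tables: Record<SourceType, NamedEntity[]> = {
  funder: [{ id: 1, name: "Example Funder" }],
  organization: [],
  data_provider: [],
};

// Look up the row a source points at, or undefined if it no longer exists.
function resolveSource(source: PolymorphicSource): NamedEntity | undefined {
  for (const row of tables[source.sourceType]) {
    if (row.id === source.sourceId) {
      return row;
    }
  }
  return undefined;
}

const funderSource: PolymorphicSource = { id: 7, sourceType: "funder", sourceId: 1 };
const resolvedBeforeDelete = resolveSource(funderSource);

// Deleting the funder does not touch the source row: nothing cascades,
// so the source now points at a record that is gone.
tables.funder = [];
const resolvedAfterDelete = resolveSource(funderSource);
```

After the delete, `funderSource` still exists but resolves to `undefined`, which is the kind of dangling reference a DB-enforced foreign key would prevent.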

So, we're gonna just have a source table that has one column for each potential source entity, with a table rule that only one column can be non-null.

Jason pointed out that we don't really need the type any longer at a DB level at that point since you can infer the type based on which field is not null.

There may still be a TypeScript benefit in having the type value ultimately map to a discriminated union implementation here (which would make it clear that only one value can be populated at the TS level).
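
A minimal sketch of how that could fit together, under a few assumptions: the column names (`funderId`, `organizationId`, `dataProviderId`), the `type` tag values, and the helper names are all hypothetical, and the rule is read here as "exactly one column is non-null". The row type mirrors the table, and the conversion infers the type from whichever column is populated.

```typescript
// Hypothetical row shape: one nullable column per potential source entity.
interface SourceRow {
  id: number;
  funderId: number | null;
  organizationId: number | null;
  dataProviderId: number | null;
}

// Discriminated union: the `type` tag tells TypeScript which ID is present.
type Source =
  | { id: number; type: "funder"; funderId: number }
  | { id: number; type: "organization"; organizationId: number }
  | { id: number; type: "dataProvider"; dataProviderId: number };

// Infer the type from whichever column is non-null, mirroring the table rule
// (assumed here to mean exactly one of the three columns is populated).
function toSource(row: SourceRow): Source {
  const populated = [row.funderId, row.organizationId, row.dataProviderId]
    .filter((value) => value !== null).length;
  if (populated !== 1) {
    throw new Error(`source ${row.id} must reference exactly one entity`);
  }
  if (row.funderId !== null) {
    return { id: row.id, type: "funder", funderId: row.funderId };
  }
  if (row.organizationId !== null) {
    return { id: row.id, type: "organization", organizationId: row.organizationId };
  }
  if (row.dataProviderId !== null) {
    return { id: row.id, type: "dataProvider", dataProviderId: row.dataProviderId };
  }
  // Unreachable given the count check above, but keeps the compiler satisfied.
  throw new Error(`source ${row.id} has no populated entity column`);
}

// The switch is exhaustive, so each branch sees only its own ID field.
function describeSource(source: Source): string {
  switch (source.type) {
    case "funder":
      return `funder ${source.funderId}`;
    case "organization":
      return `organization ${source.organizationId}`;
    case "dataProvider":
      return `data provider ${source.dataProviderId}`;
  }
}
```

With this shape, code that handles a `Source` cannot read `funderId` on an organization source without a compile error, which is the TS-level clarity mentioned above.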

slifty commented 1 month ago

Almost almost almost done thinking about this (measure twice merge once amirite)

Right now we have Organization as an entity in our system. The only kind of organization that exists today is one that submits an application.

There are other types of organization (in a literal, real-world sense) we're imagining though:

  1. Funder (organizations that sponsor opportunities)
  2. Data Provider (organizations that aggregate data in some way)

Before I go and create those new entity types I wanted to take one step back and reflect on whether or not these are actually distinct entities from Organization.


For now I think that Organization was created with a very specific use case in mind (being an entity that is stored in the PDC and decorated by PDC data). We may some day want to associate a funder with an organization profile, but that can be done via a relationship.

Bottom line, three entities representing three distinct functions in the system is appropriate.