PhilanthropyDataCommons / service

A project for collecting and serving public information associated with grant applications
GNU Affero General Public License v3.0

Develop updated database schema (ERD) for Phase II #396

Closed jasonaowen closed 1 month ago

jasonaowen commented 1 year ago

Plan out and implement the database schema we need in Phase II.

Phase I

Last year, the team came up with a schema for representing application data. The problem called for a great deal of flexibility, as application forms and their responses can vary widely in shape. Their solution - including base fields, application versions, and field values - worked out very well!

Requirements

Over the course of Phase I, we learned quite a bit about the problem we were solving. A few notable requirements began to surface and rise in priority: provenance, data validation, organization-centric (as opposed to application-centric) information, and corrections / user edits. We struggled to adapt the original schema - which was and remains very fit-to-purpose for application data - to accommodate these new requirements.

Provenance is how we surface where data came from, and covers several distinct ideas: which funder provided the data, when the data was loaded, who loaded it, and what client they used to authenticate to the API.

Data validation is important to ensure that the data we're displaying is correct, and to allow clients such as the data viewer to display it appropriately. However, the quality of the data we receive varies, and we need to avoid excluding data (or worse, partners) by imposing validation requirements that are too difficult to meet.

Organizations quickly became the primary focus in our UI. What began as application-centered pages transitioned into organization-centered pages, even while the data model remained focused on applications. That tension between model and focus can be felt in a few places in the data viewer, most notably with the data platform provider feature and subsequent requests and feedback.

Editing organization data is a complicated request: can we allow organization representatives to correct information about that organization? That request makes sense when the focus is on organizations, but makes less sense when the focus is on application data.

Partner conversations

Recently, we talked with Charity Navigator and Candid about how they've approached their data ingestion and modeling problems, and I found those conversations to be very productive. Two comments stood out to me: "be a data scavenger" and "bundle data differently depending on need".

"Be a data scavenger" is an approach where we accept all data, no matter the quality, and then apply some processing to harvest what usable data we can based on our current understanding. As that understanding evolves, we can revisit that raw data.

"Bundle data differently depending on need" is an approach where rather than trying to make a one-size-fits-all schema, we adapt our data to suit specific use cases.

Phase II

Let's think about what data entities we need to model, design a normalized database schema to hold that data, and plan a data pipeline to promote and unify data from disparate sources into a coherent, structured model.

Entities

I think our models need to include:

There may be more!

Schema

The entity-attribute-value (EAV) model is a perfect fit for application data: application forms can have any questions in any order, and responses can contain any number of any type of answer to each question. We have been able to easily add additional base fields as we gathered more data, and load all the application data we received.

While the EAV model allows for high fidelity encoding of the data partners give us, it does not allow us to easily work with that data. Everything is a string, we cannot use the database to enforce any integrity or quality constraints, and querying the data is complicated and tedious.

For the entities we decide on, we should use a more traditional schema design. We should use the data types PostgreSQL provides us, and we should have normalized tables. We should not use the EAV model for anything other than application data.
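To make the contrast concrete, here is a minimal sketch of the difference; the table and column names are illustrative, not our actual schema.

```sql
-- Illustrative only; these are not the real PDC table definitions.

-- EAV-style: every answer is stored as text, so the database cannot
-- enforce types or integrity constraints on the values themselves.
CREATE TABLE application_field_values (
  id serial PRIMARY KEY,
  application_version_id integer NOT NULL,
  base_field_id integer NOT NULL,
  value text
);

-- Conventional, normalized alternative for a first-class entity,
-- using the types and constraints PostgreSQL gives us.
CREATE TABLE organizations (
  id serial PRIMARY KEY,
  ein text UNIQUE NOT NULL,          -- employer identification number
  name text NOT NULL,
  created_at timestamptz NOT NULL DEFAULT now()
);
```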

Data pipeline

Currently, we have two pipelines, both very short:

  1. "White glove service":
    • funder emails spreadsheet of applications
    • PDC developer maps base fields
    • API
  2. Partner data entry:
    • partner has a grant management system
    • partner developer maps base fields
    • API

(Side note: the "map base fields" step involves a lot of manual work and clean-up, partly because we've been wanting to have a higher level of data quality. This is not necessarily inappropriate, but we should discuss this!)

I'd propose a richer pipeline once data is loaded into the API. This should include parsing out the entities we decide on, flagging valid and invalid data, bringing in related data from partner data platform providers and user edits, and so on. One of the steps of parsing out entities would likely be merging split base fields, or splitting merged base fields, depending on our models.

Summary

In short, we should

bickelj commented 1 year ago

We need a second API for raw data parallel to the API for processed data.

The current situation is an API for processed data. The original data are scattered and the processing is off the record.

A second API and storage solution (the storage API could be this same API, I suppose) for raw data would be a good start toward building a pipeline that connects raw data to the processed data served by a strongly typed or strongly validated API.

kfogel commented 1 year ago

See related issue #220 ("Add import tool for matching source fields to base fields") over in the data-viewer repository.

bickelj commented 1 year ago

In a recent meeting we spoke about this issue but did not come to a solid consensus on all matters.

I bring up the PDC data flow pipeline because that is the view I am most concerned with. My concern is not only personal, as the one doing data transformation, but also about the overall utility of the system. I shared a visual summary hoping to point out what I consider the next most important thing to implement: document storage with authorization. Here is an updated version, based on our conversations, highlighting the "extend to the left" and "extend to the right" possible futures. Data flows from left to right. The horizontal squiggly with the note underneath shows the scope of what is captured formally by the PDC system. My focus is "extend to the left"; this proposal focuses on "extend to the right." The place from which to extend PDC's formal representation of data is the current "PDC Database" on the right of the first diagram.

[Diagram: PDC_proposal_data_flows_2023-08-23]

The two futures are not mutually exclusive. We can both extend left (with document storage) and extend right (with data promotion to new entities). But our debate is about whether to keep or drop the Base Fields modeling and/or the promotion of data within the formal database schema found in this codebase.

I can't separate the discussion here from extending to the left, because if we are to be data scavengers we need to keep raw data within the PDC. In the first diagram, in Phase I, some data were necessarily lost during transformations on the way to the PDC database. And that's OK, because we want the most useful data in the most useful form in the PDC database. At the same time, we want to lower the barrier to letting people put data into the PDC, and we want to be "data scavengers." Just because some data didn't make it into the database on our first pass doesn't mean it shouldn't be possible to extract it in a future pass. The simplest way to allow for future extraction is to keep the original data.

It seems to me we agreed new entities should be added. The debate seems to be whether to (a) keep Base Fields only for semi-processed data plus a promotion step for good data to new strongly-typed-at-db-level entities or (b) use Base Fields in those new entities without a promotion step, the promotion steps being part of the left side of the pipeline.

My objection to promotion of fine-grained data within the schema ("extend to the right") is the overhead associated with two sets of tables representing the same data: the "raw" or "semi-processed" subset of data from the original documents that made it into the schema in the first place. The overhead is in additional API calls, queries, GUI views, etc., all for the purpose of promoting from the existing entity to the new. Given that this is already processed data, meaning data broken down from document (e.g. .csv, .xlsx, .pdf) form into individual fields, I would rather focus our efforts on retaining the original raw documents and building new GUI views into those, because we'll have to do that anyway to allow folks to semi-process semi-independently. In other words, I'd like to shore up the database entities and validation around proposals (whether using Base Fields or not) such that promotion into the database is the last step of the pipeline.

If we "get it wrong" with regard to needed data in a new entity, do we want to require a database migration to represent the new field or do we want to be able to add fields without a database migration? Are there gains on the GUI side of things when we have the database validate fields (I would imagine that's invisible to the GUI)?

In general, I favor the use of the database types, but we are not deciding in the abstract; we have a concrete system that already exists. Given that the system was designed with the future in mind, and that we have a working system already, I think we have to consider the (future) cost of what I see as a larger change with higher overhead versus the (future) cost of embracing the current design.

There will be application-level type validation of data required in any future we choose. I don't think we want to extend Postgres to have the database validate a phone number, for example. Or is that part of this proposal too, custom check constraints on database fields? If we are talking about standard SQL types, we will need application-level type validation whether we gain Postgres typing or not, so I lean toward keeping the original Base Fields model at this time and adding application-level types to them (type names stored in the Base Fields table, as discussed elsewhere).
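To make the trade-off concrete, the two flavors of validation being weighed might look roughly like this sketch; the table, column, and constraint names are hypothetical and the phone regex is only a stand-in:

```sql
-- Option (a), illustrative only: push validation into PostgreSQL with a
-- custom CHECK constraint on a typed column of a hypothetical table.
ALTER TABLE organizations
  ADD COLUMN phone text,
  ADD CONSTRAINT organizations_phone_format
    CHECK (phone IS NULL OR phone ~ '^\+?[0-9 ()-]{7,20}$');

-- Option (b), illustrative only: keep Base Fields and record an
-- application-level type name on each base field, validating in
-- application code rather than in the database.
ALTER TABLE base_fields
  ADD COLUMN data_type text NOT NULL DEFAULT 'string';
```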

I don't think we need to wait to "extend to the left" either. We will have many more futures to choose from once we have file storage integrated with authorization. If extending to the left is non-controversial, let's start doing it because it is a big enough project regardless of how we resolve the Base Fields discussion.

slifty commented 12 months ago

I'll give this a shot...

My underlying mental model is that we would need the following entities related to access and account management:

  1. User
  2. Organization
  3. Data Provider (Candid, Charity Navigator, the PDC front-end, GMSs (grant management systems))

In terms of the actual PDC data we would have:

  1. Base Field
  2. Opportunity / Application Form / Application
  3. Field values (application form field values, direct entry field values, and "processed field values")

Organization vs Proposal

The entity associated with a base field would be defined at the base field level -- e.g. some base fields represent organization-level data, while others represent proposal-level data. This would be specified via a new column on the base_fields table, something like scope, containing an enum of organization | proposal.

Accessing data related to a given use case (e.g. I want to view organization PDC data) would translate to a query (which we might choose to store as a view) against the field values table, with a condition of only selecting field values whose base field scope is organization.
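As a rough sketch of that idea in PostgreSQL (the enum, column, view, and join names below are illustrative guesses, not the actual schema):

```sql
-- Illustrative only: a scope enum on base fields, plus a stored view
-- answering "I want to view organization PDC data".
CREATE TYPE base_field_scope AS ENUM ('organization', 'proposal');

ALTER TABLE base_fields
  ADD COLUMN scope base_field_scope NOT NULL DEFAULT 'proposal';

CREATE VIEW organization_field_values AS
SELECT afv.*
FROM application_form_field_values afv
JOIN application_form_fields aff ON aff.id = afv.application_form_field_id
JOIN base_fields bf ON bf.id = aff.base_field_id
WHERE bf.scope = 'organization';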

Atomic vs Compound fields

We would define base fields as compound, rather than atomic fields (atomic meaning "targeting the smallest possible form"). This is to say we would have a single base field to store an entire address, rather than separate base fields for street, city, state, etc.

Some example compound fields:

Field Mapping

Application form fields would be mapped to base fields not just in terms of base field ID but also, where relevant, in terms of the component of the base field. For instance an application form might collect "organization street name" which would map to "organizationAddress.streetName" or it might collect "organization address" which would map to "organizationAddress".

These mappings would be defined as part of the "application form field" definition.
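A sketch of how such a mapping could be recorded, with illustrative column names:

```sql
-- Illustrative only: an application form field maps to a base field and,
-- optionally, to a named component of a compound base field.
ALTER TABLE application_form_fields
  ADD COLUMN base_field_component text;

-- e.g. "organization street name" -> base field organizationAddress,
--      base_field_component = 'streetName'
--      "organization address"     -> base field organizationAddress,
--      base_field_component = NULL (the whole compound value)
```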

Data ingestion, validation, and decoration

The application field values table would continue to exist as it does today: storing the raw, unformatted, and unvalidated string data associated with a given applicant's response to an application form field.

We would also have a direct_entry_field_values table to similarly handle raw data ingested from 3rd party data providers. This would have a mapping structure similar to that of application field values (e.g. it maps raw data to a base field + field component).

We would have a new table to represent the validated, consolidated "PDC Standard" form of the proposal data: processed_field_values which would store the fully mapped, validated, and decorated form of our data as a jsonb column. The jsonb column would be validated by the database (which seems possible) according to the type of the base field it links to.

processed field values would be associated with either (1) a proposal version, and therefore an organization, or (2) an organization directly.
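Putting those pieces together, a rough sketch of what processed_field_values might look like; the column names, the exclusivity check, and the minimal jsonb check are illustrative, and real validation would depend on the linked base field's type:

```sql
-- Illustrative only.
CREATE TABLE processed_field_values (
  id serial PRIMARY KEY,
  base_field_id integer NOT NULL REFERENCES base_fields (id),
  proposal_version_id integer,   -- set when tied to a proposal version
  organization_id integer,       -- set when tied to an organization directly
  value jsonb NOT NULL,          -- fully mapped, validated, decorated data
  created_at timestamptz NOT NULL DEFAULT now(),
  -- associated with either a proposal version or an organization, never both
  CHECK (num_nonnulls(proposal_version_id, organization_id) = 1),
  -- minimal structural check; per-base-field validation would go further
  CHECK (jsonb_typeof(value) IN ('object', 'array', 'string', 'number', 'boolean'))
);
```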


All of this would allow us to dynamically reap the benefits of validation and structure for ALL current and future base fields. We would still be able to rely on the database to ensure validation.

We would have a direct map of how a processed field value came to be (e.g. we would be able to know which application form field's value mapped to a given sub-component of a compound field)

We would be able to easily query in terms of organization vs proposal fields.

jasonaowen commented 12 months ago

Now that we have user stories done (or at least, done enough to move forward), we should have enough context to have this discussion again.

I believe the main goal is to unify three distinct sources of data about organizations: proposals, data platform providers, and user corrections. Proposals use our base fields, currently do not have any kind of type checking, and a single "piece" of data may be divided across multiple base field values. Data platform providers provide data about organizations in a structured & documented form, and while more valid, the data may still contain errors. User corrections will be to organization data - not to proposal data, at least in this phase - and are both additive and subtractive; that is, they may include suppressing (not replacing) incorrect data from another source.

This unified organization data will be shown in partial form to unauthenticated/unauthorized users, and in full form to authorized users. (I believe there are currently only two levels planned, partial and full.)

We will need to come up with algorithms for unifying this data; it seems clear to me that user corrections have the highest priority, but what do we do when there is a conflict between the data from a proposal and data from an infomediary? What do we do when there is a conflict between two infomediaries?

Some organization data is structured, and some data validation cannot reliably be done in real time. An example of both is geocoding: given an address, look up its latitude and longitude. It is structured because latitude and longitude are closely related and should not be separated, and it cannot be done as part of the POST request because it relies on network calls to outside entities, so for reliability we must do it as an async job.
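As a purely illustrative example of keeping such structured values together, an asynchronous geocoding job could write a single jsonb object rather than two disconnected strings; the table and column names here are hypothetical:

```sql
-- Illustrative only: latitude and longitude stored together as one value,
-- written by an async geocoding job after the initial POST has completed.
ALTER TABLE organizations
  ADD COLUMN geocoded_location jsonb
    CHECK (
      geocoded_location IS NULL
      OR (geocoded_location ? 'latitude' AND geocoded_location ? 'longitude')
    );
```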

@bickelj frames this well as "extending to the left vs extending to the right". Given that our existing system is working, I believe it serves as a good foundation for both, and we needn't block one on the other: they can meet in the middle with base field values. I don't believe we have a specific issue for the entire workflow, but are tracking it in the data entry project. However, an open question related to that is how we select our base fields: do we prefer compound fields (such as "address"), prefer separate fields (such as "city", "state", etc), or allow both?

Hopefully this is enough context to discuss today!

slifty commented 1 month ago

This issue served a purpose but ultimately we can close it -- Phase 2 is over, and the changes we made to the schema were captured in a variety of other issues as we developed features.