PhilanthropyDataCommons / service

A project for collecting and serving public information associated with grant applications
GNU Affero General Public License v3.0
8 stars 2 forks source link

Should PDC store atomic fields or composite fields for names, addresses, etc.? #104

Closed bickelj closed 12 months ago

bickelj commented 1 year ago

From https://github.com/PhilanthropyDataCommons/service/pull/103#discussion_r1001854443

System A might use two fields for family name and given name. System B might use one field for full name. System A might use five fields for house number, street, city, state, and zip code. System B might use one field for address.

If PDC uses atomic fields (e.g. System A style) then composite field callers of PDC (e.g. System B) are burdened with splitting apart their fields. If PDC uses composite fields (e.g. System B style) then atomic field callers of PDC (e.g. System A) are burdened with joining their fields.

slifty commented 1 year ago

Just a note that in either choice someone will need to parse atoms (e.g. if we store as composite fields, a system that uses atomic will need to parse our non-atomic responses.)

I'd propose that we store atoms BUT that we plan to provide a utility API that is capable of breaking certain types of composite fields into atoms.

bickelj commented 1 year ago

@slifty Right you are. My comment above was when a system calls to post something, but when a system calls to get something, the converse is true too.

bickelj commented 1 year ago

This issue raises a more fundamental issue which is the process by which we establish the initial set of canonical fields. Regardless of which direction we go it has to somehow get integrated into that process. Do we keep the bash scripts alive until we have several more spreadsheets and then ditch the sheets and go with some manually curated JSON after that? Do we ditch the bash scripts now and replace them with more sophisticated software? Do we ditch the bash scripts now and go with fully manually generated JSON? These are some options. I am leaning toward the first (defer manual curating until after we have more spreadsheets).

bickelj commented 1 year ago

From yesterday's team meeting, it sounded like everyone was OK with the approach in #164. That approach is to have the initial set of canonical fields seeded by a SQL migration script. The migration script gets generated from a manually-curated spreadsheet. At the moment, the conflict raised in this issue is not addressed, but if we are OK with #164 then we have a path forward for setting the expectations in the manually-curated spreadsheet. As for multiplexing the fields to and from proposals, that's going to have to be part of the software that does the ingest.

bickelj commented 1 year ago

In a meeting today we discussed distinguishing people, orgs, and proposals. At the fundamental level currently all fields need to have a canonical field underneath them. But perhaps we might add another column on canonical field to distinguish whether this is a "person" field vs a "proposal" field. Dan suggested that we should not drive behavior about fields based on some prefix or substring, but rather as a separate property or attribute on that entity (I agree in general). We would prefer not to have a UI, for example, not searching through labels for the word "Organization ...". At the moment, we are proposal-centric in our data model, tracking all data through the lens of proposals, but there might be use cases to model and track organizations and people separately as such, in which case we might need to step back and revisit the data model.

Regarding addresses, I think we landed on having fields be as fine-grained as the finest-grained data that funder orgs have provided so far, e.g. Street Address 1, Street Address 2, City, etc.

Regarding names, a couple of funder orgs use first and last name. Jim advocates for splitting first and last name. We did not reach a firm conclusion here though.

Regarding internal field names, the goal with camelCase names is that these are used by JavaScript and the convention in JavaScript is camelCase attribute names but since we control these names it might be OK to change them however.

jasonaowen commented 1 year ago

Now that we have some real data, and some ability to view that data, @jim-mcgowan almost immediately observed a drawback to breaking apart this data: addresses can be interrupted by other form fields! For example, one proposal has the following fields in this order:

We can fix this in the data, but in general I would advocate for a data model that cannot break in this way.

Specifically with addresses, I feel pretty strongly that we should treat them as atomic a single field, and maybe even run them through an address normalizer. This starts getting into some data validation concerns, and I don't know if we've settled on a stance for that.

For names, the article Falsehoods Programmers Believe About Names is required reading here, and my takeaway is to do as little modification of names as possible. I would prefer a single name field, both for the concerns it lists as well as for the risk of splitting up the fields in the display order, but recognize that the source data might often have already split names, and so joining them would be performing an operation on them, which as I just said is risky.

jasonaowen commented 1 year ago

I'll also note that the initial set of canonical fields is not consistent on names: https://github.com/PhilanthropyDataCommons/service/blob/50b31ab1c6a6056bf05e99a22cad68b1ea7de891/src/database/seeds/0001-insert-canonical_fields.sql#L128 vs https://github.com/PhilanthropyDataCommons/service/blob/50b31ab1c6a6056bf05e99a22cad68b1ea7de891/src/database/seeds/0001-insert-canonical_fields.sql#L132-L136 vs (I think) https://github.com/PhilanthropyDataCommons/service/blob/50b31ab1c6a6056bf05e99a22cad68b1ea7de891/src/database/seeds/0001-insert-canonical_fields.sql#L55-L57

bickelj commented 1 year ago

one proposal has the following fields in this order

I would sort by canonical field order if presenting canonical field names or sort by position of the proposal if presenting the application form field names. That raises the question of how to sort by canonical field names given that we do not have a position in canonical_fields, so for now I would go by id or created_date, only because it is there.

This starts getting into some data validation concerns, and I don't know if we've settled on a stance for that.

Until we see (or just now saw) some actual data in the PDC, we don't have much of a basis for doing the data validation or cleaning. If we are too strict, we will exclude a large part of the data or make it too difficult to see the data in the viewer. Bear in mind these data were posted rather quickly without validation of the data other than trying to make sure they were text. But now that we have data, we can start making clearer choices, I think.

bickelj commented 1 year ago

I would prefer a single name field

Me too.

slifty commented 1 year ago

I'm on board for single name!

jasonaowen commented 1 year ago

We discussed this in today's meeting.

There are a few use cases we need to consider and plan for, primarily around consumers of this data. First, users will want to be able to search by location - eg "find proposals from organizations headquartered in Florida". Second, there will be partners who want to work with the data for whom "use an address parsing or geocoding library" is a blocking requirement; they want to have a CSV with a "state / province" column, or they want that presented directly in the API. I will open separate issues for these use cases.

After that discussion, my stance is that the data model should have a single, normalized address field - not split fields. In general, I think the API and/or our data export scripts should enrich outgoing data with the fields our consumers expect; in the case of address fields, we should include both the single field and the individual components of an address, including street address, city, state or province, and country.

I will note that the single field approach could make naive search by state a little worse: without specific address support, searching for proposals with an organization_address field having a value like Florida is more likely to pick up irrelevant results like 123 Florida St, Elsewhere, MN. I believe that's a worthwhile tradeoff for having a stronger data model that is less vulnerable to inconsistency and the ordering problems described earlier. It's also worth pointing out that accepting raw data without processing or validating it will also degrade search, as some proposals will have organization_state_province: FL and others will have organization_state_province: Florida and others will have organization_street_address_1: 123 Full St, Miami, FL; we've already seen some organization_street_address_1 fields with such full address values.

bickelj commented 1 year ago

Here is a possible solution for combining two funder (application form) fields into one base field: the label can optionally use a template literal syntax (not necessarily evaluated by JS or TS directly) to combine multiple fields.

For a one-to-one example, the funder field "Proposal Name" can map to base field "Proposal Name" just like that: "Proposal Name", while a person's name that is broken apart in the funder system such as "First Name" and "Last Name" can map to "Proposal Submitter Name" like "`${First Name} ${Last Name}`".

@slifty @jasonaowen What do you think?

jasonaowen commented 1 year ago

@bickelj can you say more? I'm not sure I follow the problem you're trying to solve, or how/where exactly the solution you propose would be implemented.

Are you talking about the ETL scripts you're writing, maintaining, and using to get partner data out of spreadsheets and into the PDC? Where would the template label go?

jim-mcgowan commented 1 year ago

@jasonaowen Robert Wood Johnson has First Name, Middle Name and Last Name fields in their application as well as Prefix and Suffix. Similar to addresses, how do we want to accept and store name data?

jasonaowen commented 1 year ago

@jim-mcgowan we previously aligned on a single name field that combined all of those into a single string. We have not fully implemented that decision - getting back to our earlier conversation about needing to figure out how we make changes to base fields - so there are both split and unified name fields; see my earlier comment for some examples. For unified name base fields, it is the responsibility of the API client (@bickelj's ETL scripts here, but in the future, external partners' scripts) to transform split name fields as necessary.

Addresses are a bit more complicated and my thinking on them has been influenced by our discussion on variants (#333), but I haven't had a chance to fully document a vision for them.

jim-mcgowan commented 1 year ago

@jasonaowen That's fine with me . . . I was mostly trying to explain the problem @bickelj was trying to solve.

bickelj commented 1 year ago

@jasonaowen Every Proposal Field Value is in response to one Application Form Field. Every Application Form Field relates to one Base Field and has one label (presumably the text from the actual/original application form). Where multiple actual/original application form fields map to a single Base Field, the question arises as to how to write that into the label. The question was indeed regarding PDC's data ingest software that is still under my care, so your response suggests that any funder or data platform provider could handle this situation however they like, and write whatever they want to write into the Application Form Field label, which is true. But I thought I would ask y'all in advance of you seeing, e.g. on the front-end, funky-looking values in the Application Form Field labels. At the moment perhaps it is moot because they aren't displayed anywhere to my knowledge. The only downside to getting it wrong is yet more thrash on this data pipeline.

bickelj commented 1 year ago

@jasonaowen More important than "how to write that into the label" is "how to actually perform the mapping." In the current process, the label of the Application Form Field is used so the questions are synonymous. I suppose it doesn't have to be that way but it works.

jasonaowen commented 1 year ago

Where multiple actual/original application form fields map to a single Base Field, the question arises as to how to write that into the label.

Aha! So is the goal to be able to accurately reproduce application forms, or to coherently and uniformly represent the information collected by such forms? My understanding and stance has been the latter, which leads to the difficulty and judgement calls you're talking about, including synthesizing application form field labels not directly present in the source material.

Here is a possible solution for combining two funder (application form) fields into one base field: the label can optionally use a template literal syntax (not necessarily evaluated by JS or TS directly) to combine multiple fields.

... a person's name that is broken apart in the funder system such as "First Name" and "Last Name" can map to "Proposal Submitter Name" like "`${First Name} ${Last Name}`".

any funder or data platform provider could handle this situation however they like, and write whatever they want to write into the Application Form Field label

Your original proposal, then, was to make the application form field's label be something like "`${First Name} ${Last Name}`"?

ask y'all in advance of you seeing, e.g. on the front-end, funky-looking values in the Application Form Field labels. At the moment perhaps it is moot because they aren't displayed anywhere to my knowledge.

+1 to asking, sooner is better than later! I think we ought to plan for application form field labels to be user facing, even though we haven't built that functionality out yet. To that end, I don't think putting any kind of templating system there would be a good fit. If you're having trouble coming up with good labels for these cases, maybe share a few examples for us to go through?

jasonaowen commented 1 year ago

More important than "how to write that into the label" is "how to actually perform the mapping." In the current process, the label of the Application Form Field is used so the questions are synonymous. I suppose it doesn't have to be that way but it works.

It sounds like the current process isn't really set up for handling translation between unified and split fields (in either direction). I don't have an immediate suggestion on how to improve that, unfortunately; I don't feel like I have enough context. Disentangling the label from the mapping seems like a useful step, and having some kind of templating seems like a promising solution for mapping split fields in the source data to unified fields in the PDC.

bickelj commented 1 year ago

@jasonaowen On the one hand, I'm glad I asked for feedback earlier than later. On the other hand, it's easier to notice a thing we don't like or don't prefer than it is to suggest a constructive way to achieve the goal. Perhaps this is more a question for @slifty: what suggestion do you have for the multiple field mappings?

bickelj commented 1 year ago

If you're having trouble coming up with good labels for these cases, maybe share a few examples for us to go through?

My understanding is the labels for an Application Form Field were intended to be the original/funder labels. @slifty ?

slifty commented 1 year ago

@bickelj I am sorry I missed this earlier. You're totally right and that design absolutely was spacing out on the fact that fields are going to need to be merged in a many => one situation. (this is not addressing your question yet, as I have to run off for the day, but will return tomorrow and try to have more useful words to contribute by then)

bickelj commented 1 year ago

@slifty No problem, it might be a deeper change than I anticipated. For example, if we want to model it at the database level, we probably need another table to capture the range of one-to-many or many-to-one (application form fields to base fields). But even if there is a table capturing the nature of the one-to-many or many-to-one relationship, that by itself doesn't necessarily solve the concatenation or split strategy. More columns perhaps are needed in that new table to capture the delimiters needed to concatenate or split.

slifty commented 1 year ago

So for this phase what I would suggest is that we actually might be best off just creating a label that is most meaningful (rather than the most lossless).

To that end, I would suggest just making the application form field label be something like "Full Name" -- the purpose of these labels is to create a situation where the viewer of the data can get a better sense of what the application form values mean in the context of that application form (as opposed to the base field labels which gives a sense of what they mean in the context of the PDC). I think it's OK for us to manually translate that meaning from the original literal labels into the new merged field (this field mapping is us working on behalf of the GMS, pretending we're them -- and that kind of concept mapping is what we would be asking them to do via a field mapping interface).

Does that make sense / seem reasonable?

slifty commented 12 months ago

This was a great conversation, and will also feed into #396

I'm going to close it in preference for that issue. This also might be touched on as part of #561