PhilanthropyDataCommons / service

A project for collecting and serving public information associated with grant applications
GNU Affero General Public License v3.0
8 stars 2 forks source link

Consider a non-hardcoded mechanism for identifying specialized base fields #922

Open slifty opened 2 months ago

slifty commented 2 months ago

As part of #902 we are going to have our bulk uploader treat a few base fields as "special" -- organization name and EIN, specifically.

I'd love to avoid hard coding special short codes that may or may not exist, but the initial implementation plan will do that. The "right" answer may involve creating a new "configuration" table in the database, which would allow an administrator to select special base fields.

This issue is intended to capture that line of exploration. It may turn out we don't want to do this!

bickelj commented 2 months ago

@slifty I support the goal of a non-magic, non-hardcoded mechanism to identify certain data fields. Somewhere there has to be a link, though, right? Somebody has to mark data in such a way that the service can identify it.

One extreme is the caller could do whatever they want and the service could use heuristics to magically guess which field is what. The other extreme is the service requires the caller to mark fields explicitly, e.g. with a particular string. Using a short code is this latter extreme. The benefit is simplicity and clarity. "You need to name it X for the system to know it is X." The service can say "warning: you didn't supply X" or "warning: X did not match expected regex" or "warning: Y is an unknown field so it was not bound to anything."

I need to try the bulk uploader because I assumed that not only a few base fields would work this way but all base fields. The PDC team constantly curates the base fields, the uploaders map base fields to their fields via application forms, and proposals are uploaded to fit the application forms.

slifty commented 2 months ago

Somewhere there has to be a link, though, right? Somebody has to mark data in such a way that the service can identify it.

So the suggestion of a config table would be that the link is ALSO determined by data -- which feels more explicit to me (the sysadmin, not the code, determines what base field in their instance should be used for this purpose). One benefit of that is it's dynamic which means we can have an endpoint that communicates the configuration, and the front end can render the setting dynamically to the user to remove the sense of magic.

To be clear there are CERTAIN users to which this is all the same thing (bulk uploaders are going to rely on UX to prevent magic), but I think that having a data driven config approach would make the system more explicit for technical users, and for the system administrator.

The messaging you described would still be happening, and would still be able to be in terms of short code; it's just that the short code that gets messaged is in control of a person rather than the code base.

Also I do think that having the link determined by the functional definition of the base field is less magical than a string match. "We have defined this base field, using documented enums, to be an organization base field that contains an EIN" being used by the importer as the organization EIN feels way less magical to me than "this base field name happens to be the string organizationEin".

bickelj commented 2 months ago

@slifty What if we convert the base field name to be a documented enum or treat it that way conceptually? Isn't the new documented enum taking the place of the old base field name? It feels like it's both pushing the problem back one step (we don't have a hardcoded list of names, we have a hardcoded enum) plus adding some indirection. The net effect is more complexity and difficulty.

I see one benefit you mentioned: system administrators can vary their base fields per instance. But isn't one of the products of the PDC a flat, universal list of base fields? We don't want everybody using different base fields, I thought.

slifty commented 2 months ago

Isn't the new documented enum taking the place of the old base field name

it is not! The field name / shot code cannot be an enum because the entire point is that it is dynamic and user populated as a feature, allowing the list of base fields to change and adapt over time. Those changes are not universal in the sense of "all instances of the PDC software must adopt them" (though as we talked about in #756 we may want to allow instances to "subscribe" to another instance's base fields).

The enums around base field scope and type are separate from short code completely. They're explicit, semantically meaningful attributes of a base field (is this base field organizational? what kind of data does it represent?). They're being created regardless (one already exists, and the other is going to soon), because that semantic meaning has implications for things like UX, validation, and API responses.

The purpose of a shortCode is to make it so third parties can communicate which base field they want to interact with in a more meaningful way than a sequential ID.

I guess to use an analogy: short codes are variable names, field types are variable types. In programming languages it is meaningful to have a defined list of types, but needing to name your variable in a way that signals the type is not a good requirement (and having a compiler that treats variables differently based on their name is even worse)