TechAndCheck / tech-and-check-alerts

Daily tip sheet for fact checkers
MIT License

Should we be storing normalized known speaker names? #343

Open reefdog opened 4 years ago

reefdog commented 4 years ago

While reviewing PR #342, these lines made me 🤔. But since it's not the fault of that PR, I'm creating this issue for discussion.

A speaker is just a person, but we have two "speaker" concepts: a claim's speaker (currently not a distinct model, just speaker-prefixed fields on Claim) and KnownSpeaker.

The former stores the speaker's name in full, with no first/last split. The latter stores first and last names in separate fields. This reflects the different data source for each: web scraping and Google Sheet syncing, respectively.

That means that to compare the two, we have to first coerce the known speaker fields to look like what we expect the claim speaker's full name will look like, and then compare. There could/should be a JavaScript utility function we use to centralize this coercion.
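To make the idea concrete, here is a minimal sketch of what such a utility could look like. Everything in it is an assumption for illustration: the `firstName`/`lastName` field names, the helper names, and especially the normalization rules (trim, collapse whitespace, lowercase), which are placeholders for whatever we actually do to claim speaker names.

```javascript
// Hypothetical helpers; none of these names exist in the codebase.

// Placeholder normalization: trim, collapse internal whitespace, lowercase.
// The real rules should mirror however claim speaker names are normalized.
const normalizeName = (name) =>
  String(name || '').trim().replace(/\s+/g, ' ').toLowerCase();

// Compose a known speaker's separate name fields into one comparable string.
// Assumes the fields are called `firstName` and `lastName`.
const getKnownSpeakerFullName = (knownSpeaker) =>
  normalizeName(`${knownSpeaker.firstName || ''} ${knownSpeaker.lastName || ''}`);

// Compare a claim's full speaker name against a known speaker.
const isSameSpeaker = (claimSpeakerName, knownSpeaker) =>
  normalizeName(claimSpeakerName) === getKnownSpeakerFullName(knownSpeaker);
```

With this in place, JS callers would only ever go through `isSameSpeaker`, so the composition rule lives in exactly one spot.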

What triggered me about the PR, though, is we're writing raw SQL, and thus couldn't use a JS utility function anyway; instead, we have to bake in an assumption about how to combine known speaker name fields to match a normalized claim speaker name. This strikes me as a little dangerous.

Now, are we likely ever going to change how names are composed? No. But should we have to remember that we're doing this composition in raw SQL in a newsletter file? I don't think so.

Short of being able to replace our raw SQL with a Sequelize query (which we determined the query is too complex for), should KnownSpeaker just combine first/last names into a single full name field during sync? Are we gaining anything by keeping them separate, given every other part of our system will be dealing with full names?
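As a rough sketch of the sync-time option: the mapping step that turns a sheet row into a KnownSpeaker record could compute the full name once, so raw SQL can compare against a stored column instead of recomposing it. The row shape, the `fullName` field, and the function name below are all hypothetical, not what the sync code currently does.

```javascript
// Hypothetical mapping step run once per Google Sheet row during sync.
// Assumes each row exposes `firstName` and `lastName` strings.
const toKnownSpeakerRecord = (row) => {
  const firstName = (row.firstName || '').trim();
  const lastName = (row.lastName || '').trim();
  return {
    firstName,
    lastName,
    // Compose the name once here; skip empty parts so a single-name
    // speaker does not end up with a stray leading or trailing space.
    fullName: [firstName, lastName].filter(Boolean).join(' '),
  };
};
```

This version keeps the separate fields alongside `fullName`; if nothing else needs them, they could be dropped entirely.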