Howdju / howdju

Monorepo for the Howdju crowdsourced fact checking and summarization platform

https://www.howdju.com

GNU Affero General Public License v3.0

Improve text normalization for non-ASCII #442

Open carlgieringer opened 1 year ago

carlgieringer commented 1 year ago

We should not deburr in our normalization, as it can create ambiguity in some languages or with emoji.
We should split toSlug and normalizeText further since they have different purposes: toSlug can erase characters because it's just for display in a link, so deburring is okay, whereas normalizeText needs a better way to identify unique strings.
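
To illustrate the split, here is a minimal sketch, not the current Howdju implementation. The names `normalizeTextSketch` and `toSlugSketch` are hypothetical, and choosing Unicode NFKC plus lowercasing and whitespace collapsing as the non-deburring normalization is an assumption.

```typescript
// Hypothetical sketch: identity-oriented normalization that does NOT deburr.
// Assumption: NFKC + lowercase + whitespace collapsing is enough for uniqueness.
export function normalizeTextSketch(text: string): string {
  return text
    .normalize("NFKC") // unify composed/decomposed forms, keep accents and emoji
    .toLowerCase()
    .replace(/\s+/g, " ")
    .trim();
}

// Hypothetical sketch: display-only slug, where losing characters is acceptable.
export function toSlugSketch(text: string): string {
  return text
    .normalize("NFKD") // split accented letters into base letter + combining mark
    .replace(/[\u0300-\u036f]/g, "") // drop combining diacritical marks (deburr)
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // everything else (including emoji) becomes "-"
    .replace(/^-+|-+$/g, "");
}
```

With this sketch, "Résumé" and "Resume" produce the same slug but different normalized text, which is the distinction the uniqueness check needs.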

To achieve this we'd probably have to do something like the following:

Add a new temporary column for holding the old algorithm.
Update the code to dual read and write (including uniqueness checks) from the old and new column. New algo to old column (so that we can keep it once we are done) and old algo to new column.
Backfill the new algo to the old column and the old algo to the new column.
Update the code to refer only to the old column (holding the new algo).
Delete the temporary column.
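
The dual-write and backfill steps might look roughly like the following in-memory sketch. Everything here is an assumption made for illustration: the field names `normalText` (the existing column) and `normalTextOld` (the temporary column), the stand-in algorithms, and the lookup rule are all hypothetical.

```typescript
// Hypothetical stand-ins for the two algorithms (not the real Howdju code).
function oldNormalize(text: string): string {
  return text
    .normalize("NFKD")
    .replace(/[\u0300-\u036f]/g, "") // deburr
    .toLowerCase()
    .replace(/\s+/g, " ")
    .trim();
}

function newNormalize(text: string): string {
  return text.normalize("NFKC").toLowerCase().replace(/\s+/g, " ").trim();
}

interface Row {
  text: string;
  normalText: string; // existing column; ends up holding the new algo
  normalTextOld: string | null; // temporary column; holds the old algo
}

// Dual write (and backfill): new algo to the existing column, old algo to the temporary one.
function toDualWriteRow(text: string): Row {
  return { text, normalText: newNormalize(text), normalTextOld: oldNormalize(text) };
}

function backfill(row: Row): Row {
  return toDualWriteRow(row.text);
}

// Dual read: a row counts as equivalent if it matches under either algorithm,
// whether or not it has been backfilled yet.
function findEquivalent(rows: Row[], text: string): Row | undefined {
  const target = newNormalize(text);
  const targetOld = oldNormalize(text);
  return rows.find(
    (r) =>
      r.normalText === target ||
      r.normalText === targetOld ||
      r.normalTextOld === targetOld
  );
}
```

Once every row is backfilled and reads use only `normalText`, the temporary `normalTextOld` field is no longer needed.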

carlgieringer commented 1 year ago

We could probably have different normalizations for quotes (which need to be more literal and so need to account for, say, emoji) and propositions (which tend to be more generic and probably should not differ based on, say, emoji).
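
As a rough sketch of what that could mean, with hypothetical function names and with the specific rules (NFC for quotes, stripping `Extended_Pictographic` characters for propositions) being assumptions rather than a decided design:

```typescript
// Assumption: pictographic characters, the emoji variation selector (U+FE0F),
// and the zero-width joiner (U+200D) are what "emoji" means for propositions.
const EMOJI_LIKE = new RegExp("[\\p{Extended_Pictographic}\\uFE0F\\u200D]", "gu");

// Quotes: stay literal. Only unify Unicode forms and whitespace; keep case and emoji.
function normalizeQuoteSketch(text: string): string {
  return text.normalize("NFC").replace(/\s+/g, " ").trim();
}

// Propositions: more generic. Lowercase and ignore emoji entirely.
function normalizePropositionSketch(text: string): string {
  return text
    .normalize("NFKC")
    .replace(EMOJI_LIKE, "")
    .toLowerCase()
    .replace(/\s+/g, " ")
    .trim();
}
```

Under these assumptions, two propositions that differ only by a trailing emoji would collide, while two quotes that differ the same way would remain distinct.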