biglocalnews / warn-transformer

Consolidate, enrich and republish the data gathered by warn-scraper
https://warn-transformer.readthedocs.io
Apache License 2.0

Date simplification technique is too simple #214

Open stucka opened 9 months ago

stucka commented 9 months ago

Several transformers try to grab the first hunk of text before a space to determine a date. That's not a great approach if that first hunk of text is too small to be a valid date and also too small to be a good quasi-unique identifier.

In New York, for example, there's an American Airlines entry for "2 /12 /2021" that comes in as simply "2", which could conflict with other bad entries.

If the first hunk is too short to be a date (the shortest plausible form, e.g., 1/1/23, is six characters), the whole string should probably be passed for a match.

The current code:

value = value.split()[0].replace(",", "").replace(";", "")

could be something like:

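# Keep the first hunk only if it is long enough to be a date (e.g., 1/1/23);
# otherwise leave value as the whole string.
patched = value.split()[0].replace(",", "").replace(";", "")
if len(patched) >= 6:
    value = patched

stucka commented 9 months ago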

Ohio and New York appear to use something similar.
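
For reference, here is a minimal sketch of the length check proposed above, pulled into a standalone helper so it could be shared by the New York, Ohio and other transformers. The function name `simplify_date`, the `min_length` parameter and the sample inputs are assumptions for illustration, not code from warn-transformer. It also guards against empty input, which the one-liner would not survive (`"".split()[0]` raises `IndexError`).

```python
def simplify_date(value: str, min_length: int = 6) -> str:
    """Return the first whitespace-delimited hunk if it could be a date.

    If that hunk is shorter than min_length (six characters covers 1/1/23),
    fall back to the whole stripped string so it can still be matched.
    """
    tokens = value.split()
    if not tokens:
        # Nothing to split on; hand back the (empty) stripped input.
        return value.strip()
    patched = tokens[0].replace(",", "").replace(";", "")
    if len(patched) >= min_length:
        return patched
    # First hunk is too short to be a date or a useful identifier.
    return value.strip()


print(simplify_date("2/12/2021 (amended)"))  # first hunk is long enough
print(simplify_date("2 /12 /2021"))          # first hunk is just "2"
```

With the New York example from above, "2 /12 /2021" is passed through whole rather than collapsing to "2". Whether six is the right threshold, or whether the fallback should also strip the internal spaces, would be a judgment call for each transformer.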