alan-turing-institute / misinformation-crawler

Web crawler to collect snapshots of articles to web archive
MIT License

Common format for publication dates #108

Open jemrobinson opened 5 years ago

jemrobinson commented 5 years ago

At the moment, we don't convert publication dates to a common format (e.g. UTC). This is easy to do in pendulum (with .in_timezone('utc')) and in arrow (with .to('utc')).
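
For illustration, here is a minimal sketch of the same conversion using only the Python standard library rather than pendulum or arrow. The helper name `to_utc` and the example timestamp are made up for this sketch and are not part of the crawler's code. It only accepts datetimes that already carry an offset.

```python
from datetime import datetime, timezone


def to_utc(published: datetime) -> datetime:
    """Convert a timezone-aware datetime to UTC (hypothetical helper)."""
    # A naive datetime has no offset, so there is no unambiguous UTC instant.
    if published.tzinfo is None or published.utcoffset() is None:
        raise ValueError("cannot convert a naive datetime to UTC unambiguously")
    return published.astimezone(timezone.utc)


# Example value chosen for this sketch: 09:30 at UTC+01:00.
example = datetime.fromisoformat("2019-03-01T09:30:00+01:00")
print(to_utc(example).isoformat())  # 2019-03-01T08:30:00+00:00
```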

Is this something we want to consider @martintoreilly?

martintoreilly commented 5 years ago

If we always recovered the timezone, the correct answer would be to store the full, unambiguous datetime as a datetimeoffset field. However, we are not guaranteed to recover a timezone for every article (or even a time part for dates we extract from hand-entered text). The least misleading approach we can apply uniformly to all articles at the moment is therefore probably to store the publication date as a date field with no reference to the timezone. We currently store it as a datetime2 field with no reference to the timezone, which can be misleading: it would be easy to infer that the ordering of these datetimes reflects the order in which the articles were published.
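
A rough sketch of the date-only option, assuming the parser hands back a Python `datetime` (possibly naive) or `None`. The function name `publication_date_only` is hypothetical. It keeps the calendar date exactly as written in the article and does no timezone conversion.

```python
from datetime import date, datetime
from typing import Optional


def publication_date_only(parsed: Optional[datetime]) -> Optional[date]:
    """Reduce a parsed publication datetime to its calendar date (hypothetical helper)."""
    if parsed is None:
        return None
    # .date() drops the time and any offset without shifting the value,
    # so naive and aware inputs are treated the same way.
    return parsed.date()


aware = datetime.fromisoformat("2019-03-01T23:30:00-05:00")
naive = datetime(2019, 3, 1, 23, 30)
print(publication_date_only(aware), publication_date_only(naive))
```

Both inputs reduce to the same date, so the stored value makes no claim about ordering within a day.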

The basic issue is that, when the publication datetime we parse from an article has no timezone, we are unlikely to be able to infer the correct local time to use for a conversion to UTC (or another common timezone) in all cases. Specifically, a conversion with no timezone present is likely to assume that the datetime is a local time in the timezone of the machine our parsing code runs on (which could reasonably change between runs that update our database).
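
To make the ambiguity concrete, the sketch below interprets one naive datetime under two assumed fixed offsets. The offsets, the timestamp, and the helper `assume_offset` are illustrative choices for this sketch, not values from the crawler.

```python
from datetime import datetime, timedelta, timezone

# A publication datetime parsed with no timezone information.
naive = datetime(2019, 3, 1, 0, 30)


def assume_offset(dt: datetime, hours: int) -> datetime:
    """Pretend the naive value was local time at a fixed UTC offset, then convert to UTC."""
    assumed = dt.replace(tzinfo=timezone(timedelta(hours=hours)))
    return assumed.astimezone(timezone.utc)


as_plus9 = assume_offset(naive, 9)    # e.g. parser running on a UTC+09:00 machine
as_minus5 = assume_offset(naive, -5)  # e.g. parser running on a UTC-05:00 machine

print(as_plus9.isoformat())   # 2019-02-28T15:30:00+00:00
print(as_minus5.isoformat())  # 2019-03-01T05:30:00+00:00
print(as_minus5 - as_plus9)   # 14:00:00
```

The same parsed value lands on different UTC instants, and even different UTC dates, depending only on the assumed local timezone.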

If we did always have timezone info, there would be an argument for standardising the timezone when storing dates: it makes the values easier to eyeball without doing timezone comparisons in your head, and it protects against non-timezone-aware code mishandling the datetimeoffsets when extracting them from the database.