Data4Democracy / internal-displacement

Studying news events and internal displacement.

Change the way dates are extracted and used for reports. #129

Closed simonb83 closed 7 years ago

simonb83 commented 7 years ago

This PR makes a fairly significant change in the way that dates are extracted and processed for reports.

Rather than trying to associate particular dates with particular events, each report now has an associated date range; that is, an article is taken to refer to events that happened within that range.

  1. Dates are no longer extracted at the same time as other information
  2. The new process is as follows:
    • Once an article has been processed, if it has reports, all possible dates are extracted from the text (using spaCy named entity recognition)
    • Each possible date is converted to an absolute datetime
    • Certain dates are excluded (i.e. if they are in the future or too far in the past)
    • The date span for each report is then based on the earliest and latest extracted dates
    • If no dates are extracted, the publication date is used
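
The filtering and span steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `report_date_span`, the `max_age` cutoff of one year, and the choice to measure "future" and "too far in the past" relative to the publication date are all assumptions made for the example. It also assumes the candidate dates have already been extracted and converted to absolute datetimes.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple


def report_date_span(
    candidates: Iterable[datetime],
    publication_date: datetime,
    max_age: timedelta = timedelta(days=365),  # hypothetical cutoff
) -> Tuple[datetime, datetime]:
    """Return (start, end) of the date span for a report.

    Assumption: dates are judged relative to the publication date,
    so anything after it counts as "in the future" and anything
    more than `max_age` before it counts as "too far in the past".
    """
    earliest_allowed = publication_date - max_age

    # Exclude dates in the future or too far in the past.
    valid = [d for d in candidates if earliest_allowed <= d <= publication_date]

    # If no usable dates remain, fall back to the publication date.
    if not valid:
        return publication_date, publication_date

    # The span runs from the earliest to the latest remaining date.
    return min(valid), max(valid)
```

For example, with a publication date of 2017-03-10 and candidates 2017-03-01, 2017-03-08, 2018-01-01 and 2010-01-01, the last two are dropped and the span runs from 2017-03-01 to 2017-03-08.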

There are still a number of opportunities to improve conversion to absolute datetimes.