This PR makes a fairly significant change in the way that dates are extracted and processed for reports.
Rather than trying to associate particular dates with particular events, each report now has an associated date range: the article is taken to refer to events that happened within that range.
Dates are no longer extracted at the same time as other information.
The new process is as follows:
Once an article has been processed, if it has reports, all possible dates are extracted from the text (using spaCy named entity recognition).
Each possible date is converted to an absolute datetime.
Dates that are in the future or too far in the past are excluded.
The date span for each report is then based on the earliest and latest remaining dates.
If no dates remain, the publication date is used.
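The filtering and span steps above can be sketched roughly as follows. This is an illustrative sketch, not the code in this PR: the function name `report_date_span`, the `max_age` cutoff of one year, and the choice to treat "in the future" as "after the publication date" are all assumptions made for the example. It takes candidate dates that have already been converted to absolute datetimes (with `None` standing in for a failed conversion), so the spaCy extraction step is left out.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple

# Assumed cutoff: the PR does not state how far back is "too far in the past".
MAX_AGE = timedelta(days=365)


def report_date_span(
    candidates: Iterable[Optional[datetime]],
    publication_date: datetime,
    max_age: timedelta = MAX_AGE,
) -> Tuple[datetime, datetime]:
    """Return (start, end) for a report from already-converted candidate dates."""
    earliest_allowed = publication_date - max_age
    # Drop failed conversions, dates after publication (assumed meaning of
    # "in the future"), and dates older than the cutoff.
    kept = [
        d for d in candidates
        if d is not None and earliest_allowed <= d <= publication_date
    ]
    if not kept:
        # Nothing usable was extracted: fall back to the publication date.
        return publication_date, publication_date
    return min(kept), max(kept)


published = datetime(2020, 3, 10)
candidates = [
    datetime(2020, 3, 8),
    datetime(2020, 3, 1),
    datetime(2021, 1, 1),   # after publication, excluded
    datetime(1999, 5, 4),   # too old, excluded
    None,                   # conversion failed, excluded
]
span = report_date_span(candidates, published)
```

Here `span` covers 1 March 2020 to 8 March 2020; with an empty candidate list it would collapse to the publication date for both ends.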
There are still a number of opportunities to improve conversion to absolute datetimes.