coherentdigital / coherencebot

Apache Nutch is an extensible and scalable web crawler
https://nutch.apache.org/
Apache License 2.0

Use date_published field instead of date_updated #3

Closed avorio closed 3 years ago

avorio commented 3 years ago

Also, we use isoparse to interpret those dates, e.g.:

from dateutil import parser

clean_date = parser.isoparse(date_published)

PeterCiuffetti commented 3 years ago

10 mins of work

PeterCiuffetti commented 3 years ago

Resolved on 6 May 2021. The DTT and Africa drops were republished, with the following note to Andre:

I have rebuilt the JSON for both the DTT org and the Africa orgs to include date_published. However, it's not advisable to use date_published for Africa. In that version of coherencebot, the dates always seem to be the date harvested; the PDF date was not overriding it. It wasn't until the DTT drop that I started using the PDF meta date as an override, so all of the Africa JSON I looked at had 2021-02 dates.

I do recall cases where African sites sent a Last-Modified date from earlier years, when the file was saved on their site, so there may be usable dates among the set. This also happens with DTT content, but only in cases where there is no PDF date. So, for example, this file has good dates: coherencebot-bonn-institute-for-economic-and-social-research-539e24b8-487d-48d4-b499-d174c53f8548.json. But there are many examples where the date will be the date of harvest (2021-04 and thereabouts).
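
For anyone consuming these drops, the precedence described in the note (PDF meta date first, then Last-Modified, then the harvest date) could be sketched roughly as below. This is only an illustration, not coherencebot's actual code: the function name `pick_date_published` and its three parameters are hypothetical, and it uses the standard library's `date.fromisoformat` rather than `isoparse` to stay dependency-free, so it assumes plain `YYYY-MM-DD` strings.

```python
from datetime import date
from typing import Optional, Tuple


def pick_date_published(pdf_date: Optional[str],
                        last_modified: Optional[str],
                        harvested: str) -> Tuple[date, str]:
    """Return (date, source) for a record.

    Hypothetical helper: prefers the PDF meta date, then the server's
    Last-Modified date, and falls back to the harvest date only when
    neither is present.
    """
    for value, source in ((pdf_date, "pdf"), (last_modified, "last_modified")):
        if value:
            return date.fromisoformat(value), source
    # Nothing better is available, so the date is just when we crawled it.
    return date.fromisoformat(harvested), "harvest"
```

Keeping the source label next to the date would let a consumer skip or down-weight records whose date_published is really just the harvest date, which, per the note above, seems to be the case for the Africa drop.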