commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC
Apache License 2.0

Extract publishing date #18

Open fhamborg opened 7 years ago

fhamborg commented 7 years ago

It would be great if you could additionally extract the date on which an article was published. Currently, getting this information requires parsing the web page with tools such as newspaper3k. However, at least some sites already expose it during the crawling process, e.g. as the timestamp in the RSS feed, <pubDate>Thu, 25 Dec 2014 02:10:00 +0900</pubDate>, or in the news sitemap, <news:publication_date>2016-12-09T16:18:48Z</news:publication_date>.
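To illustrate the two date formats mentioned above, here is a minimal Python sketch that reads them into timezone-aware datetimes. It is not how news-crawl does or would do this (the project is built on StormCrawler in Java). The function names are hypothetical, and the sketch assumes that RSS dates follow the RFC 822 style and that sitemap dates are ISO 8601 / W3C datetimes with an optional trailing "Z".

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
import xml.etree.ElementTree as ET

# Namespace URI of the Google News sitemap extension.
NEWS_NS = "http://www.google.com/schemas/sitemap-news/0.9"


def parse_rss_pub_date(value):
    """Parse an RFC 822 style date, as found in an RSS <pubDate> element."""
    return parsedate_to_datetime(value.strip())


def parse_sitemap_pub_date(value):
    """Parse a W3C datetime, as found in <news:publication_date>."""
    text = value.strip()
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11 on,
    # so rewrite it as an explicit UTC offset first.
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    return datetime.fromisoformat(text)


def sitemap_pub_dates(sitemap_xml):
    """Return all publication dates found in a news sitemap document."""
    root = ET.fromstring(sitemap_xml)
    tag = "{%s}publication_date" % NEWS_NS
    return [parse_sitemap_pub_date(el.text) for el in root.iter(tag) if el.text]


if __name__ == "__main__":
    rss_date = parse_rss_pub_date("Thu, 25 Dec 2014 02:10:00 +0900")
    print(rss_date.astimezone(timezone.utc).isoformat())

    sitemap_date = parse_sitemap_pub_date("2016-12-09T16:18:48Z")
    print(sitemap_date.isoformat())
```

A date-only value such as 2016-12-09 is also accepted by parse_sitemap_pub_date, but the result is then a naive datetime without timezone information, so a real implementation would have to decide how to treat such values.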

sebastian-nagel commented 6 years ago

Status update:

sebastian-nagel commented 5 years ago

The project now uses crawler-commons 1.0, which brings full support for all sitemap extensions, including news sitemaps. The <news:publication_date> is now used to skip older news articles (with the current configuration, those older than 30 days). Next steps to implement would be: