alan-turing-institute / misinformation-crawler

Web crawler to collect snapshots of articles to web archive
MIT License

Missing dates for some redstate.com articles #341

Closed by edwardchalstrey1 5 years ago

edwardchalstrey1 commented 5 years ago

The date extraction in the site config needs to be updated.

Some of the articles have a date element like this:

<p class="byline">
<span class="diary text-uppercase"><a href="/stridentconservative " style="color:red;">DIARY / </a><a href="/stridentconservative">stridentconservative</a> //&nbsp</span>
Posted at 6:00 am on November 2, 2016 by <a href="/stridentconservative">stridentconservative</a>
</p>

This stops the //p[@class="byline"]/text() expression from working because of the //&nbsp in the span.
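
For reference, one possible workaround is to skip the XPath text() step, flatten the whole byline to plain text, and then search it for the "Posted at ... on ..." wording. The snippet below is only a rough sketch of that idea using the Python standard library. It is not how the crawler's site config works, and the names extract_byline_date, _TextCollector and POSTED_RE are made up for illustration. It also assumes every byline uses the same wording as the example above.

```python
import re
from datetime import datetime
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect every text node, including text nested inside child tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


# Hypothetical pattern: assumes the "Posted at <time> on <date>" wording
# seen in the example byline; other wordings would need their own pattern.
POSTED_RE = re.compile(
    r"Posted at (\d{1,2}:\d{2} [ap]m) on ([A-Z][a-z]+ \d{1,2}, \d{4})"
)


def extract_byline_date(html):
    """Return the publication datetime found in the byline text, or None."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    # Collapse all whitespace (including non-breaking spaces) to single spaces.
    text = " ".join("".join(collector.chunks).split())
    match = POSTED_RE.search(text)
    if match is None:
        return None
    time_part, date_part = match.groups()
    return datetime.strptime(f"{date_part} {time_part}", "%B %d, %Y %I:%M %p")
```

Because the search runs over the flattened text, the stray //&nbsp in the span no longer matters. The trade-off is that the pattern is tied to one byline wording and returns None for anything else.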

edwardchalstrey1 commented 5 years ago

Decided to keep using the meta tag for dates, since it already works for almost all articles.