codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License
14.05k stars 2.11k forks

get_publish_date() Strategy 3 not implemented #521

Open awiebe opened 6 years ago

awiebe commented 6 years ago

https://github.com/codelucas/newspaper/blob/c0eed1a571ab91382ca7e86767b2c26e7e59bbd2/newspaper/extractors.py#L179

Strategy 3 (search body for date) is not implemented. As a result, BBC articles such as http://www.bbc.com/news/business-43133853, which have no date information in the URL or meta tags, get no publish date, even though the date and time are clearly present in the HTML as:

<div class="date date--v2" data-seconds="1519152107" data-datetime="20 February 2018" data-timestamp-inserted="true">20 February 2018</div>
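
For anyone picking this up, here is a minimal sketch of what a body-search strategy could look like for markup like the `div` above. It is not newspaper's implementation. The names `DateDivParser` and `find_body_date` are hypothetical, and it assumes the date lives in a `data-seconds` (epoch seconds) or `data-datetime` (day, month name, year) attribute, as in the BBC example. It uses only the standard library so it can be tried in isolation.

```python
from datetime import datetime, timezone
from html.parser import HTMLParser


class DateDivParser(HTMLParser):
    """Hypothetical helper: take a date from the first element that has
    a `data-seconds` or `data-datetime` attribute."""

    def __init__(self):
        super().__init__()
        self.publish_date = None

    def handle_starttag(self, tag, attrs):
        if self.publish_date is not None:
            return  # keep the first match only
        attr_map = dict(attrs)

        # Prefer epoch seconds: unambiguous and includes the time of day.
        seconds = attr_map.get("data-seconds")
        if seconds and seconds.isdigit():
            self.publish_date = datetime.fromtimestamp(int(seconds), tz=timezone.utc)
            return

        # Fall back to the human-readable form, e.g. "20 February 2018".
        # Note: %B depends on the process locale, and the result is a
        # naive datetime with no time of day.
        text = attr_map.get("data-datetime")
        if text:
            try:
                self.publish_date = datetime.strptime(text, "%d %B %Y")
            except ValueError:
                pass  # unrecognised format, keep looking


def find_body_date(html):
    """Return a datetime found in the body markup, or None."""
    parser = DateDivParser()
    parser.feed(html)
    parser.close()
    return parser.publish_date
```

Called on the `div` from the BBC article, `find_body_date` would return a UTC-aware datetime built from `data-seconds`. A real Strategy 3 would likely need site-independent heuristics as well, since these attribute names are specific to this markup.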

codelucas commented 6 years ago

Thanks for reminding us of this, will leave this task open for anyone to take it on.

torbenbrodt commented 6 years ago

BBC uses schema.org. I have tested it including #385, and it works (tested with https://www.bbc.com/news/business-43133853).
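
To illustrate the schema.org route, here is a rough sketch of reading `datePublished` from a page. It assumes the metadata is embedded as JSON-LD in a `<script type="application/ld+json">` block, which is one common way to publish schema.org data. Whether BBC pages use that form, and what #385 actually changes, is not shown in this thread. `JsonLdCollector` and `find_schema_org_date` are hypothetical names.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Hypothetical helper: collect the raw text of every JSON-LD script block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.blocks.append("".join(self._buffer))


def find_schema_org_date(html):
    """Return the first top-level `datePublished` string found, or None."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except ValueError:
            continue  # skip malformed JSON-LD blocks
        # A block may hold a single object or a list of objects.
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and "datePublished" in item:
                return item["datePublished"]
    return None
```

The function returns the raw string as published, so it would still need parsing into a `datetime`. It only inspects top-level objects and does not descend into nested structures such as `@graph`.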