AndyTheFactory / newspaper4k

📰 Newspaper4k, a fork of the beloved Newspaper3k. Extraction of articles, titles, and metadata from news websites.
MIT License
464 stars 45 forks

get_publish_date() Strategy 3 not implemented #176

Open AndyTheFactory opened 12 months ago

AndyTheFactory commented 12 months ago

Issue by awiebe Wed Feb 21 07:38:28 2018 Originally opened as https://github.com/codelucas/newspaper/issues/521


https://github.com/codelucas/newspaper/blob/c0eed1a571ab91382ca7e86767b2c26e7e59bbd2/newspaper/extractors.py#L179

Strategy 3 (search the body for a date) is not implemented, so BBC articles such as http://www.bbc.com/news/business-43133853, which have no date in the URL or meta tags, currently get no publish date, even though the date and time are clearly present in the HTML as:

<div class="date date--v2" data-seconds="1519152107" data-datetime="20 February 2018" data-timestamp-inserted="true">20 February 2018</div>
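
A body search along these lines could pick that value up. This is only a rough sketch using the standard library, not what `get_publish_date()` does or will do: the name `find_body_date` is hypothetical, and the `data-seconds` / `data-datetime` attributes are taken from the BBC snippet above, so other sites would need their own rules.

```python
from datetime import datetime, timezone
from html.parser import HTMLParser
from typing import Optional


class _BodyDateParser(HTMLParser):
    """Keeps the first date found in data-seconds / data-datetime attributes."""

    def __init__(self) -> None:
        super().__init__()
        self.found: Optional[datetime] = None

    def handle_starttag(self, tag, attrs):
        if self.found is not None:
            return  # first match wins
        attr_map = dict(attrs)
        # Prefer the epoch timestamp: it is unambiguous and carries the time.
        seconds = attr_map.get("data-seconds")
        if seconds and seconds.isdigit():
            self.found = datetime.fromtimestamp(int(seconds), tz=timezone.utc)
            return
        # Fall back to the readable form, e.g. "20 February 2018".
        # %B depends on the locale; English month names are assumed here.
        text = attr_map.get("data-datetime")
        if text:
            try:
                parsed = datetime.strptime(text.strip(), "%d %B %Y")
            except ValueError:
                return
            # Assumption: treat a date without a zone as UTC.
            self.found = parsed.replace(tzinfo=timezone.utc)


def find_body_date(html: str) -> Optional[datetime]:
    """Hypothetical helper: return the first date found in the body, or None."""
    parser = _BodyDateParser()
    parser.feed(html)
    return parser.found


SAMPLE = (
    '<div class="date date--v2" data-seconds="1519152107" '
    'data-datetime="20 February 2018" data-timestamp-inserted="true">'
    "20 February 2018</div>"
)
print(find_body_date(SAMPLE))
```

Taking the epoch value first keeps the time of day, which the readable `data-datetime` string drops.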

AndyTheFactory commented 12 months ago

Comment by codelucas Thu Feb 22 05:13:08 2018


Thanks for reminding us of this, will leave this task open for anyone to take it on.

AndyTheFactory commented 12 months ago

Comment by torbenbrodt Fri Jul 13 12:16:59 2018


BBC uses schema.org. I have tested it, including #385, and it works (tested with https://www.bbc.com/news/business-43133853).
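
For anyone wanting to check a page by hand, here is a rough sketch of reading a schema.org date. It assumes the page embeds the metadata as a JSON-LD `<script>` block with a `datePublished` field; schema.org can also be expressed as microdata or RDFa, which this does not cover. The name `find_schema_org_date` and the sample markup are made up for illustration and are not taken from #385 or from the actual BBC page.

```python
import json
from datetime import datetime, timezone
from html.parser import HTMLParser
from typing import List, Optional


class _JsonLdCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json">."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buf: List[str] = []
        self.blocks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buf = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buf))
            self._in_jsonld = False


def find_schema_org_date(html: str) -> Optional[datetime]:
    """Hypothetical helper: first parseable top-level datePublished, or None."""
    collector = _JsonLdCollector()
    collector.feed(html)
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except ValueError:
            continue  # skip malformed JSON-LD
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            value = item.get("datePublished")
            if isinstance(value, str):
                try:
                    # Older Pythons reject a trailing "Z" in fromisoformat.
                    return datetime.fromisoformat(value.replace("Z", "+00:00"))
                except ValueError:
                    continue
    return None


# Made-up sample markup, not copied from any real page.
SAMPLE_JSONLD = (
    '<html><head><script type="application/ld+json">'
    '{"@type": "NewsArticle", "datePublished": "2018-02-20T18:41:47Z"}'
    "</script></head><body></body></html>"
)
print(find_schema_org_date(SAMPLE_JSONLD))
```

Nested structures such as `@graph` are not walked here; only top-level objects or a top-level list are checked.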