get publish date failed

saha65536 commented 3 years ago

I tested 1000 URLs from 100 websites, and for 90% of them the publish date is None.

johnbumgarner commented 3 years ago

This likely happened because the structure of those 90% differs from that of the other 10%. Sometimes you need to configure newspaper to extract the content based on the web page's structure.

Please provide some examples of the ones that failed.

saha65536 commented 3 years ago

https://jp.weforum.org/agenda/2021/03/nado-no-ha-ka/
https://jp.weforum.org/agenda/2021/04/kyasshuresu-no-wo-me-ajiawoyori-na-ni-kuniha/

johnbumgarner commented 3 years ago

Newspaper has three strategies for extracting publish dates. They are listed below in descending order of accuracy. If one strategy fails, the next one is attempted.

1. Pubdate from URL
2. Pubdate from metadata
3. Raw regex searches in the HTML + added heuristics
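
The fallback order described above can be sketched roughly as follows. This is only an illustration of "try each strategy until one returns a date", not newspaper's actual code: the function names, the URL pattern, and the single meta tag checked here are all assumptions made for the example.

```python
import re
from datetime import datetime
from typing import Callable, List, Optional

# Hypothetical pattern: requires a full /YYYY/MM/DD/ segment in the URL path.
URL_DATE = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)")

# Hypothetical metadata check: looks at one common Open Graph tag only.
META_DATE = re.compile(
    r'<meta\s+property="article:published_time"\s+content="(\d{4}-\d{2}-\d{2})'
)

Strategy = Callable[[str, str], Optional[datetime]]


def date_from_url(url: str, html: str) -> Optional[datetime]:
    """Return a date only if the URL carries year, month and day."""
    match = URL_DATE.search(url)
    if match is None:
        return None
    try:
        return datetime(*map(int, match.groups()))
    except ValueError:  # e.g. month 13 or day 32
        return None


def date_from_metadata(url: str, html: str) -> Optional[datetime]:
    """Return a date if the assumed meta tag is present in the HTML."""
    match = META_DATE.search(html)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%d")


def first_successful(url: str, html: str, strategies: List[Strategy]) -> Optional[datetime]:
    """Try each strategy in order and return the first non-None result."""
    for strategy in strategies:
        result = strategy(url, html)
        if result is not None:
            return result
    return None
```

With this sketch, a URL such as https://jp.weforum.org/agenda/2021/03/nado-no-ha-ka/ yields no date from the URL step because it has no day component, so the next strategy is tried.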

The first strategy fails because the URL doesn't have a complete date: it contains only the year and month (for example /2021/03/), not the day.

The second strategy fails because the target website has the published date in a tag not queried by newspaper.

The third strategy fails because the date string contains Japanese characters: 2021年03月23日
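
For reference, a date string in this format can be parsed with a small regex of your own. This is a minimal sketch using only the standard library; the function name `parse_japanese_date` is made up for the example and is not part of newspaper.

```python
import re
from datetime import date
from typing import Optional

# Matches dates written as <year>年<month>月<day>日, e.g. 2021年03月23日.
JP_DATE = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日")


def parse_japanese_date(text: str) -> Optional[date]:
    """Return the first 年/月/日 date found in text, or None if there is none."""
    match = JP_DATE.search(text)
    if match is None:
        return None
    year, month, day = map(int, match.groups())
    try:
        return date(year, month, day)
    except ValueError:  # digits matched but do not form a valid calendar date
        return None
```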

The best option is for you to use BeautifulSoup to extract the date from the target website.
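
A minimal sketch of that approach is shown below. The HTML snippet and the `div.article-published` selector are assumptions made for the example; inspect the real page to find the element that actually holds the date and adjust the selector accordingly.

```python
import re
from datetime import date
from typing import Optional

from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
SAMPLE_HTML = """
<html><body>
  <article>
    <h1>Example headline</h1>
    <div class="article-published">2021年03月23日</div>
  </article>
</body></html>
"""

JP_DATE = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日")


def extract_publish_date(html: str, selector: str = "div.article-published") -> Optional[date]:
    """Find the element matching selector and parse a 年/月/日 date from its text."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one(selector)
    if node is None:
        return None
    match = JP_DATE.search(node.get_text(strip=True))
    if match is None:
        return None
    year, month, day = map(int, match.groups())
    try:
        return date(year, month, day)
    except ValueError:  # not a valid calendar date
        return None


if __name__ == "__main__":
    print(extract_publish_date(SAMPLE_HTML))
```

If you already download pages with newspaper, the HTML stored on the article object after `article.download()` should be usable as the `html` argument here, so the page does not need to be fetched twice.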

codelucas / newspaper

get publish date failed #891