codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License

get publish date failed #891

Open saha65536 opened 3 years ago

saha65536 commented 3 years ago

I tested 1,000 URLs from 100 websites, and for 90% of them the publish date is None.

johnbumgarner commented 3 years ago

This likely happened because the structure of those 90% differs from that of the other 10%. Sometimes you need to configure newspaper to extract the content based on the web page's structure.

Please provide some examples of the ones that failed.

saha65536 commented 3 years ago

https://jp.weforum.org/agenda/2021/03/nado-no-ha-ka/
https://jp.weforum.org/agenda/2021/04/kyasshuresu-no-wo-me-ajiawoyori-na-ni-kuniha/

johnbumgarner commented 3 years ago

Newspaper has three strategies for extracting publish dates, listed below in descending order of accuracy. If one strategy fails, the next one is attempted.

  1. Pubdate from URL
  2. Pubdate from metadata
  3. Raw regex searches in the HTML + added heuristics

The first strategy fails because the URL doesn't contain a complete date: it has a year and a month, but no day.
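To illustrate why a year/month-only URL yields no date, here is a minimal sketch of a URL-based check. The pattern `FULL_DATE_IN_URL` and the helper `date_from_url` are hypothetical and deliberately simplified; they are not newspaper's actual regex or API. The `example.com` URL is made up for contrast.

```python
import re
from datetime import date
from typing import Optional

# Hypothetical, simplified pattern: requires /YYYY/MM/DD in the URL path.
# This is NOT the regex newspaper uses internally.
FULL_DATE_IN_URL = re.compile(r"/((?:19|20)\d{2})/(\d{1,2})/(\d{1,2})(?:/|$)")


def date_from_url(url: str) -> Optional[date]:
    """Return the date embedded in the URL path, or None if it is incomplete."""
    match = FULL_DATE_IN_URL.search(url)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Digits were present but do not form a valid calendar date.
        return None


# URL from this issue: year and month only, so no full date is found.
print(date_from_url("https://jp.weforum.org/agenda/2021/03/nado-no-ha-ka/"))  # None
# Made-up URL with a complete date in the path.
print(date_from_url("https://example.com/news/2021/03/23/some-story/"))  # 2021-03-23
```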

The second strategy fails because the target website puts the published date in a tag that newspaper does not query.

The third strategy fails because the date string contains Japanese characters: 2021年03月23日
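A date in this format can still be recognized with a pattern that expects the 年 (year), 月 (month), and 日 (day) markers. The following is a rough sketch using only the standard library; `JP_DATE` and `parse_japanese_date` are illustrative names, not part of newspaper.

```python
import re
from datetime import date
from typing import Optional

# Matches dates such as 2021年03月23日 anywhere in a longer string.
JP_DATE = re.compile(r"(\d{4})年\s*(\d{1,2})月\s*(\d{1,2})日")


def parse_japanese_date(text: str) -> Optional[date]:
    """Return the first YYYY年MM月DD日 date found in text, or None."""
    match = JP_DATE.search(text)
    if match is None:
        return None
    year, month, day = map(int, match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # The digits matched the pattern but are not a valid calendar date.
        return None


print(parse_japanese_date("2021年03月23日"))  # 2021-03-23
```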

The best option is for you to use BeautifulSoup to extract the date from the target website.
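Something along these lines might work as a starting point. Note the assumptions: the sample HTML, the `div` tag, and the `article-published` class name are all made up for illustration, so you would need to inspect the real page to find where the date actually lives. `extract_publish_date` is a hypothetical helper, not a newspaper function.

```python
from datetime import datetime
from typing import Optional

from bs4 import BeautifulSoup

# Made-up markup for illustration only; the real page structure may differ.
SAMPLE_HTML = """
<html><body>
  <article>
    <h1>Sample headline</h1>
    <div class="article-published">2021年03月23日</div>
  </article>
</body></html>
"""


def extract_publish_date(
    html: str,
    tag: str = "div",                      # assumed tag name
    css_class: str = "article-published",  # assumed class name
) -> Optional[datetime]:
    """Find the date element and parse a YYYY年MM月DD日 string from it."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.find(tag, class_=css_class)
    if node is None:
        return None
    text = node.get_text(strip=True)
    try:
        return datetime.strptime(text, "%Y年%m月%d日")
    except ValueError:
        # The element exists but its text is not in the expected format.
        return None


print(extract_publish_date(SAMPLE_HTML))  # 2021-03-23 00:00:00
```

If you are already downloading pages with newspaper, you could pass `article.html` (available after `article.download()`) to a helper like this instead of fetching the page a second time.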