Open saha65536 opened 3 years ago
This likely happened because the structure of those 90% of pages differs from that of the other 10%. Sometimes you need to configure newspaper to extract the content based on the web page's structure.
Please provide some examples of the ones that failed.
Newspaper tries several strategies for extracting publish dates. The strategies are applied in descending order of accuracy; if one fails, the next one is attempted.
The first strategy fails because the URL doesn't contain a complete date.
The second strategy fails because the target website puts the published date in a tag that newspaper doesn't query.
The third strategy fails because the date string contains Japanese characters: 2021年03月23日
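To illustrate the last point, here is a minimal sketch of parsing a date in that format with only the standard library. The function name `parse_japanese_date` and the sample text are made up for this example; this is not part of newspaper.

```python
import re
from datetime import date

# Matches dates written as <year>年<month>月<day>日, e.g. 2021年03月23日.
JAPANESE_DATE = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日")


def parse_japanese_date(text):
    """Return the first 年/月/日 date found in text, or None (hypothetical helper)."""
    match = JAPANESE_DATE.search(text)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # The digits matched the pattern but do not form a valid calendar date.
        return None


print(parse_japanese_date("公開日: 2021年03月23日 10:00"))
```

An explicit pattern like this only covers one date layout, so other sites would probably need their own pattern.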
The best option is for you to use BeautifulSoup to extract the date from the target website.
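A rough sketch of that approach is below. The sample HTML, the `p.date` selector, and the helper name `extract_publish_date` are assumptions for illustration; inspect the real page to find the tag that actually holds the date.

```python
from datetime import datetime

from bs4 import BeautifulSoup

# Hypothetical markup; the real site will use its own tags and classes.
SAMPLE_HTML = """
<html><body>
  <article>
    <p class="date">2021年03月23日</p>
    <p>Article text</p>
  </article>
</body></html>
"""


def extract_publish_date(html, selector="p.date", date_format="%Y年%m月%d日"):
    """Return the date found at the CSS selector, or None if missing or unparseable."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one(selector)
    if node is None:
        return None
    try:
        # Literal characters such as 年, 月, 日 in the format must match the text exactly.
        return datetime.strptime(node.get_text(strip=True), date_format)
    except ValueError:
        return None


print(extract_publish_date(SAMPLE_HTML))
```

For a live page you would pass the downloaded HTML instead of `SAMPLE_HTML`, and adjust the selector and format for each site.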
I tested 1,000 URLs from 100 websites, and for 90% of them the publish date is None.