hoIIer closed 3 years ago
good find, thanks for filing this. we will take a look
great, thanks for your work, much appreciated!
I also found Forbes links not working:
This may be due to the user agent. Try this, it should work:
from newspaper import Article

url = 'https://www.seattletimes.com/seattle-news/politics/the-irony-of-the-no-cop-chop-it-showed-how-much-we-still-need-the-police-after-all/'
# Send a browser-like User-Agent in case the site rejects the default one
article = Article(url, browser_user_agent='Mozilla')
article.download()
article.parse()
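A slightly fuller variant of the snippet above, as a sketch only: it passes a complete browser-style User-Agent string and a longer timeout, then reports whether the download failed or the page parsed to no text. The User-Agent value and the 15-second timeout are placeholder assumptions, and `article_options` and `try_article` are hypothetical helper names, not part of newspaper.

```python
# Placeholder browser-style User-Agent (an assumption; any current browser string should do).
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def article_options(user_agent=BROWSER_UA, timeout=15):
    """Keyword arguments that Article forwards to its configuration."""
    return {"browser_user_agent": user_agent, "request_timeout": timeout}


def try_article(url):
    """Download and parse one URL, returning (ok, detail)."""
    # Imported lazily so article_options() can be used without newspaper installed.
    from newspaper import Article
    from newspaper.article import ArticleException

    article = Article(url, **article_options())
    try:
        article.download()
        article.parse()  # raises ArticleException if the download did not succeed
    except ArticleException as exc:
        return False, f"download failed: {exc}"
    if not article.text.strip():
        return False, "downloaded, but no article text was extracted"
    return True, article.title
```

A "download failed" result points at the request (user agent, blocking, timeout), while "no article text was extracted" points at the parser.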
this one doesn't pick up "top image": https://www.bloomberg.com/news/articles/2020-07-01/anduril-startup-backed-by-peter-thiel-is-valued-at-1-9-billion
@codelucas maybe we should start a page to report publishers that don't work? sorry not trying to spam this issue with other ones besides seattletimes :)
There are several sites that seem to have this issue. I noticed this when trying to parse NYTimes, https://github.com/codelucas/newspaper/issues/645. From what I found, the parser has to decide where in the page the article starts, and sometimes it gets that wrong. When it does, you get partial or no content from the page.
@sagunsh thanks, that solved it, although I may be hitting a separate issue where some sites perhaps block requests from e.g. AWS?
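If the site is rejecting the host's IP range rather than the user agent, one possible workaround is to fetch the HTML from a machine that is not blocked and hand it to newspaper for parsing only. A rough sketch, assuming your installed version of newspaper accepts pre-fetched markup through `download(input_html=...)`; `fetch_html` and `parse_html` are hypothetical helper names.

```python
import urllib.request


def fetch_html(url, user_agent="Mozilla/5.0", timeout=15):
    """Fetch raw HTML with the standard library (run this where the site is reachable)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def parse_html(url, html):
    """Parse already-downloaded HTML with newspaper, skipping its own HTTP request."""
    from newspaper import Article  # imported lazily; only needed for parsing

    article = Article(url)
    article.download(input_html=html)  # assumption: input_html is supported
    article.parse()
    return article
```

If `parse_html(url, fetch_html(url))` works from a laptop but a plain `article.download()` fails from the server, that suggests blocking by network rather than a parsing problem.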
Greetings, I've noticed newspaper failing to parse links from seattletimes.com. How can I dive a bit deeper to find out what's wrong?
URLs:
https://www.seattletimes.com/seattle-news/politics/the-irony-of-the-no-cop-chop-it-showed-how-much-we-still-need-the-police-after-all/
https://www.seattletimes.com/seattle-news/health/coronavirus-daily-news-updates-june-24-what-to-know-today-about-covid-19-in-the-seattle-area-washington-state-and-the-world/
https://www.seattletimes.com/entertainment/visual-arts/seattle-art-museum-isnt-dissolving-despite-a-fake-news-release-with-a-point-to-make-imagining-so/
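One way to dig in before blaming the parser is to check what the server returns for different User-Agent headers, using only the standard library. This is a diagnostic sketch, not something newspaper provides: `build_request` and `probe` are hypothetical helpers, and the User-Agent strings you pass in are up to you.

```python
import urllib.error
import urllib.request


def build_request(url, user_agent):
    """Build a GET request that carries the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def probe(url, user_agent, timeout=15):
    """Return a short description of how the server responds to this User-Agent."""
    request = build_request(url, user_agent)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return f"HTTP {response.status}, {len(response.read())} bytes"
    except urllib.error.HTTPError as exc:
        # HTTPError is a subclass of URLError, so it has to be caught first.
        return f"HTTP {exc.code} ({exc.reason})"
    except urllib.error.URLError as exc:
        return f"connection problem: {exc.reason}"
```

Calling `probe` on one of the URLs above with a browser-like string such as `'Mozilla/5.0'` and again with a non-browser string, then comparing the two results, shows whether the site treats the clients differently.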