codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs: https://goo.gl/VX41yK
MIT License

Unable to parse Seattletimes.com #818

hoIIer closed this issue 3 years ago

hoIIer commented 4 years ago

Greetings, I've noticed newspaper failing to parse a link from Seattletimes.com. How can I dive a bit deeper to find out what's wrong?

urls:

https://www.seattletimes.com/seattle-news/politics/the-irony-of-the-no-cop-chop-it-showed-how-much-we-still-need-the-police-after-all/

https://www.seattletimes.com/seattle-news/health/coronavirus-daily-news-updates-june-24-what-to-know-today-about-covid-19-in-the-seattle-area-washington-state-and-the-world/

https://www.seattletimes.com/entertainment/visual-arts/seattle-art-museum-isnt-dissolving-despite-a-fake-news-release-with-a-point-to-make-imagining-so/

codelucas commented 4 years ago

good find, thanks for filing this. we will take a look

hoIIer commented 4 years ago

great, thanks for your work, much appreciated!

I also found Forbes links not working:

https://www.forbes.com/sites/jackbrewster/2020/06/27/trump-most-republicans-silent-about-reports-russia-paid-taliban-to-kill-us-troops/#7acb8c7c6a45

https://www.forbes.com/sites/adrianbridgwater/2020/06/17/world-economic-forum-data-flows-project-hopes-to-globalize-information-exchange/?ss=big-data#523c3904745f

sagunsh commented 4 years ago

This may be due to the user agent. Try this, it should work:

from newspaper import Article

# Override the default user agent, which some sites may reject.
url = 'https://www.seattletimes.com/seattle-news/politics/the-irony-of-the-no-cop-chop-it-showed-how-much-we-still-need-the-police-after-all/'
article = Article(url, browser_user_agent='Mozilla')
article.download()
article.parse()
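
To check whether the user agent really is the cause, one option is to request the same URL twice outside of newspaper and compare the status codes. This is only a rough sketch using the standard library: `BROWSER_UA`, `build_request`, `fetch_status`, and `compare_user_agents` are hypothetical names made up for illustration, and the user agent string is just an example value.

```python
import urllib.error
import urllib.request

# Hypothetical browser-like User-Agent string, chosen only for illustration.
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/83.0 Safari/537.36"
)


def build_request(url, user_agent=None):
    """Build a GET request, optionally overriding the User-Agent header."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    return urllib.request.Request(url, headers=headers)


def fetch_status(request, timeout=10):
    """Return the HTTP status code; 4xx/5xx codes are returned, not raised."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def compare_user_agents(url):
    """Fetch url with urllib's default User-Agent and a browser-like one."""
    return {
        "default": fetch_status(build_request(url)),
        "browser": fetch_status(build_request(url, BROWSER_UA)),
    }
```

Calling `compare_user_agents(url)` on one of the failing links would show whether the two requests are treated differently. If both come back with the same status, the user agent is probably not the problem. Connection errors (`urllib.error.URLError`) are not caught here and will propagate.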

hoIIer commented 4 years ago

Another:

https://www.nasaspaceflight.com/2020/06/starship-sn5-test-campaign/

hoIIer commented 4 years ago

this one doesn't pick up "top image": https://www.bloomberg.com/news/articles/2020-07-01/anduril-startup-backed-by-peter-thiel-is-valued-at-1-9-billion

@codelucas maybe we should start a page to report publishers that don't work? Sorry, not trying to spam this issue with sites other than Seattle Times :)

mmaybeno commented 4 years ago

There are several sites that seem to have this issue. I noticed this when trying to parse NYTimes, https://github.com/codelucas/newspaper/issues/645. From what I found, the parser has to decide where in the page the article starts, and sometimes it gets that wrong. When that happens, you get partial or no content from the page.
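
A rough way to tell these cases apart when testing a list of URLs is to look at what came back from the download versus what came out of the parse. The sketch below is just a heuristic, not anything newspaper does itself: `classify_extraction` and the `min_chars` threshold are hypothetical, and the threshold would need tuning per site.

```python
def classify_extraction(html, text, min_chars=200):
    """Label an extraction result from the raw HTML and the extracted text.

    min_chars is an arbitrary, illustrative cutoff for "suspiciously short".
    """
    if not html.strip():
        # Nothing was downloaded, so the problem is the request, not parsing.
        return "no-html"
    if not text.strip():
        # HTML arrived but no article body was extracted.
        return "no-text"
    if len(text.strip()) < min_chars:
        # Some text was extracted, but likely only part of the article.
        return "partial"
    return "ok"
```

After `article.download()` and `article.parse()`, this could be called as `classify_extraction(article.html, article.text)`. A "no-html" result points at the request being blocked or failing, while "no-text" or "partial" points at the parsing step.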

hoIIer commented 3 years ago

@sagunsh thanks, that solved it, although I may be hitting a separate issue where some sites perhaps block requests from certain hosts, e.g. AWS?