codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs: https://goo.gl/VX41yK
MIT License
14.06k stars 2.11k forks

Not able to crawl Javascript-disabled webpages #902

Open AmeyHengle opened 3 years ago

AmeyHengle commented 3 years ago

Hello guys, I am using newspaper3k to crawl text from webpages. I noticed that after calling article.parse(), article.text does not contain the article body for webpages that require JavaScript; it contains the site's "Javascript is Disabled" notice instead.

Following is the code that I am using:

```python
from newspaper import Article

url = "https://seekingalpha.com/article/4439299-russell-2000-leading-wall-street-lower-sell-iwm"

article = Article(url)
article.download()  # fetch the raw HTML
article.parse()     # extract title, text, and metadata
print(article.text)
```

I am getting the following error:

```
Javascript is Disabled

Your current browser configuration

is not compatible with this site.
```

Does anyone know how to overcome this?

johnbumgarner commented 3 years ago

seekingalpha.com requires a login, so you need to pass that information to the website to harvest the article text. I haven't tried to use newspaper3k for this, but it should work because the package uses Python Requests.
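
A rough sketch of that idea, untested against seekingalpha.com: fetch the page yourself with a Requests session that carries your login cookies, then hand the HTML to newspaper3k through `download(input_html=...)` instead of letting it fetch the page. The function names `fetch_article_text` and `looks_like_js_wall`, the cookie contents, and the User-Agent string are all made up for illustration; which cookies or headers the site actually needs is an assumption you would have to check in your own logged-in browser session. If the site builds the article body with JavaScript, this will still return the placeholder, which is why the helper checks for it.

```python
def looks_like_js_wall(text):
    """Return True if the text is the 'Javascript is Disabled' notice from the issue."""
    return "javascript is disabled" in text.lower()


def fetch_article_text(url, cookies=None, headers=None):
    """Download url with a Requests session, then parse the HTML with newspaper3k.

    cookies and headers are hypothetical placeholders for whatever the
    site needs to treat the request as a logged-in browser.
    """
    # Third-party imports are kept local so looks_like_js_wall() stays
    # usable without requests or newspaper3k installed.
    import requests
    from newspaper import Article

    with requests.Session() as session:
        # Assumed browser-like default; replace with your real browser's value.
        session.headers.update(headers or {"User-Agent": "Mozilla/5.0"})
        if cookies:
            session.cookies.update(cookies)
        response = session.get(url, timeout=10)
        response.raise_for_status()
        html = response.text

    article = Article(url)
    article.download(input_html=html)  # skip newspaper's own HTTP request
    article.parse()

    if looks_like_js_wall(article.text):
        raise RuntimeError("Still got the JavaScript placeholder, not the article text.")
    return article.text
```

You would call it as `fetch_article_text(url, cookies={...})`, filling in the cookies from your own session. If it still raises, the page probably needs a real browser engine to render, and Requests alone will not be enough.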