codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License
13.89k stars 2.1k forks

Added ability to scrape javascript intensive apps #941

Open Sosshi opened 2 years ago

Sosshi commented 2 years ago

The library was failing to scrape sites whose content is rendered by JavaScript, so I have added the ability to scrape such websites. It should now be possible to scrape sites built with Vue, React, and other JS-intensive frameworks.
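To illustrate the failure mode described here (this is not the code from this PR), the sketch below uses only the Python standard library to show why a static fetch of a JS-rendered page yields no article text. `SPA_SHELL` and `RENDERED` are made-up HTML strings standing in for the raw server response and the DOM after scripts have run; `visible_text` is a hypothetical helper, not part of newspaper.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the human-visible text of an HTML string (hypothetical helper)."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Assumption: what a server might return for a Vue/React app before any JS runs.
SPA_SHELL = (
    '<html><body><div id="app"></div>'
    "<script>renderApp()</script></body></html>"
)

# Assumption: the same page after a browser has executed the scripts.
RENDERED = (
    '<html><body><div id="app"><article><h1>Headline</h1>'
    "<p>Body text.</p></article></div></body></html>"
)

print(repr(visible_text(SPA_SHELL)))
print(repr(visible_text(RENDERED)))
```

Any extractor that only sees the first string has nothing to work with, which is why some rendering step has to run before the HTML is parsed.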

banagale commented 2 years ago

This sounds compelling. I noticed your change set includes conversion of single quotes to double quotes and some other formatting changes.

It would be easier to review these changes if they were limited to materially changed lines.

While I do not believe the maintainer is approving PRs at this time, in general I'd suggest offering a PR that includes only the changes you're working on. Then consider a second PR that addresses formatting more generally.

--

All that said, I'm curious whether you have test cases: sites that render article content with JS, where extraction fails on the main branch but passes with your change set.
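One possible shape for such a comparison is sketched below, under stated assumptions: `find_fixed_cases` is a hypothetical helper, and the two extractor callables are stubs standing in for "extract with main" and "extract with the change set". Neither reflects newspaper's actual API, and the case names and phrases are invented for the example.

```python
def find_fixed_cases(cases, baseline_extract, candidate_extract):
    """Return names of cases that fail on the baseline but pass on the candidate.

    cases: iterable of (name, page, expected_phrase) tuples.
    A case "passes" when expected_phrase appears in the extracted text.
    """
    fixed = []
    for name, page, expected_phrase in cases:
        baseline_ok = expected_phrase in baseline_extract(page)
        candidate_ok = expected_phrase in candidate_extract(page)
        if candidate_ok and not baseline_ok:
            fixed.append(name)
    return fixed


# Stub extractors (assumptions): the baseline sees only static text,
# the candidate also sees text that would be produced by running scripts.
def baseline_stub(page):
    return page["static_text"]


def candidate_stub(page):
    return page["static_text"] + " " + page["js_text"]


# Invented sample cases: one server-rendered page, one JS-rendered page.
CASES = [
    ("static-site", {"static_text": "Plain article body", "js_text": ""}, "Plain article"),
    ("spa-site", {"static_text": "", "js_text": "Rendered article body"}, "Rendered article"),
]

FIXED = find_fixed_cases(CASES, baseline_stub, candidate_stub)
print(FIXED)
```

In a real check, the stubs would be replaced by extraction runs against saved copies of the pages, so the result does not depend on live sites changing.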