codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License
14.09k stars 2.11k forks

Some article texts are not fully downloaded. #950

Open Jimchoo91 opened 2 years ago

Jimchoo91 commented 2 years ago

Hi, I have only found this on one website so far, but when I try to download the full text from an article on the BBC, it only returns a snippet.

Here is an example article:

https://www.bbc.co.uk/news/world-48810070

Any idea why? Thanks.

bstivers commented 2 years ago

While it's more than a snippet, the full text of articles from Politico doesn't get pulled either.

I believe the main issue is that the code used to parse these websites is so old (the last commit to the main code is 4+ years old) that it no longer handles the HTML source properly after website updates. Big-name websites change their layouts far more often than once every 4 years.

I am sure this library was great in its heyday, but it's nearly unusable now except on smaller websites that haven't changed a thing in the last 5 years. That doesn't leave many, given that even WordPress-based websites have changed quite a bit.

johnbumgarner commented 1 year ago

The library has lots of limitations, because the code base is old. You can parse the BBC site text with some additional code. Here is a document that I wrote on using the library. I will update it in the coming days with the code to extract the BBC text.
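
As a rough illustration of what such additional code might look like (a sketch only, not the code from the document mentioned above): download the page with newspaper as usual, then re-parse the raw HTML yourself when `article.text` comes back short. The helper names `ArticleParagraphs`, `extract_article_text`, and `fetch_full_text` are hypothetical, and the assumption that the body paragraphs sit in `<p>` tags inside an `<article>` element is a guess about the page layout that may not hold for the BBC or any other site.

```python
from html.parser import HTMLParser


class ArticleParagraphs(HTMLParser):
    """Collect the text of every <p> nested inside an <article> element."""

    def __init__(self):
        super().__init__()
        self.article_depth = 0    # greater than 0 while inside <article>
        self.in_paragraph = False
        self.paragraphs = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif tag == "p" and self.article_depth > 0:
            self.in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth > 0:
            self.article_depth -= 1
        elif tag == "p" and self.in_paragraph:
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self._buffer.append(data)


def extract_article_text(html):
    """Return the <article> paragraphs of an HTML string, blank-line separated."""
    parser = ArticleParagraphs()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.paragraphs)


def fetch_full_text(url):
    """Download with newspaper, then keep whichever extraction is longer."""
    # Imported here so the helper above works without newspaper3k installed.
    from newspaper import Article

    article = Article(url)
    article.download()
    article.parse()
    fallback = extract_article_text(article.html)
    return fallback if len(fallback) > len(article.text) else article.text
```

Calling `fetch_full_text("https://www.bbc.co.uk/news/world-48810070")` would try both routes and return the longer text. Whether the fallback actually recovers the full article depends entirely on the site's current markup, so the tag names will likely need adjusting per site.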