codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License

Incorrect article text extraction (using hidden text from one of the blocks in the sidebar instead of the main content) #281

Open hus787 opened 8 years ago

hus787 commented 8 years ago

https://www.upmbiofore.fi/solut-kasvavat-nanosellussa/ (html: solut-kasvavat-nanosellussa.txt)
https://www.upmbiofore.fi/eu-rahoitusta-biokemikaalien-tutkimiseen/ (html: eu-rahoitusta-biokemikaalien-tutkimiseen.txt)
https://www.upmbiofore.fi/kohti-kestavaa-taloutta/ (html: kohti-kestavaa-taloutt.txt)

In all of the articles above (HTML attached for future reference), newspaper extracts the wrong text. After download and parse, the article's .text comes from a block in the sidebar that is not even visible ("display: none") instead of from the main content. The summary produced after running nlp is wrong in the same way.

ppawiggers commented 5 years ago

Old issue, but still valid. I experience the same issue; newspaper3k should exclude hidden (display: none) elements as potential article content.

I'll see if I can find the time to submit a pull request with this fix.
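
Until such a fix lands, one possible workaround is to strip hidden elements from the HTML before newspaper sees it. The sketch below is not newspaper3k code and not the proposed pull request; it uses only the standard library, and the names `strip_hidden` and `extract_visible_text` are made up for illustration. It assumes the hidden block is marked with an inline `style="display: none"` or a `hidden` attribute and that the markup is reasonably well-formed (every non-void tag inside the hidden block is closed). Elements hidden through a CSS class in an external stylesheet would not be caught.

```python
import re
from html.parser import HTMLParser

# Matches an inline "display: none" declaration inside a style attribute.
_HIDDEN_STYLE = re.compile(r"display\s*:\s*none", re.IGNORECASE)

# Void elements have no closing tag, so they must not change nesting depth.
_VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
              "link", "meta", "source", "track", "wbr"}


class _HiddenStripper(HTMLParser):
    """Re-emits the HTML it is fed, skipping hidden elements and their children."""

    def __init__(self):
        # Keep entities such as &amp; untouched so they can be re-emitted as-is.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.hidden_depth = 0  # > 0 while inside a hidden element

    @staticmethod
    def _is_hidden(attrs):
        attrs = dict(attrs)
        if "hidden" in attrs:
            return True
        return bool(_HIDDEN_STYLE.search(attrs.get("style") or ""))

    def handle_starttag(self, tag, attrs):
        if self.hidden_depth:
            if tag not in _VOID_TAGS:
                self.hidden_depth += 1
            return
        if self._is_hidden(attrs):
            if tag not in _VOID_TAGS:
                self.hidden_depth = 1
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags like <br/> never open a nesting level.
        if not self.hidden_depth and not self._is_hidden(attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.hidden_depth:
            if tag not in _VOID_TAGS:
                self.hidden_depth -= 1
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.hidden_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.hidden_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.hidden_depth:
            self.out.append(f"&#{name};")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_hidden(html):
    """Return html with inline-hidden elements removed (comments are dropped too)."""
    parser = _HiddenStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


def extract_visible_text(url, html):
    """Hypothetical helper: parse pre-cleaned HTML with newspaper3k."""
    from newspaper import Article  # third-party; imported lazily on purpose

    article = Article(url)
    article.download(input_html=strip_hidden(html))  # skip the network fetch
    article.parse()
    return article.text
```

Passing the cleaned markup through `download(input_html=...)` keeps the rest of the newspaper pipeline unchanged, so `parse` and `nlp` run as usual. Whether this is enough for the pages above depends on how the sidebar is hidden, which would need to be checked against the attached HTML.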