AndyTheFactory / newspaper4k

📰 Newspaper4k, a fork of the beloved Newspaper3k. Extraction of articles, titles, and metadata from news websites.
MIT License

Incorrect article text extraction (using hidden text from one of the blocks in the sidebar instead of the main content) #48

Open AndyTheFactory opened 10 months ago

AndyTheFactory commented 10 months ago

Issue by hus787 Fri Aug 26 09:56:11 2016 Originally opened as https://github.com/codelucas/newspaper/issues/281


https://www.upmbiofore.fi/solut-kasvavat-nanosellussa/ (html: solut-kasvavat-nanosellussa.txt)
https://www.upmbiofore.fi/eu-rahoitusta-biokemikaalien-tutkimiseen/ (html: eu-rahoitusta-biokemikaalien-tutkimiseen.txt)
https://www.upmbiofore.fi/kohti-kestavaa-taloutta/ (html: kohti-kestavaa-taloutt.txt)

In all the articles above (HTML included for future reference), newspaper extracts the wrong text for the article's .text (after download and parse): it takes the text from a sidebar block that is not even visible ("display: none") instead of the main content. The summary produced after running nlp is wrong in the same way.

AndyTheFactory commented 10 months ago

Comment by ppawiggers Fri Oct 26 08:44:55 2018


Old issue, but still valid. I am seeing the same problem; newspaper3k should exclude hidden (display: none) elements from the candidates for article content.

I'll see if I can find the time to submit a pull request with this fix.
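
For illustration, here is a minimal sketch of the idea, not newspaper's own code and not the proposed pull request: it drops every subtree whose inline style sets "display: none" before collecting text. It uses only the standard library's html.parser; the names VisibleTextExtractor and visible_text are hypothetical. It assumes the hiding is done through an inline style attribute (rules in a stylesheet would not be caught) and that non-void tags are properly closed.

```python
import re
from html.parser import HTMLParser

# Matches an inline style that sets display to none, e.g. "display: none".
HIDDEN_STYLE = re.compile(r"display\s*:\s*none", re.IGNORECASE)

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class VisibleTextExtractor(HTMLParser):
    """Hypothetical helper: collects text outside inline-hidden elements."""

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0  # > 0 while inside a hidden subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        # Once inside a hidden subtree, every nested element deepens it.
        if self.hidden_depth or HIDDEN_STYLE.search(style):
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a hidden subtree.
        if not self.hidden_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text of `html` with inline-hidden subtrees removed."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = (
    '<div><aside style="display: none"><p>Hidden sidebar</p></aside>'
    "<p>Main <br>content</p></div>"
)
print(visible_text(sample))
```

In a real fix the same check would more likely run on the parsed document tree before the content candidates are scored, so that hidden blocks never compete with the main article body.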