Open hus787 opened 8 years ago
Old issue, but still valid. I'm seeing the same problem: newspaper3k should exclude hidden (display: none) elements from the candidates for article content.
I'll see if I can find the time to submit a pull request with this fix.
https://www.upmbiofore.fi/solut-kasvavat-nanosellussa/ (html: solut-kasvavat-nanosellussa.txt)
https://www.upmbiofore.fi/eu-rahoitusta-biokemikaalien-tutkimiseen/ (html: eu-rahoitusta-biokemikaalien-tutkimiseen.txt)
https://www.upmbiofore.fi/kohti-kestavaa-taloutta/ (html: kohti-kestavaa-taloutt.txt)
In all the articles above (HTML included for future reference), newspaper extracts the wrong text for the article's .text (after download and parse): it picks up the sidebar, which is not even visible ("display: none"). The summary produced after running nlp has the same problem.
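Until this is handled in the library itself, one possible workaround is to strip hidden subtrees from the HTML before newspaper sees it. The following is only a minimal sketch using the standard library: `HiddenStripper` and `strip_hidden` are hypothetical helper names, not part of newspaper3k. It assumes the hiding is done with an inline `style` attribute and that the hidden markup is well nested; elements hidden through CSS classes or external stylesheets are not detected.

```python
import re
from html.parser import HTMLParser

# Matches an inline style that sets display to none, e.g. "display: none".
HIDDEN_STYLE = re.compile(r"display\s*:\s*none", re.IGNORECASE)

# Void elements have no closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class HiddenStripper(HTMLParser):
    """Re-emit HTML, dropping each subtree whose root has inline display:none.

    Comments and processing instructions are dropped as well.
    """

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.hidden_depth = 0  # > 0 while inside a hidden subtree

    def _is_hidden(self, attrs):
        return bool(HIDDEN_STYLE.search(dict(attrs).get("style") or ""))

    def handle_starttag(self, tag, attrs):
        is_void = tag in VOID_TAGS
        if self.hidden_depth:
            if not is_void:
                self.hidden_depth += 1
            return
        if self._is_hidden(attrs):
            if not is_void:
                self.hidden_depth = 1
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never open a subtree.
        if not self.hidden_depth and not self._is_hidden(attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.hidden_depth:
            if tag not in VOID_TAGS:
                self.hidden_depth -= 1
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.hidden_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.hidden_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.hidden_depth:
            self.out.append(f"&#{name};")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_hidden(html: str) -> str:
    """Return html with inline-hidden subtrees removed."""
    parser = HiddenStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = (
    '<div><p>Article body &amp; more</p>'
    '<div style="display: none"><p>Sidebar<br>teaser</p></div>'
    '<p>End</p></div>'
)
print(strip_hidden(sample))
```

The cleaned markup could then be handed to newspaper instead of letting it fetch the page itself, for example through the `input_html` argument of `Article.download()` followed by `parse()`. Whether that is enough for the pages above depends on how their sidebar is hidden, which I have not checked.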