codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License
14.09k stars 2.11k forks

ContentExtractor.nodes_to_check doesn't recognize the "right" <p> elements in html article #952

Open tomer2406 opened 2 years ago

tomer2406 commented 2 years ago

Hello, I'm using the newspaper3k package to parse the following article: https://spectrum.ieee.org/3d-printed-meat I debugged it until I reached the ContentExtractor.nodes_to_check method, and I saw that when it executes items = self.parser.getElementsByTag(doc, tag=tag) with tag = 'p', I get 75 elements, which do not include the article text. By comparison, when I use BeautifulSoup with soup.find_all('p'), I get 76 elements, including the right text.
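
For anyone who wants to reproduce the comparison outside of newspaper3k, here is a minimal sketch. It assumes that newspaper3k's parser is built on lxml, and it uses BeautifulSoup with the built-in "html.parser" backend. The helper names (p_texts_lxml, p_texts_bs4, diff_paragraphs) and the SAMPLE string are hypothetical and only for illustration; to investigate the real article, pass the downloaded page HTML instead of SAMPLE.

```python
import lxml.html
from bs4 import BeautifulSoup


def p_texts_lxml(html):
    """Return the whitespace-normalized text of every <p> that lxml finds."""
    doc = lxml.html.fromstring(html)
    return [" ".join(p.text_content().split()) for p in doc.iter("p")]


def p_texts_bs4(html, parser="html.parser"):
    """Return the whitespace-normalized text of every <p> that BeautifulSoup finds."""
    soup = BeautifulSoup(html, parser)
    return [" ".join(p.get_text().split()) for p in soup.find_all("p")]


def diff_paragraphs(html):
    """Return (paragraphs only lxml sees, paragraphs only BeautifulSoup sees)."""
    from_lxml = p_texts_lxml(html)
    from_bs4 = p_texts_bs4(html)
    only_lxml = [text for text in from_lxml if text not in from_bs4]
    only_bs4 = [text for text in from_bs4 if text not in from_lxml]
    return only_lxml, only_bs4


# Hypothetical, well-formed sample: both parsers should agree on it.
SAMPLE = (
    "<html><body><div>"
    "<p>First paragraph.</p><p>Second paragraph.</p>"
    "</div></body></html>"
)

if __name__ == "__main__":
    only_lxml, only_bs4 = diff_paragraphs(SAMPLE)
    print("lxml <p> count:", len(p_texts_lxml(SAMPLE)))
    print("bs4 <p> count:", len(p_texts_bs4(SAMPLE)))
    print("only in lxml:", only_lxml)
    print("only in bs4:", only_bs4)
```

Running diff_paragraphs on the real page HTML should show which paragraph texts appear in only one of the two results, which may help narrow down where the two parsers build different trees.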

Can you please help me understand the problem? Thank you.