codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License
14.05k stars 2.11k forks

passing page source (HTML) instead of URL #571

Open akashmondal1810 opened 6 years ago

akashmondal1810 commented 6 years ago

I want to use the newspaper library, but instead of passing the article's URL I want to pass the article's page source. Is there any way I can do that?

iwpnd commented 6 years ago

from newspaper import fulltext

then use

fulltext(html, language)

where html is the page's HTML as a string and language is the two-letter language code.

akashmondal1810 commented 6 years ago

Thanks, it worked.


iwpnd commented 6 years ago

close

chsuong commented 6 years ago

It didn't work for me. Below is a minimal example, using the sample HTML from https://github.com/codelucas/newspaper/issues/291, followed by the traceback I get. Any help would be sincerely appreciated!


my_html='''<!DOCTYPE html>
<html>
<body>
<p>My first paragraph.</p>
</body>
</html>'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(my_html, "lxml")
html_text=soup.get_text()

from newspaper import fulltext
text = fulltext(html_text,'en')

Traceback (most recent call last):

  File "<ipython-input-33-7d1b9f3a7dec>", line 2, in <module>
    text = fulltext(html_text,'en')

  File "/Users/chs/anaconda/lib/python3.5/site-packages/newspaper/api.py", line 91, in fulltext
    top_node = extractor.post_cleanup(top_node)

  File "/Users/chs/anaconda/lib/python3.5/site-packages/newspaper/extractors.py", line 1040, in post_cleanup
    node = self.add_siblings(top_node)

  File "/Users/chs/anaconda/lib/python3.5/site-packages/newspaper/extractors.py", line 869, in add_siblings
    baseline_score_siblings_para = self.get_siblings_score(top_node)

  File "/Users/chs/anaconda/lib/python3.5/site-packages/newspaper/extractors.py", line 926, in get_siblings_score
    nodes_to_check = self.parser.getElementsByTag(top_node, tag='p')

  File "/Users/chs/anaconda/lib/python3.5/site-packages/newspaper/parsers.py", line 123, in getElementsByTag
    elems = node.xpath(selector, namespaces=NS)

AttributeError: 'NoneType' object has no attribute 'xpath'

lordrisborik commented 3 years ago

chsuong, how did you eventually solve the issue above? I'm afraid I have to search for and extract keywords from locally stored text content.
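
One possible reading of the traceback above, not confirmed anywhere in this thread: top_node is None by the time post_cleanup runs, which suggests the extractor found no candidate article node. Two things in the example may contribute. The string passed to fulltext is the tag-stripped output of soup.get_text() rather than the raw HTML, and the sample page contains only one short sentence. The sketch below is a hypothetical guard built on those assumptions. The names looks_like_html and extract_text are made up for illustration and are not part of newspaper. It uses only the standard library to check for markup, and it imports newspaper lazily so that already-plain text never reaches the extractor.

```python
from html.parser import HTMLParser


class _TagCounter(HTMLParser):
    """Counts start tags so markup can be told apart from plain text."""

    def __init__(self):
        super().__init__()
        self.tags = 0

    def handle_starttag(self, tag, attrs):
        self.tags += 1


def looks_like_html(source):
    """Return True if the string contains at least one HTML start tag."""
    counter = _TagCounter()
    counter.feed(source)
    counter.close()
    return counter.tags > 0


def extract_text(source, language="en"):
    """Hypothetical wrapper around newspaper.fulltext.

    Returns plain text unchanged, the extracted article text for HTML,
    or None when extraction fails the way the traceback above does.
    """
    if not looks_like_html(source):
        # Already plain text (for example the output of soup.get_text()),
        # so there is no markup left for the extractor to score.
        return source
    from newspaper import fulltext  # lazy import; requires newspaper3k
    try:
        return fulltext(source, language)
    except AttributeError:
        # Assumption: this is the 'NoneType' has no attribute 'xpath' case,
        # i.e. no candidate article node was found in the page.
        return None


my_html = """<!DOCTYPE html>
<html>
<body>
<p>My first paragraph.</p>
</body>
</html>"""

plain_text = "My first paragraph."  # roughly what soup.get_text() leaves behind

print(looks_like_html(my_html))     # True
print(looks_like_html(plain_text))  # False
```

With this guard, locally stored content that is already plain text is returned as-is and can go straight to keyword extraction, while raw HTML is handed to fulltext(html, language) as suggested earlier in the thread. Whether fulltext succeeds on a page as short as the sample is a separate question that this sketch does not answer.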