AndyTheFactory / newspaper4k

📰 Newspaper4k a fork of the beloved Newspaper3k. Extraction of articles, titles, and metadata from news websites.
MIT License

Passing page source (HTML) instead of URL #211

Closed AndyTheFactory closed 10 months ago

AndyTheFactory commented 1 year ago

Issue by akashmondal1810 Fri May 25 11:58:41 2018 Originally opened as https://github.com/codelucas/newspaper/issues/571


I want to use the newspaper library, but instead of passing the URL of an article I want to pass the article's page source (HTML). Is there any way to do that?

AndyTheFactory commented 1 year ago

Comment by iwpnd Mon May 28 11:09:56 2018


from newspaper import fulltext

then use

fulltext(html, language)

where html is the page source as a string and language is the two-letter language code (for example, 'en').
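
A minimal sketch of the suggestion above, assuming the page source is already held in a Python string. The wrapper name `extract_text` and its language-code check are illustrative and not part of newspaper; only `fulltext(html, language)` comes from the library.

```python
def extract_text(html, language="en"):
    """Return the article body extracted from raw HTML markup."""
    # `language` is the two-letter code mentioned above, e.g. "en" or "de".
    # This check is an illustrative guard, not something newspaper requires.
    if len(language) != 2 or not language.isalpha():
        raise ValueError(f"expected a two-letter language code, got {language!r}")

    # Imported here so the helper can be defined even where newspaper is
    # not installed; the import itself is the one given in the comment above.
    from newspaper import fulltext

    # Pass the markup itself (tags included), not text already stripped of tags.
    return fulltext(html, language)
```

Usage would look like `text = extract_text(page_source, "en")`, where `page_source` is whatever HTML string was fetched or loaded beforehand.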

AndyTheFactory commented 1 year ago

Comment by akashmondal1810 Tue May 29 06:17:15 2018


Thanks, it worked.

AndyTheFactory commented 1 year ago

Comment by iwpnd Thu Jun 7 19:48:33 2018


close

AndyTheFactory commented 1 year ago

Comment by chsuong Tue Jul 10 19:31:30 2018


It didn't work for me. Below is a minimal example, using the sample HTML from https://github.com/codelucas/newspaper/issues/291. Any help would be sincerely appreciated!


my_html='''<!DOCTYPE html>
<html>
<body>
<p>My first paragraph.</p>
</body>
</html>'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(my_html, "lxml")
html_text=soup.get_text()

from newspaper import fulltext
text = fulltext(html_text,'en')

Traceback (most recent call last):

  File "<ipython-input-33-7d1b9f3a7dec>", line 2, in <module>
    text = fulltext(html_text,'en')

  File "/Users/chs/anaconda/lib/python3.5/site-packages/newspaper/api.py", line 91, in fulltext
    top_node = extractor.post_cleanup(top_node)

  File "/Users/chs/anaconda/lib/python3.5/site-packages/newspaper/extractors.py", line 1040, in post_cleanup
    node = self.add_siblings(top_node)

  File "/Users/chs/anaconda/lib/python3.5/site-packages/newspaper/extractors.py", line 869, in add_siblings
    baseline_score_siblings_para = self.get_siblings_score(top_node)

  File "/Users/chs/anaconda/lib/python3.5/site-packages/newspaper/extractors.py", line 926, in get_siblings_score
    nodes_to_check = self.parser.getElementsByTag(top_node, tag='p')

  File "/Users/chs/anaconda/lib/python3.5/site-packages/newspaper/parsers.py", line 123, in getElementsByTag
    elems = node.xpath(selector, namespaces=NS)

AttributeError: 'NoneType' object has no attribute 'xpath'

AndyTheFactory commented 1 year ago

Comment by lordrisborik Wed Jan 27 03:38:15 2021


chsuong, how did you eventually solve the issue above? I am afraid I have to search/extract keywords from locally stored text/content.

AndyTheFactory commented 10 months ago

The error does not occur in 0.9.2.
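
For locally stored content, as asked above, the same `fulltext` call can be fed from a file. This is a minimal sketch, assuming the saved page is UTF-8 encoded HTML. `extract_from_file` and its `extractor` parameter are hypothetical names introduced here; by default the helper falls back to newspaper's `fulltext`. It hands over the raw markup rather than the output of `soup.get_text()`, since `fulltext` parses the HTML itself.

```python
from pathlib import Path


def extract_from_file(path, language="en", extractor=None):
    """Read a saved HTML page and return its extracted article text."""
    if extractor is None:
        # Default to newspaper's fulltext; imported lazily so a different
        # extractor can be supplied where newspaper is not installed.
        from newspaper import fulltext as extractor

    # Keep the tags: the extractor needs the markup to locate the article body.
    html = Path(path).read_text(encoding="utf-8")
    return extractor(html, language)
```

The `extractor` argument is only a convenience for swapping in another function with the same `(html, language)` signature, for example a stub in a unit test.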