codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction in Python 3. Advanced docs:
MIT License
14.06k stars 2.11k forks

how to use html file in newspaper3k as it work with url page #790

Open MeetH15 opened 4 years ago

MeetH15 commented 4 years ago

please help me @yprez

animesh-sharama commented 4 years ago

It would be nice if someone could point to an example that shows how to use an HTML file.

iwpnd commented 4 years ago
from newspaper import Article

your_html = """
<!DOCTYPE html>

  <meta charset="utf-8">
  <meta name="author" content="">
  <meta name="description" content="">
  <meta name="viewport" content="width=device-width, initial-scale=1">

  <link href="css/normalize.css" rel="stylesheet">
  <link href="css/style.css" rel="stylesheet">


  <p>Hello, world!</p>

  <script src=""></script>
  <script src="js/script.js"></script>


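"""

article = Article("random_url")
article.download(input_html=your_html)  # use the HTML string instead of fetching the URL
article.parse()  # parse() works in place and returns None, so do not reassign its result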

Or, if you only want to get the full text of an article:

from newspaper import fulltext

text = fulltext(your_html, language="your supported language")
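
For an HTML file saved on disk, the same idea could be sketched as below. This is a minimal sketch, not an official recipe: it assumes newspaper3k's Article.download accepts an input_html argument (as the discussion here implies), and the names load_html, parse_local_article and the example.com placeholder URL are hypothetical.

```python
from pathlib import Path


def load_html(path, encoding="utf-8"):
    """Read a saved HTML page from disk and return it as a string."""
    return Path(path).read_text(encoding=encoding)


def parse_local_article(path, placeholder_url="https://example.com/local-article"):
    """Parse a saved HTML page with newspaper3k instead of fetching a URL.

    placeholder_url is an assumption made for illustration: the HTML is
    supplied through input_html, so the page is not fetched from this URL.
    """
    from newspaper import Article  # imported lazily; requires newspaper3k

    article = Article(placeholder_url)
    article.download(input_html=load_html(path))  # hand over the local HTML
    article.parse()  # fills the article in place; its return value is None
    return {
        "title": article.title,
        "publish_date": article.publish_date,
        "text": article.text,
    }
```

Calling parse_local_article("saved_page.html") (a hypothetical file name) should then give the title, date and text, provided the page looks enough like a news article for the extractor to find a body.
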
MeetH15 commented 4 years ago

what is response.url @iwpnd ??

iwpnd commented 4 years ago

Just some random URL. It will not be used, as you provide an input_html anyway.

MeetH15 commented 4 years ago

When I run it, it shows 'None' as the output. @iwpnd, can you show the output that you get on your screen?

iwpnd commented 4 years ago

I was showing you how to use an HTML, not providing you a valid HTML newspaper article.

ashkaushik commented 4 years ago

I'm trying to get exactly the same result as when using the demo URL:

but I'm not getting the same result. Can anyone help?

johnbumgarner commented 3 years ago

Do you want to output the extracted content from a news source to an HTML page, like the example shows?

taga93 commented 3 years ago

I want to extract the date, title and text from an article that I passed as HTML. I have tried this:

article = Article("random_url")  # I have also tried with just an empty ""
article = article.parse()  # I have also tried just article.parse()

But I'm getting the error:

“TypeError: unhashable type: 'slice'”

What should I do?

johnbumgarner commented 3 years ago

Look at this section of the overview document that I published on using Newspaper.

imrek commented 3 years ago
from newspaper import Article

your_html = """
<!DOCTYPE html>

  <meta charset="utf-8">
  <meta name="author" content="">
  <meta name="description" content="">
  <meta name="viewport" content="width=device-width, initial-scale=1">

  <link href="css/normalize.css" rel="stylesheet">
  <link href="css/style.css" rel="stylesheet">


  <p>Hello, world!</p>

  <script src=""></script>
  <script src="js/script.js"></script>


"""

article = Article("random_url")
article.download(input_html=your_html)
article = article.parse()

if you only want to get the fulltext of an article.

from newspaper import fulltext

text = fulltext(your_html, language="your supported language")

When I run your first code sample, the final value of article is None.
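
The None is most likely explained by parse() working in place and returning nothing, so assigning its result back to article replaces the article with None. A minimal sketch of that behaviour, using a hypothetical StandInArticle class for illustration rather than newspaper's real Article:

```python
class StandInArticle:
    """Hypothetical stand-in for newspaper's Article, for illustration only."""

    def __init__(self, url):
        self.url = url
        self.text = ""

    def parse(self):
        # Mutates the object in place. With no explicit return statement,
        # the call evaluates to None.
        self.text = "Hello, world!"


article = StandInArticle("random_url")
result = article.parse()         # result is None, article keeps its data

reassigned = StandInArticle("random_url")
reassigned = reassigned.parse()  # the object is lost; reassigned is now None
```

Calling article.parse() on its own line and then reading article.text avoids the problem.
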


The 2nd option (fulltext), when applied to your HTML sample, triggers an AttributeError.

AttributeError                            Traceback (most recent call last)
<ipython-input-5-fb793e263c15> in <module>
----> 1 text = fulltext(html, language="en")

/usr/local/lib/python3.8/dist-packages/newspaper/ in fulltext(html, language)
     90     top_node = extractor.calculate_best_node(doc)
---> 91     top_node = extractor.post_cleanup(top_node)
     92     text, article_html = output_formatter.get_formatted(top_node)
     93     return text

/usr/local/lib/python3.8/dist-packages/newspaper/ in post_cleanup(self, top_node)
   1038         or paras with no gusto; add adjacent nodes which look contenty
   1039         """
-> 1040         node = self.add_siblings(top_node)
   1041         for e in self.parser.getChildren(node):
   1042             e_tag = self.parser.getTag(e)

/usr/local/lib/python3.8/dist-packages/newspaper/ in add_siblings(self, top_node)
    868     def add_siblings(self, top_node):
--> 869         baseline_score_siblings_para = self.get_siblings_score(top_node)
    870         results = self.walk_siblings(top_node)
    871         for current_node in results:

/usr/local/lib/python3.8/dist-packages/newspaper/ in get_siblings_score(self, top_node)
    924         paragraphs_number = 0
    925         paragraphs_score = 0
--> 926         nodes_to_check = self.parser.getElementsByTag(top_node, tag='p')
    928         for node in nodes_to_check:

/usr/local/lib/python3.8/dist-packages/newspaper/ in getElementsByTag(cls, node, tag, attr, value, childs, use_regex)
    121                 trans = 'translate(@%s, "%s", "%s")' % (attr, string.ascii_uppercase, string.ascii_lowercase)
    122                 selector = '%s[contains(%s, "%s")]' % (selector, trans, value.lower())
--> 123         elems = node.xpath(selector, namespaces=NS)
    124         # remove the root node
    125         # if we have a selection tag

AttributeError: 'NoneType' object has no attribute 'xpath'
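
The traceback suggests that calculate_best_node found no candidate node in the tiny sample page, so post_cleanup ended up calling xpath on None. One blunt way to cope with such pages is sketched below. The helper safe_fulltext is hypothetical, and catching AttributeError is a workaround assumed here, not behaviour the library provides.

```python
def safe_fulltext(html, extract, language="en"):
    """Call an extractor and return None when no article body is found.

    `extract` is any callable with the shape extract(html, language=...),
    for example newspaper.fulltext.
    """
    try:
        return extract(html, language=language)
    except AttributeError:
        # Same failure as in the traceback: the best-node search gave None
        return None
```

With newspaper3k installed, usage would look like `from newspaper import fulltext` followed by `text = safe_fulltext(your_html, fulltext)`, and text would be None for pages such as the "Hello, world!" sample.
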
johnbumgarner commented 3 years ago

The first code example didn't follow the syntax of the code example that I posted in my overview document. Please review my code example for processing offline HTML content.

I have never used fulltext, so I would have to review the Newspaper code to see how this function works.

johnbumgarner commented 3 years ago

@imrek I also looked at the function fulltext. I'm not sure what it does differently from article.text. According to the code base, the function expects article.html and not your_html. I tested the function with multiple news sites and received no errors. Also, the length of article.text and the length of the fulltext output were the same.