Open MeetH15 opened 4 years ago
It would be nice if someone could point to an example that shows how to use an HTML file.
from newspaper import Article
your_html = """
index.html
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title></title>
<meta name="author" content="">
<meta name="description" content="">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link href="css/normalize.css" rel="stylesheet">
<link href="css/style.css" rel="stylesheet">
</head>
<body>
<p>Hello, world!</p>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.4/jquery.min.js"></script>
<script src="js/script.js"></script>
</body>
</html>
"""
article = Article("random_url")
article.download(input_html=your_html)
article = article.parse()
If you only want to get the full text of an article:
from newspaper import fulltext
text = fulltext(your_html, language="your supported language")
What is response.url, @iwpnd?
Just some random URL. It will not be used, because you provide an input_html anyway.
When I run it, it shows 'None' as the output. @iwpnd, can you show the output you get on your screen?
I was showing you how to pass in HTML, not providing you with a valid HTML news article.
I'm trying to get exactly the same result as the demo gives for this URL: http://newspaper-demo.herokuapp.com/articles/show?url_to_clean=http%3A%2F%2Fwww.cnn.com%2F2014%2F01%2F12%2Fworld%2Fasia%2Fnorth-korea-charles-smith%2Findex.html
but I'm not getting the same result. Can you help?
Are you wanting to output the extracted content from a news source to a HTML page like the example shows?
I want to extract the date, title and text from an article that I pass in as HTML. I have tried this:
article = Article("random_url")  # I have also tried with just an empty ""
article.download(input_html=your_html)
article = article.parse()  # I have also tried just article.parse()
But I'm getting this error:
TypeError: unhashable type: 'slice'
What should I do?
Look at this section https://github.com/johnbumgarner/newspaper3_usage_overview#extraction-from-offline-html-files of the overview document that I published on using Newspaper.
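For readers who do not follow the link, here is a minimal sketch of the offline workflow discussed in this thread. It is not copied from the overview document. It assumes newspaper3k is installed; parse_offline_html, to_record and Extracted are hypothetical helper names used only for illustration. The key point is that Article.parse() does not return a value, so the results are read from the article object rather than from the return value of parse().

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Extracted:
    """Hypothetical container for the three fields asked about in the thread."""
    title: str
    text: str
    publish_date: Optional[datetime]


def parse_offline_html(html: str, url: str = ""):
    """Parse an HTML string with newspaper without fetching anything for the article itself."""
    # Imported here so the rest of the sketch loads even if newspaper is absent.
    from newspaper import Article  # third-party: pip install newspaper3k

    article = Article(url)                 # the URL is only a placeholder
    article.download(input_html=html)      # use the supplied HTML instead of downloading
    article.parse()                        # returns None; results are stored on `article`
    return article


def to_record(article) -> Extracted:
    """Copy title, text and publish date from a parsed article-like object."""
    return Extracted(
        title=article.title or "",
        text=article.text or "",
        publish_date=article.publish_date or None,  # may be missing for some pages
    )


# Usage (requires newspaper and a real news article as HTML):
# article = parse_offline_html(your_html)
# print(to_record(article))
```

Note that a page with almost no paragraph text, such as the "Hello, world!" sample, is likely to yield empty fields even when the calls are written correctly.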
When I run your first code sample, the final value of article is None.
UPDATE:
The 2nd option (fulltext), when applied to your HTML sample, triggers an AttributeError:
AttributeError Traceback (most recent call last)
<ipython-input-5-fb793e263c15> in <module>
----> 1 text = fulltext(html, language="en")
/usr/local/lib/python3.8/dist-packages/newspaper/api.py in fulltext(html, language)
89
90 top_node = extractor.calculate_best_node(doc)
---> 91 top_node = extractor.post_cleanup(top_node)
92 text, article_html = output_formatter.get_formatted(top_node)
93 return text
/usr/local/lib/python3.8/dist-packages/newspaper/extractors.py in post_cleanup(self, top_node)
1038 or paras with no gusto; add adjacent nodes which look contenty
1039 """
-> 1040 node = self.add_siblings(top_node)
1041 for e in self.parser.getChildren(node):
1042 e_tag = self.parser.getTag(e)
/usr/local/lib/python3.8/dist-packages/newspaper/extractors.py in add_siblings(self, top_node)
867
868 def add_siblings(self, top_node):
--> 869 baseline_score_siblings_para = self.get_siblings_score(top_node)
870 results = self.walk_siblings(top_node)
871 for current_node in results:
/usr/local/lib/python3.8/dist-packages/newspaper/extractors.py in get_siblings_score(self, top_node)
924 paragraphs_number = 0
925 paragraphs_score = 0
--> 926 nodes_to_check = self.parser.getElementsByTag(top_node, tag='p')
927
928 for node in nodes_to_check:
/usr/local/lib/python3.8/dist-packages/newspaper/parsers.py in getElementsByTag(cls, node, tag, attr, value, childs, use_regex)
121 trans = 'translate(@%s, "%s", "%s")' % (attr, string.ascii_uppercase, string.ascii_lowercase)
122 selector = '%s[contains(%s, "%s")]' % (selector, trans, value.lower())
--> 123 elems = node.xpath(selector, namespaces=NS)
124 # remove the root node
125 # if we have a selection tag
AttributeError: 'NoneType' object has no attribute 'xpath'
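The traceback above ends with node.xpath being called on None, which suggests that calculate_best_node found no candidate content node in the tiny sample page and passed None on to post_cleanup. Assuming that reading is right, one way to keep a script running on such pages is to catch the error around the call. This is a minimal sketch, not a fix inside newspaper; safe_fulltext and its extract parameter are hypothetical names introduced here.

```python
from typing import Callable, Optional


def safe_fulltext(
    html: str,
    language: str = "en",
    extract: Optional[Callable[..., str]] = None,
) -> str:
    """Return the extracted text, or "" when no article body can be found.

    `extract` defaults to newspaper.fulltext; it is a parameter only so the
    wrapper can be exercised with a stand-in function.
    """
    if extract is None:
        # Third-party import, done lazily: pip install newspaper3k
        from newspaper import fulltext as extract
    try:
        return extract(html, language=language)
    except AttributeError:
        # Seen in the traceback above when the best content node is None.
        return ""


# Usage (requires newspaper):
# text = safe_fulltext(your_html, language="en")
```

Catching AttributeError this broadly can also hide unrelated bugs, so it is worth logging the exception in real code.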
@imrek
The first code example didn't follow the syntax of the code example that I posted in my overview document. Please review my code example for processing offline HTML content.
I have never used fulltext, so I would have to review the Newspaper code to see how this function works.
@imrek I also looked at the function fulltext. I'm not sure what it does differently from article.text. According to the code base, the function expects article.html and not your_html. I tested the function with multiple news sites and received no errors. Also, the length of article.text and the length of the fulltext output were the same.
Please help me, @yprez.