trevlovett opened this issue 6 years ago
I didn't have this problem when running my code before. I've been facing the same attribute error since yesterday.
I get the same with Python 3.7.
Hi @codelucas, I have found that the URL https://capitalandgrowth.org/questions/1250/hair-salon-appointments-what-is-the-best-exit-inte.html keeps its full body text inside a DIV tag, while the library only searches the ['p', 'pre', 'td'] tags. I hope this helps resolve similar issues as well.
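Until the library searches DIV tags as well, a rough user-side fallback (just a sketch, not part of newspaper3k; the helper name below is made up) is to catch the failure and pull the div text with lxml, which newspaper already installs:

from lxml import html as lxml_html
from newspaper import fulltext

def fulltext_with_div_fallback(page_html):
    # Illustrative helper, not a newspaper3k API.
    try:
        return fulltext(page_html)
    except AttributeError:
        # Fall back to joining the text content of all <div> elements.
        tree = lxml_html.fromstring(page_html)
        parts = [t.strip() for t in tree.xpath('//div//text()') if t.strip()]
        return "\n".join(parts)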
Facing the same problem here. I've built a simple script with Newspaper3k to scrape the content, authors, etc. of a URL given by the user, and it works fine. But when I try to scrape a list of URLs from a .txt file, newspaper throws this same error: AttributeError: 'NoneType' object has no attribute 'xpath'
Full error below:
File "newsscraper.py", line 13, in <module>
text = fulltext(html)
File "/home/rafael/anaconda3/lib/python3.7/site-packages/newspaper/api.py", line 91, in fulltext
top_node = extractor.post_cleanup(top_node)
File "/home/rafael/anaconda3/lib/python3.7/site-packages/newspaper/extractors.py", line 1040, in post_cleanup
node = self.add_siblings(top_node)
File "/home/rafael/anaconda3/lib/python3.7/site-packages/newspaper/extractors.py", line 869, in add_siblings
baseline_score_siblings_para = self.get_siblings_score(top_node)
File "/home/rafael/anaconda3/lib/python3.7/site-packages/newspaper/extractors.py", line 926, in get_siblings_score
nodes_to_check = self.parser.getElementsByTag(top_node, tag='p')
File "/home/rafael/anaconda3/lib/python3.7/site-packages/newspaper/parsers.py", line 123, in getElementsByTag
elems = node.xpath(selector, namespaces=NS)
AttributeError: 'NoneType' object has no attribute 'xpath'
Is there a workaround?
Will you be able to post the screenshots or snippets of the code and the file?
Sure. It is very simple stuff, still in development. I'm not using anything fancy, just playing around with methods from newspaper3k. I was reading the documentation and decided I should learn about web scraping by building something useful for my work. So far I have two working scripts. The first, sources.py, builds a list of news sources, extracts a list of URLs of the content published by them, and saves those URLs to a .txt file (for now I am just doing python sources.py > articles.txt). The actual list contains hundreds of sources, so here is a simplified version:
import newspaper
from newspaper import news_pool
folha = newspaper.build('https://www.folha.uol.com.br/', language='pt', memoize_articles=False)
estadao = newspaper.build('https://www.estadao.com.br/', language='pt', memoize_articles=False)
intercept = newspaper.build('https://theintercept.com/brasil/', language='pt', memoize_articles=False)
piaui = newspaper.build('https://piaui.folha.uol.com.br/', language='pt', memoize_articles=False)
papers = [folha, estadao, intercept, piaui]
def poolNews():
    news_pool.set(papers, threads_per_source=2)
    news_pool.join()

# def paperSize():
#     for paper in papers:
#         print(paper.url, '---', paper.size(), 'artigos disponíveis')

def urlList():
    for article in folha.articles:
        print(article.url)
    for article in estadao.articles:
        print(article.url)
    for article in intercept.articles:
        print(article.url)
    for article in piaui.articles:
        print(article.url)

poolNews()
# paperSize()
urlList()
The other working script, newscraper.py, asks the user for one URL and then extracts and prints its authors, keywords, summary, etc., using the standard methods described in the newspaper3k docs:
from newspaper import Article
from newspaper import fulltext
import requests
url = input("Article URL: ")
a = Article(url, language='pt')
html = requests.get(url).text
text = fulltext(html)
download = a.download()
parse = a.parse()
nlp = a.nlp()
title = a.title
publish_date = a.publish_date
authors = a.authors
keywords = a.keywords
summary = a.summary
def newscraper():
    download,
    parse,
    nlp

def blank():
    print('\n')
    print('-' * 78)
    print('\n')

def printOut():
    print('\n')
    print('Título da matéria:' + '\n')
    print(title)
    blank()
    print('Data de publicação:' + '\n')
    print(publish_date)
    blank()
    print('Autoria do artigo:' + '\n')  # Sometimes this doesn't work well.
    print(authors)
    blank()
    print('Palavras-chave:' + '\n')
    print(keywords)
    blank()
    print('Sumário:' + '\n')
    print(summary)
    blank()
    print('Texto completo:' + '\n')
    print(text)
    # print(a.text)  # If the method above fails, comment it out and try this one.
    blank()
    print(url)
    blank()
    # print('HTML:' + '\n')
    # print(a.html)  # Extracts the full HTML code of the webpage.

newscraper()
printOut()
Both work fine. The error I described earlier appears when I feed the .txt file with all the URLs to a modified version of newscraper.py. What I would like to do is extract and print the content, authors, etc. of a list of hundreds of article URLs produced as the output of sources.py. This modified script goes in the same direction as newscraper.py; I just replaced the user input with a loop that reads the URLs in the .txt file one at a time:
from newspaper import Article
from newspaper import fulltext
import requests

with open('articles.txt', 'r') as f:
    for line in f:
        url = line.rstrip("\n")
        a = Article(url, language='pt')
        html = requests.get(url).text
        text = fulltext(html)
        download = a.download()
        parse = a.parse()
        nlp = a.nlp()
        title = a.title
        publish_date = a.publish_date
        authors = a.authors
        keywords = a.keywords
        summary = a.summary
The last one throws the error I was describing earlier: AttributeError: 'NoneType' object has no attribute 'xpath'
Hi @reactionhashs, I checked your last code with a few URLs in articles.txt. The error you are facing happens because some of the requested URLs' articles don't contain the full body text in the P, PRE, and TD tags; it's present in a DIV tag instead. So a possible solution for now might be to skip those URLs using exception handling in Python until @codelucas accepts the solution proposed above.
Until then, I would recommend using the code below.
from newspaper import Article
from newspaper import fulltext
import requests

with open('articles.txt', 'r') as f:
    for line in f:
        print(line)
        url = line.rstrip("\n")
        a = Article(url, language='pt')
        html = requests.get(url).text
        try:
            text = fulltext(html)
            download = a.download()
            parse = a.parse()
            nlp = a.nlp()
            title = a.title
            publish_date = a.publish_date
            authors = a.authors
            keywords = a.keywords
            summary = a.summary
        except Exception as e:
            print("Error " + str(e))
It still depends on your project and how important it is for you to include every article.
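If it matters later which articles were skipped, one small extension of the snippet above (again, just a suggestion; the file name is made up) is to record each failing URL inside the except block:

        except Exception as e:
            print("Error " + str(e))
            # Append the URL that failed so it can be retried or inspected later.
            with open('skipped_urls.txt', 'a') as skipped:
                skipped.write(url + "\n")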
@Ask149, thanks for taking the time to help me out. Gonna try to execute the code and I'll let you know how it goes. It is not completely necessary for me to scrape every single article, so I think this is going to work for me.
I am having the same issue with a very simple html page:
$ python
Python 3.7.0 (default, Jun 28 2018, 07:39:16)
>>> import requests; import newspaper
>>> from newspaper import fulltext
>>> html = requests.get('http://localhost/.../articles/group2_2.html').text
>>> html
"<html>\n<body>\n<h1>Hi this is a fake article</h1>\n\n<p>Some random text</p>\n\n\n<a href='group2_3.html'>to 3</a>\n</body>\n</html>\n"
>>> fulltext(html)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/mypath../lib/python3.7/site-packages/newspaper/api.py", line 91, in fulltext
top_node = extractor.post_cleanup(top_node)
File "/mypath../lib/python3.7/site-packages/newspaper/extractors.py", line 1040, in post_cleanup
node = self.add_siblings(top_node)
File "/mypath../lib/python3.7/site-packages/newspaper/extractors.py", line 869, in add_siblings
baseline_score_siblings_para = self.get_siblings_score(top_node)
File "/mypath../lib/python3.7/site-packages/newspaper/extractors.py", line 926, in get_siblings_score
nodes_to_check = self.parser.getElementsByTag(top_node, tag='p')
File "/mypath../lib/python3.7/site-packages/newspaper/parsers.py", line 123, in getElementsByTag
elems = node.xpath(selector, namespaces=NS)
AttributeError: 'NoneType' object has no attribute 'xpath'
For some reason it does work for another, very similar fake article:
>>> html = requests.get('http://localhost/.../articles/fake_article1.html').text
>>> fulltext(html)
'Hi this is a fake article 1\n\nAnd it is about Basketball'
>>> html
"<h1>Hi this is a fake article 1</h1>\n\n<p>And it is about Basketball</p>\n\n<a href='fake_article2.html'>Link to Fake Article 2</a>\n"
>>>
Thanks for finding the bug guys, really appreciate this!
> The error you are facing happens because some of the requested URLs' articles don't contain the full body text in the P, PRE, and TD tags; it's present in a DIV tag instead. So a possible solution for now might be to skip those URLs using exception handling in Python until @codelucas accepts the solution proposed above.
Seems like this div tag is not the issue. I built my scraper with newspaper and I'm using fulltext() to get the content of the URL. The code below gives me the error AttributeError: 'NoneType' object has no attribute 'xpath':
from newspaper import fulltext
import cloudscraper
url = 'http://www.pharmafile.com/news/544464/novartis-reveals-hard-hitting-five-year-data-its-gene-therapy-zolgensma-spinal-muscular-'
scraper = cloudscraper.create_scraper()
ht = scraper.get(url)
txt = ht.text
print(fulltext(txt))
But the above piece of code works fine for this URL, "https://www.eurekalert.org/pub_releases/2020-03/dci-nrs031720.php", even though the content there is structured in a similar way.
Please help in resolving this issue. If there's already a resolution, please guide me through it. This issue is a priority in my current work, and I've been trying to fix it for a week but nothing really helps.
Thanks in advance.
Same issue for me, still no workaround?
I am experiencing the same issue. As @shashank7596 mentioned, even though the text is written in a div, it raises this AttributeError.
Any suggestions?
AttributeError: 'NoneType' object has no attribute 'xpath'
Repro with python3:
File "/usr/local/lib/python3.7/site-packages/newspaper/api.py", line 91, in fulltext top_node = extractor.post_cleanup(top_node) File "/usr/local/lib/python3.7/site-packages/newspaper/extractors.py", line 1040, in post_cleanup node = self.add_siblings(top_node) File "/usr/local/lib/python3.7/site-packages/newspaper/extractors.py", line 869, in add_siblings baseline_score_siblings_para = self.get_siblings_score(top_node) File "/usr/local/lib/python3.7/site-packages/newspaper/extractors.py", line 926, in get_siblings_score nodes_to_check = self.parser.getElementsByTag(top_node, tag='p') File "/usr/local/lib/python3.7/site-packages/newspaper/parsers.py", line 123, in getElementsByTag elems = node.xpath(selector, namespaces=NS) AttributeError: 'NoneType' object has no attribute 'xpath'