AndyTheFactory / newspaper4k

📰 Newspaper4k a fork of the beloved Newspaper3k. Extraction of articles, titles, and metadata from news websites.
MIT License

newspaper.fulltext AttributeError #255

Open AndyTheFactory opened 1 year ago

AndyTheFactory commented 1 year ago

Issue by trevlovett Tue Oct 30 05:05:00 2018 Originally opened as https://github.com/codelucas/newspaper/issues/646


AttributeError: 'NoneType' object has no attribute 'xpath'

Repro with python3:

import requests
import newspaper

resp = requests.get("https://capitalandgrowth.org/questions/1250/hair-salon-appointments-what-is-the-best-exit-inte.html")
newspaper.fulltext(resp.text)

  File "/usr/local/lib/python3.7/site-packages/newspaper/api.py", line 91, in fulltext
    top_node = extractor.post_cleanup(top_node)
  File "/usr/local/lib/python3.7/site-packages/newspaper/extractors.py", line 1040, in post_cleanup
    node = self.add_siblings(top_node)
  File "/usr/local/lib/python3.7/site-packages/newspaper/extractors.py", line 869, in add_siblings
    baseline_score_siblings_para = self.get_siblings_score(top_node)
  File "/usr/local/lib/python3.7/site-packages/newspaper/extractors.py", line 926, in get_siblings_score
    nodes_to_check = self.parser.getElementsByTag(top_node, tag='p')
  File "/usr/local/lib/python3.7/site-packages/newspaper/parsers.py", line 123, in getElementsByTag
    elems = node.xpath(selector, namespaces=NS)
AttributeError: 'NoneType' object has no attribute 'xpath'
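
The traceback suggests that no top node was found for this page, so None was passed down to getElementsByTag. As a rough illustration of that failure mode and of a defensive guard, here is a minimal sketch that uses only the standard library's xml.etree.ElementTree. The helper elements_by_tag is hypothetical and is not newspaper's actual code.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional


def elements_by_tag(node: Optional[ET.Element], tag: str) -> List[ET.Element]:
    """Hypothetical helper: matching elements under node, or [] if node is None."""
    if node is None:
        # Without this guard, node.iter(...) would raise
        # AttributeError: 'NoneType' object has no attribute 'iter'
        return []
    # iter() walks the node itself and all of its descendants.
    return list(node.iter(tag))


doc = ET.fromstring("<body><div><p>one</p><p>two</p></div></body>")
top_node = doc.find("div")          # present in the document: an Element
missing_node = doc.find("article")  # absent from the document: None

print(len(elements_by_tag(top_node, "p")))
print(len(elements_by_tag(missing_node, "p")))
```

Returning an empty list is only one possible choice; raising a clearer, library-specific error for a missing top node would be another.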

AndyTheFactory commented 1 year ago

Comment by mattborhan Sat Nov 10 05:46:51 2018


I didn't have this problem before when running my code, but I have been getting the same AttributeError since yesterday.

AndyTheFactory commented 1 year ago

Comment by tsoernes Thu Nov 22 13:06:02 2018


I get the same with Python 3.7.

AndyTheFactory commented 1 year ago

Comment by Ask149 Sat Dec 29 15:22:06 2018


Hi @codelucas, I have found that the page at https://capitalandgrowth.org/questions/1250/hair-salon-appointments-what-is-the-best-exit-inte.html keeps its full body text inside a DIV tag, while the library only searches in ['p', 'pre', 'td'] tags. I hope this helps solve other similar issues.
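
To check whether a given page fits this description before calling fulltext(), something like the following sketch may help. It uses only the standard library's html.parser and counts how much visible text sits directly inside P, PRE, TD, and DIV tags. The class TagTextCounter and the function text_by_tag are illustrative names, and this is a rough diagnostic, not the scoring that newspaper itself performs.

```python
from html.parser import HTMLParser
from typing import Dict, List


class TagTextCounter(HTMLParser):
    """Count characters of text per innermost enclosing tag of interest."""

    TRACKED = ("p", "pre", "td", "div")

    def __init__(self) -> None:
        super().__init__()
        self.stack: List[str] = []  # open tracked tags, innermost last
        self.counts: Dict[str, int] = {tag: 0 for tag in self.TRACKED}

    def handle_starttag(self, tag, attrs):
        if tag in self.TRACKED:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            # Attribute the text to the innermost tracked tag only.
            self.counts[self.stack[-1]] += len(text)


def text_by_tag(html: str) -> Dict[str, int]:
    parser = TagTextCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


div_only = "<html><body><div>Body text lives here.</div></body></html>"
counts = text_by_tag(div_only)
print(counts)
```

A page whose counts are zero for p, pre, and td but non-zero for div would match the situation described above.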


AndyTheFactory commented 1 year ago

Comment by reactionhashs Sun Jan 6 20:21:24 2019


Facing the same problem here. I've built a simple script with Newspaper3k to scrape content, authors, etc of a URL given by the user - it works fine. But when I try to scrape a list of URLs from a .txt file, newspaper throws this same error: AttributeError: 'NoneType' object has no attribute 'xpath'

Full error below:


  File "newsscraper.py", line 13, in <module>
    text = fulltext(html)
  File "/home/rafael/anaconda3/lib/python3.7/site-packages/newspaper/api.py", line 91, in fulltext
    top_node = extractor.post_cleanup(top_node)
  File "/home/rafael/anaconda3/lib/python3.7/site-packages/newspaper/extractors.py", line 1040, in post_cleanup
    node = self.add_siblings(top_node)
  File "/home/rafael/anaconda3/lib/python3.7/site-packages/newspaper/extractors.py", line 869, in add_siblings
    baseline_score_siblings_para = self.get_siblings_score(top_node)
  File "/home/rafael/anaconda3/lib/python3.7/site-packages/newspaper/extractors.py", line 926, in get_siblings_score
    nodes_to_check = self.parser.getElementsByTag(top_node, tag='p')
  File "/home/rafael/anaconda3/lib/python3.7/site-packages/newspaper/parsers.py", line 123, in getElementsByTag
    elems = node.xpath(selector, namespaces=NS)
AttributeError: 'NoneType' object has no attribute 'xpath'

Is there a workaround?

AndyTheFactory commented 1 year ago

Comment by Ask149 Mon Jan 7 12:00:41 2019


Will you be able to post the screenshots or snippets of the code and the file?

AndyTheFactory commented 1 year ago

Comment by reactionhashs Mon Jan 7 14:11:00 2019


Sure. It is very simple stuff and still in development. I'm not using anything fancy, just playing around with methods from newspaper3k. I was reading the documentation and decided to learn about web scraping by building something useful for my work. So far I have two working scripts. The first, sources.py, builds a list of news sources, extracts the URLs of the content they published, and saves those URLs to a .txt file (for now I am just running python sources.py > articles.txt). The actual list contains hundreds of sources, so here is a simplified version:

import newspaper
from newspaper import news_pool

folha = newspaper.build('https://www.folha.uol.com.br/', language='pt', memoize_articles=False)
estadao = newspaper.build('https://www.estadao.com.br/', language='pt', memoize_articles=False)
intercept = newspaper.build('https://theintercept.com/brasil/', language='pt', memoize_articles=False)
piaui = newspaper.build('https://piaui.folha.uol.com.br/', language='pt', memoize_articles=False)

papers = [folha, estadao, intercept, piaui]

def poolNews():
    news_pool.set(papers, threads_per_source=2)
    news_pool.join()

# def paperSize():
#     for paper in papers:
#         print(paper.url, '---', paper.size(), 'artigos disponíveis')

def urlList():
    for article in folha.articles:
        print(article.url)
    for article in estadao.articles:
        print(article.url)
    for article in intercept.articles:
        print(article.url)
    for article in piaui.articles:
        print(article.url)

poolNews()
# paperSize()
urlList()

The other working script, newscraper.py, asks the user for one URL and then extracts and prints its authors, keywords, summary, etc., all using standard methods described in the newspaper3k docs:

from newspaper import Article
from newspaper import fulltext
import requests

url = input("Article URL: ")
a = Article(url, language='pt')
html = requests.get(url).text
text = fulltext(html)
download = a.download()
parse = a.parse()
nlp = a.nlp()
title = a.title
publish_date = a.publish_date
authors = a.authors
keywords = a.keywords
summary = a.summary

def newscraper():
    download,
    parse,
    nlp

def blank():
    print('\n')
    print('-' * 78)
    print('\n')

def printOut():
    print('\n')
    print('Título da matéria:' + '\n')
    print(title)
    blank()
    print('Data de publicação:' + '\n')
    print(publish_date)
    blank()
    print('Autoria do artigo:' + '\n')  # Sometimes this doesn't work well.
    print(authors)
    blank()
    print('Palavras-chave:' + '\n')
    print(keywords)
    blank()
    print('Sumário:' + '\n')
    print(summary)
    blank()
    print('Texto completo:' + '\n')
    print(text)
    # print(a.text)  # If method above fails, comment it out and try this one.
    blank()
    print(url)
    blank()
    # print('HTML:' + '\n')
    # print(a.html)  # Extracts full HTML code of the webpage.

newscraper()
printOut()

Both work fine. The error I described earlier appears when I feed the .txt file with all the URLs to a modified version of newscraper.py. What I would like to do is extract and print the content, authors, etc. for a list of hundreds of article URLs produced by sources.py. The modified script works the same way as newscraper.py; I just replaced the user input with a loop that reads the URLs from the .txt file one at a time:

with open('articles.txt', 'r') as f:
    for line in f:
        url = line.rstrip("\n")
        a = Article(url, language='pt')
        html = requests.get(url).text
        text = fulltext(html)
        download = a.download()
        parse = a.parse()
        nlp = a.nlp()
        title = a.title
        publish_date = a.publish_date
        authors = a.authors
        keywords = a.keywords
        summary = a.summary

The last one throws the error I was describing earlier: AttributeError: 'NoneType' object has no attribute 'xpath'

AndyTheFactory commented 1 year ago

Comment by Ask149 Mon Jan 7 17:27:28 2019


Hi @reactionhashs, I checked your last code with a few URLs in articles.txt. You are getting the error because, for some of the URLs you requested, the article's full body text is not inside P, PRE, or TD tags but inside a DIV tag. So a possible solution for now is to skip those URLs using exception handling in Python until @codelucas accepts the solution proposed above.


AndyTheFactory commented 1 year ago

Comment by Ask149 Mon Jan 7 17:42:19 2019


Until then, I recommend using the code below.

from newspaper import Article
from newspaper import fulltext
import requests

with open('articles.txt', 'r') as f:
    for line in f:
        print(line)
        url = line.rstrip("\n")
        a = Article(url, language='pt')
        html = requests.get(url).text
        try:
            # Any failure for this URL is reported and the loop moves on.
            text = fulltext(html)
            download = a.download()
            parse = a.parse()
            nlp = a.nlp()
            title = a.title
            publish_date = a.publish_date
            authors = a.authors
            keywords = a.keywords
            summary = a.summary
        except Exception as e:
            print("Error " + str(e))

It still depends on your project and how important it is for you to include every article.
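
If it matters which articles were skipped, the same try/except idea can record failures instead of only printing them. The sketch below is one possible shape for that. The function extract_all and the stand-ins fake_fetch and fake_extract are hypothetical names, used here so the example runs offline without requests or newspaper.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def extract_all(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    extract: Callable[[str], str],
) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Run fetch + extract per URL, collecting failures instead of stopping."""
    texts: Dict[str, str] = {}
    failures: List[Tuple[str, str]] = []
    for raw in urls:
        url = raw.strip()
        if not url:
            continue  # skip blank lines in the input file
        try:
            texts[url] = extract(fetch(url))
        except Exception as exc:  # broad on purpose: one bad page must not end the run
            failures.append((url, f"{type(exc).__name__}: {exc}"))
    return texts, failures


# Offline stand-ins (assumptions for this sketch, not real library calls).
def fake_fetch(url: str) -> str:
    return "<div>only a div</div>" if "bad" in url else "<p>some text</p>"


def fake_extract(html: str) -> str:
    if "<p>" not in html:
        raise AttributeError("'NoneType' object has no attribute 'xpath'")
    return html.replace("<p>", "").replace("</p>", "")


lines = ["http://example.com/ok\n", "\n", "http://example.com/bad\n"]
texts, failures = extract_all(lines, fake_fetch, fake_extract)
print(texts)
print(failures)
```

In the script from this thread, fetch could be something like lambda u: requests.get(u).text and extract could be newspaper's fulltext; the failures list can then be written to a file and reviewed later.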

AndyTheFactory commented 1 year ago

Comment by reactionhashs Mon Jan 7 20:42:50 2019


@Ask149, thanks for taking the time to help me out. Gonna try to execute the code and I'll let you know how it goes. It is not completely necessary for me to scrape every single article, so I think this is going to work for me.

AndyTheFactory commented 1 year ago

Comment by fersarr Wed May 15 21:05:38 2019


I am having the same issue with a very simple html page:

$ python
Python 3.7.0 (default, Jun 28 2018, 07:39:16)
>>> import requests; import newspaper
>>> from newspaper import fulltext
>>> html = requests.get('http://localhost/.../articles/group2_2.html').text
>>> html
"<html>\n<body>\n<h1>Hi this is a fake article</h1>\n\n<p>Some random text</p>\n\n\n<a href='group2_3.html'>to 3</a>\n</body>\n</html>\n"
>>> fulltext(html)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mypath../lib/python3.7/site-packages/newspaper/api.py", line 91, in fulltext
    top_node = extractor.post_cleanup(top_node)
  File "/mypath../lib/python3.7/site-packages/newspaper/extractors.py", line 1040, in post_cleanup
    node = self.add_siblings(top_node)
  File "/mypath../lib/python3.7/site-packages/newspaper/extractors.py", line 869, in add_siblings
    baseline_score_siblings_para = self.get_siblings_score(top_node)
  File "/mypath../lib/python3.7/site-packages/newspaper/extractors.py", line 926, in get_siblings_score
    nodes_to_check = self.parser.getElementsByTag(top_node, tag='p')
  File "/mypath../lib/python3.7/site-packages/newspaper/parsers.py", line 123, in getElementsByTag
    elems = node.xpath(selector, namespaces=NS)
AttributeError: 'NoneType' object has no attribute 'xpath'

For some reason it does work for another, very similar fake article:

>>> html = requests.get('http://localhost/.../articles/fake_article1.html').text
>>> fulltext(html)
'Hi this is a fake article 1\n\nAnd it is about Basketball'
>>> html
"<h1>Hi this is a fake article 1</h1>\n\n<p>And it is about Basketball</p>\n\n<a href='fake_article2.html'>Link to Fake Article 2</a>\n"
>>>

AndyTheFactory commented 1 year ago

Comment by jamesaphoenix Sat Aug 17 18:10:39 2019


Thanks for finding the bug guys, really appreciate this!

AndyTheFactory commented 1 year ago

Comment by shashank7596 Thu Mar 26 13:04:37 2020


Regarding @Ask149's explanation above (that the full body text sits in a DIV tag rather than in P, PRE, or TD tags): the DIV tag does not seem to be the issue. I built my scraper with newspaper and use fulltext() to get the content of a URL. The code below gives me the error AttributeError: 'NoneType' object has no attribute 'xpath'

from newspaper import fulltext
import cloudscraper

url = 'http://www.pharmafile.com/news/544464/novartis-reveals-hard-hitting-five-year-data-its-gene-therapy-zolgensma-spinal-muscular-'

scraper = cloudscraper.create_scraper()

ht = scraper.get(url)
txt = ht.text
print(fulltext(txt))

However, the code above works fine for the URL "https://www.eurekalert.org/pub_releases/2020-03/dci-nrs031720.php", even though its content is present in [tag name missing] tags succeeding the [tag name missing] tag.

Please help me resolve this issue. If there is already a resolution, please guide me to it. This issue is a priority in my current work; I have been trying to fix it for a week, but nothing has really helped.

Thanks in advance.

AndyTheFactory commented 1 year ago

Comment by 8enmann Thu Jun 3 06:00:38 2021


Same issue for me, still no workaround?
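
One stopgap, beyond skipping the failing URLs, might be to fall back to a much cruder extractor when the primary one raises AttributeError. The sketch below uses only the standard library's html.parser and is tried against the minimal page posted earlier in this thread. VisibleText, plain_text, text_with_fallback, and failing_extractor are hypothetical names, and the fallback keeps link and navigation text that newspaper would normally drop.

```python
from html.parser import HTMLParser
from typing import Callable, List


class VisibleText(HTMLParser):
    """Collect text outside <script> and <style>, one chunk per text node."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def plain_text(html: str) -> str:
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.chunks)


def text_with_fallback(html: str, primary: Callable[[str], str]) -> str:
    """Try the primary extractor; on AttributeError, fall back to plain_text."""
    try:
        return primary(html)
    except AttributeError:
        return plain_text(html)


def failing_extractor(html: str) -> str:
    # Stand-in that imitates the error reported in this issue.
    raise AttributeError("'NoneType' object has no attribute 'xpath'")


page = ("<html>\n<body>\n<h1>Hi this is a fake article</h1>\n\n"
        "<p>Some random text</p>\n\n\n<a href='group2_3.html'>to 3</a>\n</body>\n</html>\n")
result = text_with_fallback(page, failing_extractor)
print(result)
```

With newspaper installed, primary would be newspaper's fulltext. Whether the noisier fallback text is acceptable depends on what the extracted text is used for.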

AndyTheFactory commented 1 year ago

Comment by rashaduph26 Mon Nov 14 11:03:14 2022


I am experiencing the same issue, as @shashank7596 mentioned, even though the texts are written in div, it raises this AttributeError.

Any suggestions?