AndyTheFactory / newspaper4k

📰 Newspaper4k, a fork of the beloved Newspaper3k. Extraction of articles, titles, and metadata from news websites.
MIT License

Newspaper crashes trying to download image attachments #104

Closed AndyTheFactory closed 1 year ago

AndyTheFactory commented 1 year ago

Issue by chrisspen Thu Jun 29 05:54:27 2017 Originally opened as https://github.com/codelucas/newspaper/issues/391


For some URLs that specify image links, it appears Newspaper tries to download these and convert them to non-binary data, resulting in a UnicodeEncodeError:

...........................................................................
/usr/local/myproject/.env/lib/python3.5/site-packages/newspaper/article.py in parse(self=<newspaper.article.Article object>)
    235                 self.top_node)
    236             self.set_article_html(article_html)
    237             self.set_text(text)
    238 
    239         if self.config.fetch_images:
--> 240             self.fetch_images()
        self.fetch_images = <bound method Article.fetch_images of <newspaper.article.Article object>>
    241 
    242         self.is_parsed = True
    243         self.release_resources()
    244 

...........................................................................
/usr/local/myproject/.env/lib/python3.5/site-packages/newspaper/article.py in fetch_images(self=<newspaper.article.Article object>)
    254             self.set_imgs(imgs)
    255 
    256         if self.clean_top_node is not None and not self.has_top_image():
    257             first_img = self.extractor.get_first_img_url(
    258                 self.url, self.clean_top_node)
--> 259             self.set_top_img(first_img)
        self.set_top_img = <bound method Article.set_top_img of <newspaper.article.Article object>>
        first_img = 'http://vote.us.org/PF.Base/file/attachment/2016/10/60bed53037f54b1b3d6ce7e53d86779b.jpg'
    260 
    261         if not self.has_top_image():
    262             self.set_reddit_top_img()
    263 

...........................................................................
/usr/local/myproject/.env/lib/python3.5/site-packages/newspaper/article.py in set_top_img(self=<newspaper.article.Article object>, src_url='http://vote.us.org/PF.Base/file/attachment/2016/10/60bed53037f54b1b3d6ce7e53d86779b.jpg')
    426         self.set_top_img_no_check(src_url)
    427 
    428     def set_top_img(self, src_url):
    429         if src_url is not None:
    430             s = images.Scraper(self)
--> 431             if s.satisfies_requirements(src_url):
        s.satisfies_requirements = <bound method Scraper.satisfies_requirements of <newspaper.images.Scraper object>>
        src_url = 'http://vote.us.org/PF.Base/file/attachment/2016/10/60bed53037f54b1b3d6ce7e53d86779b.jpg'
    432                 self.set_top_img_no_check(src_url)
    433 
    434     def set_top_img_no_check(self, src_url):
    435         """Provide 2 APIs for images. One at "top_img", "imgs"

...........................................................................
/usr/local/myproject/.env/lib/python3.5/site-packages/newspaper/images.py in satisfies_requirements(self=<newspaper.images.Scraper object>, img_url='http://vote.us.org/PF.Base/file/attachment/2016/10/60bed53037f54b1b3d6ce7e53d86779b.jpg')
    219             area /= 10
    220         return area
    221 
    222     def satisfies_requirements(self, img_url):
    223         dimension = fetch_image_dimension(
--> 224             img_url, self.useragent, referer=self.url)
        img_url = 'http://vote.us.org/PF.Base/file/attachment/2016/10/60bed53037f54b1b3d6ce7e53d86779b.jpg'
        self.useragent = 'newspaper/0.1.9'
        self.url = 'http://vote.us.org/vote/372/do-you-agree-with-ke...-crusade-on-cbi-investigation-of-cm’s-principal-/'
    225         area = self.calculate_area(img_url, dimension)
    226         return area > minimal_area
    227 
    228     def thumbnail(self):

...........................................................................
/usr/local/myproject/.env/lib/python3.5/site-packages/newspaper/images.py in fetch_image_dimension(url='http://vote.us.org/PF.Base/file/attachment/2016/10/60bed53037f54b1b3d6ce7e53d86779b.jpg', useragent='newspaper/0.1.9', referer='http://vote.us.org/vote/372/do-you-agree-with-ke...-crusade-on-cbi-investigation-of-cm’s-principal-/', retries=1)
    162                 if response.raw._connection:
    163                     response.raw._connection.close()
    164 
    165 
    166 def fetch_image_dimension(url, useragent, referer=None, retries=1):
--> 167     return fetch_url(url, useragent, referer, retries, dimension=True)
        url = 'http://vote.us.org/PF.Base/file/attachment/2016/10/60bed53037f54b1b3d6ce7e53d86779b.jpg'
        useragent = 'newspaper/0.1.9'
        referer = 'http://vote.us.org/vote/372/do-you-agree-with-ke...-crusade-on-cbi-investigation-of-cm’s-principal-/'
        retries = 1
    168 
    169 
    170 class Scraper:
    171 

...........................................................................
/usr/local/myproject/.env/lib/python3.5/site-packages/newspaper/images.py in fetch_url(url='http://vote.us.org/PF.Base/file/attachment/2016/10/60bed53037f54b1b3d6ce7e53d86779b.jpg', useragent='newspaper/0.1.9', referer='http://vote.us.org/vote/372/do-you-agree-with-ke...-crusade-on-cbi-investigation-of-cm’s-principal-/', retries=1, dimension=True)
     93     response = None
     94     while True:
     95         try:
     96             response = requests.get(url, stream=True, timeout=5, headers={
     97                 'User-Agent': useragent,
---> 98                 'Referer': referer,
        referer = 'http://vote.us.org/vote/372/do-you-agree-with-ke...-crusade-on-cbi-investigation-of-cm’s-principal-/'
     99             })
    100 
    101             # if we only need the dimension of the image, we may not
    102             # need to download the entire thing

...........................................................................
/usr/local/myproject/.env/lib/python3.5/site-packages/requests/api.py in get(url='http://vote.us.org/PF.Base/file/attachment/2016/10/60bed53037f54b1b3d6ce7e53d86779b.jpg', params=None, **kwargs={'allow_redirects': True, 'headers': {'Referer': 'http://vote.us.org/vote/372/do-you-agree-with-ke...-crusade-on-cbi-investigation-of-cm’s-principal-/', 'User-Agent': 'newspaper/0.1.9'}, 'stream': True, 'timeout': 5})
     67     :return: :class:`Response <Response>` object
     68     :rtype: requests.Response
     69     """
     70 
     71     kwargs.setdefault('allow_redirects', True)
---> 72     return request('get', url, params=params, **kwargs)
        url = 'http://vote.us.org/PF.Base/file/attachment/2016/10/60bed53037f54b1b3d6ce7e53d86779b.jpg'
        params = None
        kwargs = {'allow_redirects': True, 'headers': {'Referer': 'http://vote.us.org/vote/372/do-you-agree-with-ke...-crusade-on-cbi-investigation-of-cm’s-principal-/', 'User-Agent': 'newspaper/0.1.9'}, 'stream': True, 'timeout': 5}
     73 
     74 
     75 def options(url, **kwargs):
     76     r"""Sends a OPTIONS request.

...........................................................................
/usr/local/myproject/.env/lib/python3.5/site-packages/requests/api.py in request(method='get', url='http://vote.us.org/PF.Base/file/attachment/2016/10/60bed53037f54b1b3d6ce7e53d86779b.jpg', **kwargs={'allow_redirects': True, 'headers': {'Referer': 'http://vote.us.org/vote/372/do-you-agree-with-ke...-crusade-on-cbi-investigation-of-cm’s-principal-/', 'User-Agent': 'newspaper/0.1.9'}, 'params': None, 'stream': True, 'timeout': 5})
     53 
     54     # By using the 'with' statement we are sure the session is closed, thus we
     55     # avoid leaving sockets open which can trigger a ResourceWarning in some
     56     # cases, and look like a memory leak in others.
     57     with sessions.Session() as session:
---> 58         return session.request(method=method, url=url, **kwargs)
        session.request = <bound method Session.request of <requests.sessions.Session object>>
        method = 'get'
        url = 'http://vote.us.org/PF.Base/file/attachment/2016/10/60bed53037f54b1b3d6ce7e53d86779b.jpg'
        kwargs = {'allow_redirects': True, 'headers': {'Referer': 'http://vote.us.org/vote/372/do-you-agree-with-ke...-crusade-on-cbi-investigation-of-cm’s-principal-/', 'User-Agent': 'newspaper/0.1.9'}, 'params': None, 'stream': True, 'timeout': 5}
     59 
     60 
     61 def get(url, params=None, **kwargs):
     62     r"""Sends a GET request.

...........................................................................
/usr/local/myproject/.env/lib/python3.5/site-packages/requests/sessions.py in request(self=<requests.sessions.Session object>, method='get', url='http://vote.us.org/PF.Base/file/attachment/2016/10/60bed53037f54b1b3d6ce7e53d86779b.jpg', params=None, data=None, headers={'Referer': 'http://vote.us.org/vote/372/do-you-agree-with-ke...-crusade-on-cbi-investigation-of-cm’s-principal-/', 'User-Agent': 'newspaper/0.1.9'}, cookies=None, files=None, auth=None, timeout=5, allow_redirects=True, proxies={}, hooks=None, stream=True, verify=None, cert=None, json=None)
    508         send_kwargs = {
    509             'timeout': timeout,
    510             'allow_redirects': allow_redirects,
    511         }
    512         send_kwargs.update(settings)
--> 513         resp = self.send(prep, **send_kwargs)
        resp = undefined
        self.send = <bound method Session.send of <requests.sessions.Session object>>
        prep = <PreparedRequest [GET]>
        send_kwargs = {'allow_redirects': True, 'cert': None, 'proxies': OrderedDict(), 'stream': True, 'timeout': 5, 'verify': True}
    514 
    515         return resp
    516 
    517     def get(self, url, **kwargs):

...........................................................................
/usr/local/myproject/.env/lib/python3.5/site-packages/requests/sessions.py in send(self=<requests.sessions.Session object>, request=<PreparedRequest [GET]>, **kwargs={'cert': None, 'proxies': OrderedDict(), 'stream': True, 'timeout': 5, 'verify': True})
    618 
    619         # Start time (approximately) of the request
    620         start = preferred_clock()
    621 
    622         # Send the request
--> 623         r = adapter.send(request, **kwargs)
        r = undefined
        adapter.send = <bound method HTTPAdapter.send of <requests.adapters.HTTPAdapter object>>
        request = <PreparedRequest [GET]>
        kwargs = {'cert': None, 'proxies': OrderedDict(), 'stream': True, 'timeout': 5, 'verify': True}
    624 
    625         # Total elapsed time of the request (approximately)
    626         elapsed = preferred_clock() - start
    627         r.elapsed = timedelta(seconds=elapsed)

...........................................................................
/usr/local/myproject/.env/lib/python3.5/site-packages/requests/adapters.py in send(self=<requests.adapters.HTTPAdapter object>, request=<PreparedRequest [GET]>, stream=True, timeout=<urllib3.util.timeout.Timeout object>, verify=True, cert=None, proxies=OrderedDict())
    435                     redirect=False,
    436                     assert_same_host=False,
    437                     preload_content=False,
    438                     decode_content=False,
    439                     retries=self.max_retries,
--> 440                     timeout=timeout
        timeout = <urllib3.util.timeout.Timeout object>
    441                 )
    442 
    443             # Send the request.
    444             else:

...........................................................................
/usr/local/myproject/.env/lib/python3.5/site-packages/urllib3/connectionpool.py in urlopen(self=<urllib3.connectionpool.HTTPConnectionPool object>, method='GET', url='/PF.Base/file/attachment/2016/10/60bed53037f54b1b3d6ce7e53d86779b.jpg', body=None, headers={'Accept-Encoding': 'gzip, deflate', 'Connection...rusade-on-cbi-investigation-of-cm’s-principal-/'}, retries=Retry(total=0, connect=None, read=False, redirect=None, status=None), redirect=False, assert_same_host=False, timeout=<urllib3.util.timeout.Timeout object>, pool_timeout=None, release_conn=False, chunked=False, body_pos=None, **response_kw={'decode_content': False, 'preload_content': False})
    595 
    596             # Make the request on the httplib connection object.
    597             httplib_response = self._make_request(conn, method, url,
    598                                                   timeout=timeout_obj,
    599                                                   body=body, headers=headers,
--> 600                                                   chunked=chunked)
        chunked = False
    601 
    602             # If we're going to release the connection in ``finally:``, then
    603             # the response doesn't need to know about the connection. Otherwise
    604             # it will also try to release it and we'll have a double-release

...........................................................................
/usr/local/myproject/.env/lib/python3.5/site-packages/urllib3/connectionpool.py in _make_request(self=<urllib3.connectionpool.HTTPConnectionPool object>, conn=<urllib3.connection.HTTPConnection object>, method='GET', url='/PF.Base/file/attachment/2016/10/60bed53037f54b1b3d6ce7e53d86779b.jpg', timeout=<urllib3.util.timeout.Timeout object>, chunked=False, **httplib_request_kw={'body': None, 'headers': {'Accept-Encoding': 'gzip, deflate', 'Connection...rusade-on-cbi-investigation-of-cm’s-principal-/'}})
    351         # conn.request() calls httplib.*.request, not the method in
    352         # urllib3.request. It also calls makefile (recv) on the socket.
    353         if chunked:
    354             conn.request_chunked(method, url, **httplib_request_kw)
    355         else:
--> 356             conn.request(method, url, **httplib_request_kw)
        conn.request = <bound method HTTPConnection.request of <urllib3.connection.HTTPConnection object>>
        method = 'GET'
        url = '/PF.Base/file/attachment/2016/10/60bed53037f54b1b3d6ce7e53d86779b.jpg'
        httplib_request_kw = {'body': None, 'headers': {'Accept-Encoding': 'gzip, deflate', 'Connection...rusade-on-cbi-investigation-of-cm’s-principal-/'}}
    357 
    358         # Reset the timeout for the recv() on the socket
    359         read_timeout = timeout_obj.read_timeout
    360 

...........................................................................
/usr/lib/python3.5/http/client.py in request(self=<urllib3.connection.HTTPConnection object>, method='GET', url='/PF.Base/file/attachment/2016/10/60bed53037f54b1b3d6ce7e53d86779b.jpg', body=None, headers={'Accept-Encoding': 'gzip, deflate', 'Connection...rusade-on-cbi-investigation-of-cm’s-principal-/'})
   1101             raise CannotSendHeader()
   1102         self._send_output(message_body)
   1103 
   1104     def request(self, method, url, body=None, headers={}):
   1105         """Send a complete request to the server."""
-> 1106         self._send_request(method, url, body, headers)
        self._send_request = <bound method HTTPConnection._send_request of <urllib3.connection.HTTPConnection object>>
        method = 'GET'
        url = '/PF.Base/file/attachment/2016/10/60bed53037f54b1b3d6ce7e53d86779b.jpg'
        body = None
        headers = {'Accept-Encoding': 'gzip, deflate', 'Connection...rusade-on-cbi-investigation-of-cm’s-principal-/'}
   1107 
   1108     def _set_content_length(self, body, method):
   1109         # Set the content-length based on the body. If the body is "empty", we
   1110         # set Content-Length: 0 for methods that expect a body (RFC 7230,

...........................................................................
/usr/lib/python3.5/http/client.py in _send_request(self=<urllib3.connection.HTTPConnection object>, method='GET', url='/PF.Base/file/attachment/2016/10/60bed53037f54b1b3d6ce7e53d86779b.jpg', body=None, headers={'Accept-Encoding': 'gzip, deflate', 'Connection...rusade-on-cbi-investigation-of-cm’s-principal-/'})
   1141         self.putrequest(method, url, **skips)
   1142 
   1143         if 'content-length' not in header_names:
   1144             self._set_content_length(body, method)
   1145         for hdr, value in headers.items():
-> 1146             self.putheader(hdr, value)
        self.putheader = <bound method HTTPConnection.putheader of <urllib3.connection.HTTPConnection object>>
        hdr = 'Referer'
        value = 'http://vote.us.org/vote/372/do-you-agree-with-ke...-crusade-on-cbi-investigation-of-cm’s-principal-/'
   1147         if isinstance(body, str):
   1148             # RFC 2616 Section 3.7.1 says that text default has a
   1149             # default charset of iso-8859-1.
   1150             body = _encode(body, 'body')

...........................................................................
/usr/lib/python3.5/http/client.py in putheader(self=<urllib3.connection.HTTPConnection object>, header=b'Referer', *values=['http://vote.us.org/vote/372/do-you-agree-with-ke...-crusade-on-cbi-investigation-of-cm’s-principal-/'])
   1073             raise ValueError('Invalid header name %r' % (header,))
   1074 
   1075         values = list(values)
   1076         for i, one_value in enumerate(values):
   1077             if hasattr(one_value, 'encode'):
-> 1078                 values[i] = one_value.encode('latin-1')
        values = ['http://vote.us.org/vote/372/do-you-agree-with-ke...-crusade-on-cbi-investigation-of-cm’s-principal-/']
        i = 0
        one_value.encode = <built-in method encode of str object>
   1079             elif isinstance(one_value, int):
   1080                 values[i] = str(one_value).encode('ascii')
   1081 
   1082             if _is_illegal_header_value(values[i]):

UnicodeEncodeError: 'latin-1' codec can't encode character '\u2019' in position 90: ordinal not in range(256)
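The last frame is where the failure actually originates: `http.client` encodes every header value with latin-1, and the Referer URL contains U+2019 (a right single quotation mark), which latin-1 cannot represent. A minimal standalone sketch of the failure and one possible mitigation follows (the URL below is a shortened placeholder, not the original one from the traceback):

```python
from urllib.parse import quote

# Placeholder for the real referer URL; what matters is the U+2019
# (right single quotation mark) it contains, as in the traceback above.
referer = "http://example.com/vote/do-you-agree-with-cm\u2019s-principal/"

try:
    # http.client encodes header values with latin-1; U+2019 is not
    # representable there, which is exactly the failure shown above.
    referer.encode("latin-1")
except UnicodeEncodeError as exc:
    print(exc)

# One possible mitigation: percent-encode non-ASCII characters in the header
# value before it reaches http.client (reserved URL characters are kept as-is).
safe_referer = quote(referer, safe=":/?#[]@!$&'()*+,;=%")
print(safe_referer)
```
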
AndyTheFactory commented 1 year ago

Comment by chrisspen Thu Jun 29 05:55:17 2017


Is there any workaround for this? Why is it trying to download images when I'm trying to extract the article text?
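
For context, the traceback shows the image fetch is gated on `if self.config.fetch_images:` in `article.py`, so turning that flag off should skip the failing code path entirely. A minimal sketch, assuming the usual `Article`/`Config` interface and a placeholder URL:

```python
from newspaper import Article, Config

config = Config()
config.fetch_images = False  # skip the top-image download that triggers the UnicodeEncodeError

# Placeholder URL; substitute the article you actually want to parse.
article = Article("http://example.com/some-article/", config=config)
article.download()
article.parse()

print(article.title)
print(article.text)
```

Text extraction should be unaffected by this setting; image-related fields such as `top_image` may simply stay empty.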

AndyTheFactory commented 1 year ago

Comment by codelucas Thu Jun 29 06:51:39 2017


Thanks for finding this @chrisspen - will take a look soon

AndyTheFactory commented 1 year ago

The website does not work anymore.