NikolaiT / GoogleScraper

A Python module to scrape several search engines (like Google, Yandex, Bing, Duckduckgo, ...). Including asynchronous networking support.
https://scrapeulous.com/
Apache License 2.0

UnicodeEncodeError #154

Open h3litz opened 8 years ago

h3litz commented 8 years ago

First, thank you for your tool. I tried to use it, and it produced the following error. Is this a problem with my path?

(py34env) root:~# GoogleScraper -m http -q "apple"
2016-05-26 02:29:38,885 - GoogleScraper.caching - INFO - 0 cache files found in .scrapecache/
2016-05-26 02:29:38,885 - GoogleScraper.caching - INFO - 0/1 objects have been read from the cache. 1 remain to get scraped.
2016-05-26 02:29:38,893 - GoogleScraper.core - INFO - Going to scrape 1 keywords with 1 proxies by using 1 threads.
2016-05-26 02:29:38,894 - GoogleScraper.scraping - INFO - [+] HttpScrape[localhost][search-type:normal][https://www.google.com/search?] using search engine "google". Num keywords=1, num pages for keyword=[1]
2016-05-26 02:29:40,897 - GoogleScraper.scraping - INFO - [[google]HttpScrape][localhost]]Keyword: "apple" with [1] pages, slept 2 seconds before scraping. 1/1 already scraped.
2016-05-26 02:29:40,908 - requests.packages.urllib3.connectionpool - INFO - Starting new HTTPS connection (1): www.google.com
2016-05-26 02:29:40,979 - requests.packages.urllib3.connectionpool - INFO - Starting new HTTPS connection (1): www.google.co.jp
{'effective_query': '',
 'id': '3',
 'no_results': 'False',
 'num_results': '7',
 'num_results_for_query': 'About 1,400,000,000 results (0.38 seconds)\xa0',
 'page_number': '1',
 'query': 'apple',
 'requested_at': '2016-05-25 17:29:41.517119',
 'requested_by': 'localhost',
 'results': [{'domain': 'www.apple.com',
              'id': '16',
              'link': 'http://www.apple.com/',
              'link_type': 'results',
              'rank': '1',
              'serp_id': '3',
              'snippet': 'Apple leads the world in innovation with iPhone, '
                         'iPad, Mac, Apple Watch, iOS, OS X, watchOS and '
                         'more. Visit the site to learn, buy, and get '
                         'support.',
              'title': 'Apple',
              'visible_link': 'www.apple.com/'},
             {'domain': 'en.wikipedia.org',
              'id': '17',
              'link': 'https://en.wikipedia.org/wiki/Apple_Inc.',
              'link_type': 'results',
              'rank': '5',
              'serp_id': '3',
              'snippet': 'Apple Inc. is an American multinational '
                         'technology company headquartered in Cupertino, '
                         'California, that designs, develops, and sells '
                         'consumer electronics,\xa0...',
              'title': 'Apple Inc. - Wikipedia, the free encyclopedia',
              'visible_link': 'https://en.wikipedia.org/wiki/Apple_Inc.'},
             {'domain': 'en.wikipedia.org',
              'id': '18',
              'link': 'https://en.wikipedia.org/wiki/Apple',
              'link_type': 'results',
              'rank': '6',
              'serp_id': '3',
              'snippet': 'The apple tree (Malus domestica) is a deciduous '
                         'tree in the rose family best known for its '
                         'sweet, pomaceous fruit, the apple. It is '
                         'cultivated worldwide as a fruit \xa0...',
              'title': 'Apple - Wikipedia, the free encyclopedia',
              'visible_link': 'https://en.wikipedia.org/wiki/Apple'},
             {'domain': 'www.youtube.com',
              'id': '19',
              'link': 'https://www.youtube.com/user/Apple',
              'link_type': 'results',
              'rank': '7',
              'serp_id': '3',
              'snippet': 'Apple revolutionized personal technology with '
                         'the introduction of the Macintosh in 1984. '
                         'Today, Apple leads the world in innovation with '
                         'iPhone, iPad, the Ma...',
              'title': 'Apple - YouTube',
              'visible_link': 'https://www.youtube.com/user/Apple'},
             {'domain': 'www.forbes.com',
              'id': '20',
              'link': 'http://www.forbes.com/companies/apple/',
              'link_type': 'results',
              'rank': '8',
              'serp_id': '3',
              'snippet': 'Apple, Inc. designs, manufactures, and markets '
                         'mobile communication and media devices, personal '
                         'computers, portable digital music players, and '
                         'sells a\xa0...',
              'title': "Apple on the Forbes World's Most Valuable Brands "
                       'List',
              'visible_link': 'www.forbes.com/companies/apple/'},
             {'domain': 'finance.yahoo.com',
              'id': '21',
              'link': 'http://finance.yahoo.com/q?s=AAPL',
              'link_type': 'results',
              'rank': '9',
              'serp_id': '3',
              'snippet': 'View the basic AAPL stock chart on Yahoo! '
                         'Finance. Change the date range, chart type and '
                         'compare Apple Inc. against other companies.',
              'title': 'AAPL: Summary for Apple Inc.- Yahoo! Finance',
              'visible_link': 'finance.yahoo.com/q?s=AAPL'},
             {'domain': '',
              'id': '22',
              'link': '/aclk?sa=L&ai=Cxx1pBeFFV5ylA5Kv9gWtoJjAAdXMrY0HzYbFl6YChvCRBQgAEAFgiQPIAQGpAoFfYE3mekM-qgQiT9BUuW9UvdZaOSSHdvQYhzmAAmsjYOhhtXi1Ike-tO5q_oAH3bK7M5AHAagHpr4b2AcB&sig=AOD64_1dYAHzVv_Lg8gKZqjQVXQJUYCbLQ&clui=0&q=&ved=0ahUKEwiKrZum4PXMAhWF5qYKHQq6C1EQ0QwIGg&adurl=http://tracker.marinsm.com/rd%3Fcid%3D18707vxu38484%26mkwid%3DsKSMckWTz-dc%26lp%3Dhttp://store.apple.com/jp/go/home%253F%2526mnid%253DsKSMckWTz-dc_mtid_18707vxu38484_pcrid_78978864621_%2526cid%253Daos-jp-kwg-brand-slid-%2526mtid%253D18707vxu38484%2526muid%253D%7Bm_uid%7D%2526aosid%253Dp238',
              'link_type': 'ads_main',
              'rank': '1',
              'serp_id': '3',
              'snippet': Exception in thread [google]HttpScrape:
Traceback (most recent call last):
  File "/usr/lib/python3.4/threading.py", line 920, in _bootstrap_inner
    self.run()
  File "/py34env/lib/python3.4/site-packages/GoogleScraper/http_mode.py", line 306, in run
    if not self.search(rand=True):
  File "/py34env/lib/python3.4/site-packages/GoogleScraper/http_mode.py", line 294, in search
    super().after_search()
  File "/py34env/lib/python3.4/site-packages/GoogleScraper/scraping.py", line 360, in after_search
    if not self.store():
  File "/py34env/lib/python3.4/site-packages/GoogleScraper/scraping.py", line 292, in store
    store_serp_result(serp, self.config)
  File "/py34env/lib/python3.4/site-packages/GoogleScraper/output_converter.py", line 123, in store_serp_result
    pprint.pprint(data)
  File "/usr/lib/python3.4/pprint.py", line 52, in pprint
    printer.pprint(object)
  File "/usr/lib/python3.4/pprint.py", line 139, in pprint
    self._format(object, self._stream, 0, 0, {}, 0)
  File "/usr/lib/python3.4/pprint.py", line 193, in _format
    allowance + 1, context, level)
  File "/usr/lib/python3.4/pprint.py", line 230, in _format
    allowance + 1, context, level)
  File "/usr/lib/python3.4/pprint.py", line 297, in _format_items
    self._format(ent, stream, indent, allowance, context, level)
  File "/usr/lib/python3.4/pprint.py", line 193, in _format
    allowance + 1, context, level)
  File "/usr/lib/python3.4/pprint.py", line 274, in _format
    write(rep)
UnicodeEncodeError: 'ascii' codec can't encode character '\u3001' in position 4: ordinal not in range(128)

ddmee commented 8 years ago

Yeah I have the same issue...

ddmee commented 8 years ago

OK, so it looks like you are on a Linux machine. I had the problem on a Windows machine, except that I was having difficulty with the Unicode character '\u201c'. The problem is probably that your terminal's encoding cannot represent that character (it is not an ASCII character).

I found that if I tried to print the character I was having trouble with at the Python interpreter, I got the same error. (I am doing this in PowerShell.)


Python 3.5.2 (v3.5.2:4def2a2901a5, Jun 25 2016, 22:01:18) [MSC v.1900 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> print('\u201c')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python35\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u201c' in position 0: character maps to <undefined>
>>> quit()
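
The same failure can be reproduced without involving the terminal at all, by encoding the character directly with the codec named in the traceback. This is only a minimal sketch to illustrate the cause; the helper name `can_encode` is made up for this example and is not part of GoogleScraper.

```python
def can_encode(text, encoding):
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# cp850 is the codec from the traceback above; ascii is the codec from
# the traceback in the original report. ascii() keeps the output itself
# printable on any terminal.
for ch, enc in [('\u201c', 'cp850'), ('\u3001', 'ascii'), ('\u201c', 'utf-8')]:
    print(ascii(ch), enc, can_encode(ch, enc))
```

Only the UTF-8 case should report that the character can be encoded, which is why switching the terminal encoding helps.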

The solution in PowerShell is to switch the console code page to UTF-8 (code page 65001), the encoding used by GoogleScraper.


PS C:\Users\User\> chcp
Active code page: 850
PS C:\Users\User\> chcp 65001
Active code page: 65001

Doing this, the issue went away. So you probably have to switch the encoding of the shell you're using on Linux to something else, such as UTF-8.
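
On Linux, the encoding Python uses for stdout normally follows the locale (`LANG` / `LC_ALL`), and the `PYTHONIOENCODING` environment variable can override it, for example by setting it to `utf-8` before running the command. If changing the environment is not possible, a fallback is to escape whatever the current encoding cannot represent before printing. The sketch below assumes that approach; `printable` is a hypothetical helper, not something GoogleScraper provides.

```python
import sys

def printable(text, encoding=None):
    """Return text with characters the target encoding cannot represent
    replaced by backslash escapes, so that printing it cannot raise
    UnicodeEncodeError."""
    # Fall back to ascii if stdout has no known encoding (assumption).
    encoding = encoding or getattr(sys.stdout, 'encoding', None) or 'ascii'
    return text.encode(encoding, errors='backslashreplace').decode(encoding)

# Shows which encoding the interpreter picked for stdout in this shell.
print(sys.stdout.encoding)

# With an ascii target, U+3001 is escaped instead of raising.
print(printable('snippet\u3001text', 'ascii'))
```

If the first line printed is something like `ANSI_X3.4-1968` or `ascii`, the shell's locale is the likely cause of the error in the original report.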

Here is a link to a Stack Overflow question about this issue.