NikolaiT / GoogleScraper

A Python module to scrape several search engines (like Google, Yandex, Bing, Duckduckgo, ...). Including asynchronous networking support.
https://scrapeulous.com/
Apache License 2.0
2.64k stars 741 forks

Asynchronous mode issue #138

Open giacmarangoni opened 8 years ago

giacmarangoni commented 8 years ago

Hi, I really want to thank you for your module; it's very nice. I'm trying to scrape Bing in asynchronous mode, but unfortunately I get this error:

GoogleScraper -s "bing" --keyword-file keywords -m http-async -v NOTSET

2016-02-02 11:54:57,366 - GoogleScraper.core - INFO - Continuing last scrape.
2016-02-02 11:54:57,366 - GoogleScraper.caching - INFO - 0 cache files found in .scrapecache/
2016-02-02 11:54:57,366 - GoogleScraper.caching - INFO - 0/1 objects have been read from the cache. 1 remain to get scraped.
2016-02-02 11:54:57,366 - GoogleScraper.core - INFO - Going to scrape 1 keywords with 1 proxies by using 1 threads.
2016-02-02 11:54:57,366 - asyncio - DEBUG - Using selector: EpollSelector
2016-02-02 11:54:57,692 - GoogleScraper.async_mode - INFO - [+] localhost requested keyword 'apple' on bing. Response status: 200
2016-02-02 11:54:57,693 - GoogleScraper.async_mode - DEBUG - [i] URL: http://www.bing.com/search?q=apple HEADERS: {'Accept-Language': 'en-US,en;q=0.5', 'Connection': 'keep-alive', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Encoding': 'gzip, deflate'}
Traceback (most recent call last):
  File "/home/giacomomarangoni/env/bin/GoogleScraper", line 9, in <module>
    load_entry_point('GoogleScraper==0.2.1', 'console_scripts', 'GoogleScraper')()
  File "/home/giacomomarangoni/env/lib/python3.4/site-packages/GoogleScraper/core.py", line 446, in main
    scheduler.run()
  File "/home/giacomomarangoni/env/lib/python3.4/site-packages/GoogleScraper/async_mode.py", line 121, in run
    scrape = task.result()
  File "/usr/lib/python3.4/asyncio/futures.py", line 274, in result
    raise self._exception
  File "/usr/lib/python3.4/asyncio/tasks.py", line 239, in _step
    result = coro.send(value)
  File "/home/giacomomarangoni/env/lib/python3.4/site-packages/GoogleScraper/async_mode.py", line 65, in request
    body = yield from response.read_and_close(decode=False)
AttributeError: 'ClientResponse' object has no attribute 'read_and_close'
2016-02-02 11:54:57,725 - asyncio - ERROR - Unclosed response
client_response: <ClientResponse(http://www.bing.com/search?q=apple&q=apple) [200 OK]>
<CIMultiDictProxy('CACHE-CONTROL': 'private, max-age=0', 'TRANSFER-ENCODING': 'chunked', 'CONTENT-TYPE': 'text/html; charset=utf-8', 'CONTENT-ENCODING': 'gzip', 'EXPIRES': 'Tue, 02 Feb 2016 19:53:57 GMT', 'VARY': 'Accept-Encoding', 'SERVER': 'Microsoft-IIS/8.5', 'P3P': 'CP="NON UNI COM NAV STA LOC CURa DEVa PSAa PSDa OUR IND"', 'SET-COOKIE': 'SRCHD=AF=NOFORM; domain=.bing.com; expires=Fri, 02-Feb-2018 19:54:57 GMT; path=/', 'SET-COOKIE': 'SRCHUID=V=2&GUID=D153F8FFEC5944578CD08E067D357B0C; expires=Fri, 02-Feb-2018 19:54:57 GMT; path=/', 'SET-COOKIE': 'SRCHUSR=DOB=20160202; domain=.bing.com; expires=Fri, 02-Feb-2018 19:54:57 GMT; path=/', 'SET-COOKIE': '_SS=SID=07F7F82D3B9E623731E5F0823AEC633F; domain=.bing.com; path=/', 'X-MSEDGE-REF': 'Ref A: 258627521BAA457B8C49184E8070B074 Ref B: 2E640288516791BD077BF04BE5D5D150 Ref C: Tue Feb 02 11:54:57 2016 PST', 'SET-COOKIE': '_EDGE_S=F=1&SID=07F7F82D3B9E623731E5F0823AEC633F; path=/; httponly; domain=bing.com', 'SET-COOKIE': '_EDGE_V=1; path=/; httponly; expires=Thu, 01-Feb-2018 19:54:57 GMT; domain=bing.com', 'SET-COOKIE': 'MUID=065BFEFE89CE6A5221D4F65188BC6B03; path=/; expires=Thu, 01-Feb-2018 19:54:57 GMT; domain=bing.com', 'SET-COOKIE': 'MUIDB=065BFEFE89CE6A5221D4F65188BC6B03; path=/; httponly; expires=Thu, 01-Feb-2018 19:54:57 GMT', 'DATE': 'Tue, 02 Feb 2016 19:54:57 GMT')>

Any ideas?

Thank you.

JackMordaunt commented 8 years ago

I had the same issue and fixed it by changing this line in the "async_mode.py" file: "yield from response.read_and_close(decode=False)" to this: "yield from response.text()".

After that I was getting a ClientResponseError, which led me to wrap the "yield from response.text()" call in a try/except to handle the error gracefully.
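For reference, the change amounts to something like the sketch below. FakeResponse is a hypothetical stand-in for aiohttp's ClientResponse (the real object needs a live connection): in newer aiohttp the read_and_close() method was removed, and read()/text() are used instead. The original GoogleScraper code used the pre-3.5 "yield from" coroutine syntax; the sketch uses the equivalent modern async/await form, and catches a generic Exception where the real code would catch aiohttp.ClientResponseError.

```python
import asyncio

class FakeResponse:
    """Hypothetical stand-in for aiohttp's ClientResponse after the
    API change: read_and_close() is gone; text() (and read()) remain."""
    async def text(self):
        return "<html>bing results</html>"

async def request(response):
    # Patched body-read from async_mode.py: call text() instead of the
    # removed read_and_close(), wrapped in try/except so a decoding or
    # response error does not crash the whole scrape job.
    try:
        body = await response.text()
    except Exception:  # aiohttp.ClientResponseError in the real code
        body = None
    return body

print(asyncio.run(request(FakeResponse())))
```

With the real library the same pattern applies unchanged; only the response object comes from an actual aiohttp request.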

It's working now :) Cheers.