NikolaiT / GoogleScraper

A Python module to scrape several search engines (like Google, Yandex, Bing, Duckduckgo, ...). Including asynchronous networking support.
https://scrapeulous.com/
Apache License 2.0
2.6k stars 734 forks

Google and Bing for images #130

Open carlosdcastillo opened 8 years ago

carlosdcastillo commented 8 years ago

I was using Google Scraper to download images for a list of search terms, as follows:

GoogleScraper -m selenium --search-engines "google,bing" --keyword-file keywords.txt -v2 -t image --output-filename output.json

and it fails to get any URLs (I check by opening the database with sqlite and running the query select * from link;). If I use yahoo or yandex, it works great. I've also confirmed that google and bing fail in the same way on both Linux and OS X.
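For anyone who wants to repeat that check without the sqlite shell, here is a minimal sketch using Python's standard sqlite3 module. The table name `link` comes from the query above; the function name `count_links` and the database path passed to it are assumptions for illustration, so point it at whatever database file your run actually produced.

```python
import sqlite3


def count_links(db_path, table="link"):
    """Return the number of rows in `table` of the SQLite database at `db_path`.

    `count_links` is a hypothetical helper, not part of GoogleScraper.
    """
    # Table names cannot be bound as SQL parameters, so accept only
    # plain identifiers before interpolating the name into the query.
    if not table.isidentifier():
        raise ValueError("unexpected table name: %r" % table)
    conn = sqlite3.connect(db_path)
    try:
        (row_count,) = conn.execute(
            "SELECT COUNT(*) FROM {}".format(table)
        ).fetchone()
        return row_count
    finally:
        conn.close()
```

Calling `count_links("path/to/your.db")` after a run and getting 0 for google and bing, but a positive number for yahoo or yandex, would match the behaviour described here.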

What am I doing wrong? How can I troubleshoot this and solve this problem?

Thank you very much!

besirkurtulmus commented 8 years ago

+1 for this issue. Google and Bing do not return any images for me either. I'm on OS X 10.10.5.

jeroenarens commented 8 years ago

Google and Bing do not return any results for me either. I'm on Ubuntu 14.04 64-bit.

NikolaiT commented 8 years ago

Thanks for your participation. I will look into this in the coming days.

asfaltboy commented 8 years ago

@NikolaiT in case you haven't taken a look yet, it seems at least one selector for Google image search is broken:

$ GoogleScraper -q "Seville" -t image -vDEBUG
2016-01-03 11:02:59,389 - GoogleScraper.caching - INFO - 7 cache files found in .scrapecache/
2016-01-03 11:02:59,390 - GoogleScraper.caching - INFO - 0/1 objects have been read from the cache. 1 remain to get scraped.
2016-01-03 11:02:59,399 - GoogleScraper.core - INFO - Going to scrape 1 keywords with 1 proxies by using 1 threads.
2016-01-03 11:02:59,400 - GoogleScraper.scraping - INFO - [+] HttpScrape[localhost][search-type:image][https://www.google.com/search?] using search engine "google". Num keywords=1, num pages for keyword=[1]
2016-01-03 11:03:01,401 - GoogleScraper.scraping - INFO - [[google]HttpScrape][localhost]]Keyword: "Seville" with [1] pages, slept 2 seconds before scraping. 1/1 already scraped.
2016-01-03 11:03:01,423 - requests.packages.urllib3.connectionpool - INFO - Starting new HTTPS connection (1): www.google.com
2016-01-03 11:03:01,719 - requests.packages.urllib3.connectionpool - DEBUG - "GET /search?hl=en&q=Seville&oq=Seville&biw=1920&tbm=isch&source=hp&bih=881&site=imghp HTTP/1.1" 200 None
2016-01-03 11:03:02,428 - GoogleScraper.http_mode - DEBUG - [HTTP - https://www.google.com/search?hl=en&q=Seville&oq=Seville&biw=1920&tbm=isch&source=hp&bih=881&site=imghp, headers={'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Encoding': 'gzip, deflate', 'Connection': 'keep-alive', 'Accept-Language': 'en-US,en;q=0.5', 'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 8_1_2 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) CriOS/39.0.2171.50 Mobile/12B440 Safari/600.1.4'}, params={'hl': 'en', 'q': 'Seville', 'oq': 'Seville', 'biw': 1920, 'tbm': 'isch', 'source': 'hp', 'bih': 881, 'site': 'imghp'}
2016-01-03 11:03:02,442 - GoogleScraper.parsing - DEBUG - GoogleParser: Cannot parse num_results from serp page with selectors ['#resultStats']
{'effective_query': '',
 'id': '8',
 'no_results': 'False',
 'num_results': '0',
 'num_results_for_query': '0',
 'page_number': '1',
 'query': 'Seville',
 'requested_at': '2016-01-03 10:03:02.428049',
 'requested_by': 'localhost',
 'results': [],
 'scrape_method': 'http',
 'search_engine_name': 'google',
 'status': 'successful'}
2016-01-03 11:03:02,474 - GoogleScraper.scraping - DEBUG - No results to store for keyword: "Seville" in search engine: google
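
To confirm independently that the `#resultStats` selector no longer matches, one option is to save the returned page and check whether any element carries that id. The following is a minimal sketch using only the standard library's html.parser; `IdFinder` and `has_element_with_id` are hypothetical names, and this is not how GoogleScraper itself evaluates selectors.

```python
from html.parser import HTMLParser


class IdFinder(HTMLParser):
    """Records whether any start tag has the wanted id attribute."""

    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if dict(attrs).get("id") == self.wanted_id:
            self.found = True


def has_element_with_id(html, element_id):
    """Return True if the HTML text contains an element with this id."""
    finder = IdFinder(element_id)
    finder.feed(html)
    finder.close()
    return finder.found
```

If `has_element_with_id(page_html, "resultStats")` returns False for the saved page, the selector in the log above has nothing to match, which is consistent with the empty `results` list.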

m3nu commented 8 years ago

#149 fixes the selectors for Google and Yahoo. Some small details changed.

jtchilders commented 7 years ago

This seems to be a problem again. I am on OS X 10.11.6 and ran:

GoogleScraper -t image -m selenium -q lego -o output.json --sel-browser chrome -n 50 --search-engines google -v DEBUG

I include the full output below, but the primary error is GoogleParser: Cannot parse num_results from serp page with selectors ['#resultStats']

2016-08-30 19:33:59,249 - GoogleScraper.caching - INFO - 0 cache files found in .scrapecache/
2016-08-30 19:33:59,250 - GoogleScraper.caching - INFO - 0/1 objects have been read from the cache. 1 remain to get scraped.
2016-08-30 19:33:59,253 - GoogleScraper.core - INFO - Going to scrape 1 keywords with 1 proxies by using 1 threads.
2016-08-30 19:33:59,254 - GoogleScraper.scraping - INFO - [+] SelScrape[localhost][search-type:image][https://www.google.com/search?] using search engine "google". Num keywords=1, num pages for keyword=[1]
2016-08-30 19:34:00,274 - selenium.webdriver.remote.remote_connection - DEBUG - POST http://127.0.0.1:51621/session {"desiredCapabilities": {"platform": "ANY", "version": "", "chromeOptions": {"extensions": [], "args": []}, "browserName": "chrome", "javascriptEnabled": true}, "requiredCapabilities": {}}
2016-08-30 19:34:01,070 - selenium.webdriver.remote.remote_connection - DEBUG - Finished Request
2016-08-30 19:34:01,071 - selenium.webdriver.remote.remote_connection - DEBUG - POST http://127.0.0.1:51621/session/7559e99ba1e487d7057ee0a5c0a83a6b/window/current/size {"sessionId": "7559e99ba1e487d7057ee0a5c0a83a6b", "windowHandle": "current", "height": 400, "width": 400}
2016-08-30 19:34:01,390 - selenium.webdriver.remote.remote_connection - DEBUG - Finished Request
2016-08-30 19:34:01,391 - selenium.webdriver.remote.remote_connection - DEBUG - POST http://127.0.0.1:51621/session/7559e99ba1e487d7057ee0a5c0a83a6b/window/current/position {"y": 0, "sessionId": "7559e99ba1e487d7057ee0a5c0a83a6b", "windowHandle": "current", "x": 400}
2016-08-30 19:34:01,501 - selenium.webdriver.remote.remote_connection - DEBUG - Finished Request
2016-08-30 19:34:01,501 - selenium.webdriver.remote.remote_connection - DEBUG - POST http://127.0.0.1:51621/session/7559e99ba1e487d7057ee0a5c0a83a6b/url {"sessionId": "7559e99ba1e487d7057ee0a5c0a83a6b", "url": "https://www.google.com/imghp"}
2016-08-30 19:34:02,381 - selenium.webdriver.remote.remote_connection - DEBUG - Finished Request
2016-08-30 19:34:02,381 - selenium.webdriver.remote.remote_connection - DEBUG - POST http://127.0.0.1:51621/session/7559e99ba1e487d7057ee0a5c0a83a6b/element {"sessionId": "7559e99ba1e487d7057ee0a5c0a83a6b", "using": "name", "value": "q"}
2016-08-30 19:34:02,401 - selenium.webdriver.remote.remote_connection - DEBUG - Finished Request
2016-08-30 19:34:02,401 - selenium.webdriver.remote.remote_connection - DEBUG - POST http://127.0.0.1:51621/session/7559e99ba1e487d7057ee0a5c0a83a6b/element/0.9976552533101597-1/clear {"id": "0.9976552533101597-1", "sessionId": "7559e99ba1e487d7057ee0a5c0a83a6b"}
2016-08-30 19:34:02,422 - selenium.webdriver.remote.remote_connection - DEBUG - Finished Request
2016-08-30 19:34:02,674 - selenium.webdriver.remote.remote_connection - DEBUG - POST http://127.0.0.1:51621/session/7559e99ba1e487d7057ee0a5c0a83a6b/element/0.9976552533101597-1/value {"id": "0.9976552533101597-1", "sessionId": "7559e99ba1e487d7057ee0a5c0a83a6b", "value": ["l", "e", "g", "o", "\ue007"]}
2016-08-30 19:34:02,751 - selenium.webdriver.remote.remote_connection - DEBUG - Finished Request
2016-08-30 19:34:04,751 - GoogleScraper.scraping - INFO - [[google]SelScrape][localhost]]Keyword: "lego" with [1] pages, slept 2 seconds before scraping. 1/1 already scraped.
2016-08-30 19:34:04,752 - selenium.webdriver.remote.remote_connection - DEBUG - GET http://127.0.0.1:51621/session/7559e99ba1e487d7057ee0a5c0a83a6b/title {"sessionId": "7559e99ba1e487d7057ee0a5c0a83a6b"}
2016-08-30 19:34:04,760 - selenium.webdriver.remote.remote_connection - DEBUG - Finished Request
2016-08-30 19:34:04,761 - selenium.webdriver.remote.remote_connection - DEBUG - POST http://127.0.0.1:51621/session/7559e99ba1e487d7057ee0a5c0a83a6b/execute {"sessionId": "7559e99ba1e487d7057ee0a5c0a83a6b", "script": "return document.body.innerHTML;", "args": []}
2016-08-30 19:34:04,876 - selenium.webdriver.remote.remote_connection - DEBUG - Finished Request
2016-08-30 19:34:04,911 - GoogleScraper.parsing - DEBUG - GoogleParser: Cannot parse num_results from serp page with selectors ['#resultStats']
2016-08-30 19:34:04,979 - GoogleScraper.scraping - DEBUG - No results to store for keyword: "lego" in search engine: google
2016-08-30 19:34:05,075 - selenium.webdriver.remote.remote_connection - DEBUG - POST http://127.0.0.1:51621/session/7559e99ba1e487d7057ee0a5c0a83a6b/execute {"sessionId": "7559e99ba1e487d7057ee0a5c0a83a6b", "script": "\n var w = window,\n d = document,\n e = d.documentElement,\n g = d.getElementsByTagName('body')[0],\n y = w.innerHeight|| e.clientHeight|| g.clientHeight;\n\n window.scrollBy(0,y);\n return y;\n ", "args": []}
2016-08-30 19:34:05,081 - selenium.webdriver.remote.remote_connection - DEBUG - Finished Request
2016-08-30 19:34:05,082 - selenium.webdriver.remote.remote_connection - DEBUG - DELETE http://127.0.0.1:51621/session/7559e99ba1e487d7057ee0a5c0a83a6b {"sessionId": "7559e99ba1e487d7057ee0a5c0a83a6b"}
2016-08-30 19:34:05,139 - selenium.webdriver.remote.remote_connection - DEBUG - Finished Request

asfaltboy commented 7 years ago

@jtchilders I believe @NikolaiT has yet to merge @m3nu's PR #149.

Can you please try out his fork and report back whether the fix works for you?

jtchilders commented 7 years ago

Ah, I see @asfaltboy. Thanks for the hint! That did the trick.