NikolaiT / GoogleScraper

A Python module to scrape several search engines (like Google, Yandex, Bing, Duckduckgo, ...). Including asynchronous networking support.
Apache License 2.0
2.6k stars 734 forks

Google and Bing for images #130

Open carlosdcastillo opened 8 years ago

carlosdcastillo commented 8 years ago

I was using GoogleScraper to download images for a list of search terms, as follows:

GoogleScraper -m selenium --search-engines "google,bing" --keyword-file keywords.txt -v2 -t image --output-filename output.json

and it fails to get any URLs (I check by opening the results database with sqlite and running the query select * from link;). If I use yahoo or yandex, it works great. I've also confirmed that google and bing fail in the same way on both Linux and OS X.
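
For anyone who wants to repeat the sqlite check from a script, here is a minimal sketch. It assumes only what is stated above, namely that the results database has a table named link; the function name count_links and the database path you pass in are my own placeholders, not part of GoogleScraper.

```python
import sqlite3


def count_links(db_path):
    """Return the number of rows in the `link` table of the given SQLite file.

    `db_path` is whatever database file your scrape produced; the table
    name `link` comes from the query quoted in this issue.
    """
    conn = sqlite3.connect(db_path)
    try:
        (count,) = conn.execute("SELECT COUNT(*) FROM link").fetchone()
        return count
    finally:
        conn.close()
```

A count of 0 after a run that reports status "successful" matches the symptom described here: the request went through, but nothing was extracted from the page.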

What am I doing wrong? How can I troubleshoot this and solve this problem?

Thank you very much!

besirkurtulmus commented 8 years ago

+1 for this issue. Google and Bing do not return any images for me either. I'm on OS X 10.10.5.

jeroenarens commented 8 years ago

Google and Bing do not return any results for me either. I'm on Ubuntu 14.04 64-bit.

NikolaiT commented 8 years ago

Thanks for your participation. I will look into this in the coming days.

asfaltboy commented 8 years ago

@NikolaiT in case you haven't taken a look yet, it seems at least one selector for Google image search is broken:

$ GoogleScraper -q "Seville" -t image -vDEBUG
2016-01-03 11:02:59,389 - GoogleScraper.caching - INFO - 7 cache files found in .scrapecache/
2016-01-03 11:02:59,390 - GoogleScraper.caching - INFO - 0/1 objects have been read from the cache. 1 remain to get scraped.
2016-01-03 11:02:59,399 - GoogleScraper.core - INFO - Going to scrape 1 keywords with 1 proxies by using 1 threads.
2016-01-03 11:02:59,400 - GoogleScraper.scraping - INFO - [+] HttpScrape[localhost][search-type:image][] using search engine "google". Num keywords=1, num pages for keyword=[1]
2016-01-03 11:03:01,401 - GoogleScraper.scraping - INFO - [[google]HttpScrape][localhost]]Keyword: "Seville" with [1] pages, slept 2 seconds before scraping. 1/1 already scraped.
2016-01-03 11:03:01,423 - requests.packages.urllib3.connectionpool - INFO - Starting new HTTPS connection (1):
2016-01-03 11:03:01,719 - requests.packages.urllib3.connectionpool - DEBUG - "GET /search?hl=en&q=Seville&oq=Seville&biw=1920&tbm=isch&source=hp&bih=881&site=imghp HTTP/1.1" 200 None
2016-01-03 11:03:02,428 - GoogleScraper.http_mode - DEBUG - [HTTP -, headers={'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Encoding': 'gzip, deflate', 'Connection': 'keep-alive', 'Accept-Language': 'en-US,en;q=0.5', 'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 8_1_2 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) CriOS/39.0.2171.50 Mobile/12B440 Safari/600.1.4'}, params={'hl': 'en', 'q': 'Seville', 'oq': 'Seville', 'biw': 1920, 'tbm': 'isch', 'source': 'hp', 'bih': 881, 'site': 'imghp'}
2016-01-03 11:03:02,442 - GoogleScraper.parsing - DEBUG - GoogleParser: Cannot parse num_results from serp page with selectors ['#resultStats']
{'effective_query': '',
 'id': '8',
 'no_results': 'False',
 'num_results': '0',
 'num_results_for_query': '0',
 'page_number': '1',
 'query': 'Seville',
 'requested_at': '2016-01-03 10:03:02.428049',
 'requested_by': 'localhost',
 'results': [],
 'scrape_method': 'http',
 'search_engine_name': 'google',
 'status': 'successful'}
2016-01-03 11:03:02,474 - GoogleScraper.scraping - DEBUG - No results to store for keyword: "Seville" in search engine: google
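
The debug line above says the parser found nothing for the selector #resultStats. One rough way to confirm that the element is really absent from a page you saved yourself is sketched below. This is not how GoogleScraper parses pages; it is a standalone check using only the standard library, and the names IdFinder and has_element_with_id are hypothetical.

```python
from html.parser import HTMLParser


class IdFinder(HTMLParser):
    """Records whether any start tag carries the given id attribute."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if dict(attrs).get("id") == self.target_id:
            self.found = True


def has_element_with_id(html, target_id):
    """Return True if `html` contains an element whose id equals `target_id`."""
    finder = IdFinder(target_id)
    finder.feed(html)
    return finder.found
```

If has_element_with_id(saved_html, "resultStats") returns False for the page the scraper received, the selector is stale rather than the request failing.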

m3nu commented 8 years ago

#149 fixes the selectors for Google and Yahoo. Some small details changed.

jtchilders commented 7 years ago

This seems to be a problem again. I am on OS X 10.11.6 and ran:

GoogleScraper -t image -m selenium -q lego -o output.json --sel-browser chrome -n 50 --search-engines google -v DEBUG

I include the full output below, but the primary error is GoogleParser: Cannot parse num_results from serp page with selectors ['#resultStats']

2016-08-30 19:33:59,249 - GoogleScraper.caching - INFO - 0 cache files found in .scrapecache/
2016-08-30 19:33:59,250 - GoogleScraper.caching - INFO - 0/1 objects have been read from the cache. 1 remain to get scraped.
2016-08-30 19:33:59,253 - GoogleScraper.core - INFO - Going to scrape 1 keywords with 1 proxies by using 1 threads.
2016-08-30 19:33:59,254 - GoogleScraper.scraping - INFO - [+] SelScrape[localhost][search-type:image][] using search engine "google". Num keywords=1, num pages for keyword=[1]
2016-08-30 19:34:00,274 - selenium.webdriver.remote.remote_connection - DEBUG - POST {"desiredCapabilities": {"platform": "ANY", "version": "", "chromeOptions": {"extensions": [], "args": []}, "browserName": "chrome", "javascriptEnabled": true}, "requiredCapabilities": {}}
2016-08-30 19:34:01,070 - selenium.webdriver.remote.remote_connection - DEBUG - Finished Request
2016-08-30 19:34:01,071 - selenium.webdriver.remote.remote_connection - DEBUG - POST {"sessionId": "7559e99ba1e487d7057ee0a5c0a83a6b", "windowHandle": "current", "height": 400, "width": 400}
2016-08-30 19:34:01,390 - selenium.webdriver.remote.remote_connection - DEBUG - Finished Request
2016-08-30 19:34:01,391 - selenium.webdriver.remote.remote_connection - DEBUG - POST {"y": 0, "sessionId": "7559e99ba1e487d7057ee0a5c0a83a6b", "windowHandle": "current", "x": 400}
2016-08-30 19:34:01,501 - selenium.webdriver.remote.remote_connection - DEBUG - Finished Request
2016-08-30 19:34:01,501 - selenium.webdriver.remote.remote_connection - DEBUG - POST {"sessionId": "7559e99ba1e487d7057ee0a5c0a83a6b", "url": ""}
2016-08-30 19:34:02,381 - selenium.webdriver.remote.remote_connection - DEBUG - Finished Request
2016-08-30 19:34:02,381 - selenium.webdriver.remote.remote_connection - DEBUG - POST {"sessionId": "7559e99ba1e487d7057ee0a5c0a83a6b", "using": "name", "value": "q"}
2016-08-30 19:34:02,401 - selenium.webdriver.remote.remote_connection - DEBUG - Finished Request
2016-08-30 19:34:02,401 - selenium.webdriver.remote.remote_connection - DEBUG - POST {"id": "0.9976552533101597-1", "sessionId": "7559e99ba1e487d7057ee0a5c0a83a6b"}
2016-08-30 19:34:02,422 - selenium.webdriver.remote.remote_connection - DEBUG - Finished Request
2016-08-30 19:34:02,674 - selenium.webdriver.remote.remote_connection - DEBUG - POST {"id": "0.9976552533101597-1", "sessionId": "7559e99ba1e487d7057ee0a5c0a83a6b", "value": ["l", "e", "g", "o", "\ue007"]}
2016-08-30 19:34:02,751 - selenium.webdriver.remote.remote_connection - DEBUG - Finished Request
2016-08-30 19:34:04,751 - GoogleScraper.scraping - INFO - [[google]SelScrape][localhost]]Keyword: "lego" with [1] pages, slept 2 seconds before scraping. 1/1 already scraped.
2016-08-30 19:34:04,752 - selenium.webdriver.remote.remote_connection - DEBUG - GET {"sessionId": "7559e99ba1e487d7057ee0a5c0a83a6b"}
2016-08-30 19:34:04,760 - selenium.webdriver.remote.remote_connection - DEBUG - Finished Request
2016-08-30 19:34:04,761 - selenium.webdriver.remote.remote_connection - DEBUG - POST {"sessionId": "7559e99ba1e487d7057ee0a5c0a83a6b", "script": "return document.body.innerHTML;", "args": []}
2016-08-30 19:34:04,876 - selenium.webdriver.remote.remote_connection - DEBUG - Finished Request
2016-08-30 19:34:04,911 - GoogleScraper.parsing - DEBUG - GoogleParser: Cannot parse num_results from serp page with selectors ['#resultStats']
2016-08-30 19:34:04,979 - GoogleScraper.scraping - DEBUG - No results to store for keyword: "lego" in search engine: google
2016-08-30 19:34:05,075 - selenium.webdriver.remote.remote_connection - DEBUG - POST {"sessionId": "7559e99ba1e487d7057ee0a5c0a83a6b", "script": "\n var w = window,\n d = document,\n e = d.documentElement,\n g = d.getElementsByTagName('body')[0],\n y = w.innerHeight|| e.clientHeight|| g.clientHeight;\n\n window.scrollBy(0,y);\n return y;\n ", "args": []}
2016-08-30 19:34:05,081 - selenium.webdriver.remote.remote_connection - DEBUG - Finished Request
2016-08-30 19:34:05,082 - selenium.webdriver.remote.remote_connection - DEBUG - DELETE {"sessionId": "7559e99ba1e487d7057ee0a5c0a83a6b"}
2016-08-30 19:34:05,139 - selenium.webdriver.remote.remote_connection - DEBUG - Finished Request
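
Since the run writes output.json, a quick way to see which engine and keyword pairs came back empty is sketched below. It rests on an assumption I have not verified: that the file is a JSON list of entries with the keys visible in the result dict printed earlier in this thread (search_engine_name, query, results). The function names are hypothetical.

```python
import json


def load_entries(path):
    """Load the scrape output; assumed here to be a JSON list of dicts."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def empty_result_pages(entries):
    """Return (search_engine_name, query) for every entry with no results.

    An entry counts as empty when its `results` key is missing or is an
    empty list, as in the 'results': [] dict shown in the earlier log.
    """
    return [
        (entry.get("search_engine_name"), entry.get("query"))
        for entry in entries
        if not entry.get("results")
    ]
```

Calling empty_result_pages(load_entries("output.json")) lists the pairs to investigate, which is easier to scan than the full debug log.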

asfaltboy commented 7 years ago

@jtchilders I believe @NikolaiT has yet to merge @m3nu's PR #149.

Can you please try out his fork and report back whether the fix works for you?

jtchilders commented 7 years ago

Ah, I see @asfaltboy. Thanks for the hint! That did the trick.