NikolaiT / GoogleScraper

A Python module to scrape several search engines (like Google, Yandex, Bing, DuckDuckGo, ...), including asynchronous networking support.
https://scrapeulous.com/
Apache License 2.0
2.64k stars 743 forks

Not returning results #66

Open · aidiss opened 9 years ago

aidiss commented 9 years ago
D:\scraping\googlescrap>GoogleScraper -m selenium --keyword-file testkeywords.txt --output-filename output.json -v2 --search-engines "google,bing,yahoo"
2015-01-14 07:05:10,082 - GoogleScraper - INFO - 10 cache files found in .scrapecache/
2015-01-14 07:05:10,082 - GoogleScraper - INFO - 0/9 keywords have been cached and are ready to get parsed. 9 remain to get scraped.
2015-01-14 07:05:10,105 - GoogleScraper - INFO - Going to scrape 3 keywords with 1 proxies by using 3 threads.
2015-01-14 07:05:10,105 - GoogleScraper - INFO - [+] SelScrape[localhost][search-type:normal][https://de.search.yahoo.com/search?] using search engine "yahoo". Num keywords =3, num pages for keyword=1
2015-01-14 07:05:10,105 - GoogleScraper - INFO - [+] SelScrape[localhost][search-type:normal][http://www.bing.com/search?] using search engine "bing". Num keywords =3, num pages for keyword=1
2015-01-14 07:05:10,105 - GoogleScraper - INFO - [+] SelScrape[localhost][search-type:normal][https://www.google.com/search?] using search engine "google". Num keywords =3, num pages for keyword=1
2015-01-14 07:05:19,227 - GoogleScraper - INFO - [Thread-3][localhost][bing]Keyword: "second" with 1 pages, slept 1 seconds before scraping. 1/3 already scraped.
2015-01-14 07:05:20,386 - GoogleScraper - INFO - [Thread-2][localhost][yahoo]Keyword: "second" with 1 pages, slept 1 seconds before scraping. 1/3 already scraped.
2015-01-14 07:05:20,984 - GoogleScraper - INFO - [Thread-3][localhost][bing]Keyword: "third" with 1 pages, slept 1 seconds before scraping. 2/3 already scraped.
2015-01-14 07:05:22,689 - GoogleScraper - INFO - [Thread-2][localhost][yahoo]Keyword: "third" with 1 pages, slept 1 seconds before scraping. 2/3 already scraped.
2015-01-14 07:05:23,048 - GoogleScraper - INFO - [Thread-4][localhost][google]Keyword: "second" with 1 pages, slept 5 seconds before scraping. 1/3 already scraped.
2015-01-14 07:05:23,142 - GoogleScraper - INFO - [Thread-3][localhost][bing]Keyword: "first" with 1 pages, slept 1 seconds before scraping. 3/3 already scraped.
2015-01-14 07:05:24,945 - GoogleScraper - INFO - [Thread-2][localhost][yahoo]Keyword: "first" with 1 pages, slept 1 seconds before scraping. 3/3 already scraped.
2015-01-14 07:05:27,888 - GoogleScraper - INFO - [Thread-4][localhost][google]Keyword: "third" with 1 pages, slept 4 seconds before scraping. 2/3 already scraped.
2015-01-14 07:05:32,493 - GoogleScraper - INFO - [Thread-4][localhost][google]Keyword: "first" with 1 pages, slept 4 seconds before scraping. 3/3 already scraped.

testkeywords.txt includes:

first
second
third

output.json

[]

About my installation

             platform : win-32
        conda version : 3.7.4
  conda-build version : not installed
       python version : 3.4.1.final.0

I have reinstalled GoogleScraper today with pip uninstall GoogleScraper and pip install GoogleScraper .

Also, it might be a separate issue, but it seems like the Chrome webdriver is still running after scraping finishes.
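A common cause of orphaned browser processes is that `driver.quit()` is never reached when a scrape thread raises. A minimal sketch of the defensive pattern, using a hypothetical `FakeDriver` stand-in for Selenium's `webdriver.Chrome` so the example runs without Selenium installed:

```python
import atexit


class FakeDriver:
    """Stand-in for selenium.webdriver.Chrome (hypothetical, for illustration only)."""

    def __init__(self):
        self.closed = False

    def get(self, url):
        pass  # a real driver would navigate to the URL here

    def quit(self):
        self.closed = True  # a real driver would terminate the browser process


def scrape_with_cleanup(driver, url):
    # Safety net: quit the driver even if the process exits abnormally.
    atexit.register(driver.quit)
    try:
        driver.get(url)
        return "results"
    finally:
        # Normal path: shut the webdriver down as soon as scraping is done,
        # so no chromedriver.exe is left running.
        driver.quit()


driver = FakeDriver()
scrape_with_cleanup(driver, "https://www.google.com/search?q=first")
print(driver.closed)  # True: the driver was shut down
```

The `finally` block handles the normal and exception paths; `atexit` catches the case where the thread dies without unwinding cleanly.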

NikolaiT commented 9 years ago

Yeah, I experienced this yesterday as well. Will fix it in 3 hours, after uni classes :)

NikolaiT commented 9 years ago

Should be fixed now. Can you confirm?

aidiss commented 9 years ago
2015-01-15 06:44:56,121 - GoogleScraper - INFO - 0 cache files found in .scrapecache/..........................................
Exception in thread Thread-2:
Traceback (most recent call last):
  File "C:\c\lib\threading.py", line 920, in _bootstrap_inner
    self.run()
  File "C:\c\lib\site-packages\GoogleScraper\selenium_mode.py", line 431, in run
    self.search()
  File "C:\c\lib\site-packages\GoogleScraper\selenium_mode.py", line 385, in search
    super().after_search()
  File "C:\c\lib\site-packages\GoogleScraper\scraping.py", line 400, in after_search
    if not self.store():
  File "C:\c\lib\site-packages\GoogleScraper\scraping.py", line 334, in store
    store_serp_result(serp)
  File "C:\c\lib\site-packages\GoogleScraper\output_converter.py", line 65, in store_serp_result
    outfile.writeheader()
  File "C:\c\lib\csv.py", line 142, in writeheader
    self.writerow(header)
  File "C:\c\lib\csv.py", line 153, in writerow
    return self.writer.writerow(self._dict_to_list(rowdict))
  File "C:\c\lib\csv.py", line 149, in _dict_to_list
    + ", ".join([repr(x) for x in wrong_fields]))
ValueError: dict contains fields not in fieldnames: 'domain', 'id', 'rank', 'snippet'

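The traceback points at `csv.DictWriter`: the SERP dict being stored carries keys ('domain', 'id', 'rank', 'snippet') that are not in the `fieldnames` the writer was constructed with, so `_dict_to_list` raises. A minimal reproduction with the standard library (the column names are illustrative, not necessarily GoogleScraper's actual ones):

```python
import csv
import io

buf = io.StringIO()

# Writer built with a fieldnames list that is missing some of the row's keys,
# mirroring the mismatch in the traceback above.
writer = csv.DictWriter(buf, fieldnames=["keyword", "title", "url"])
writer.writeheader()

row = {
    "keyword": "first",
    "title": "Example",
    "url": "http://example.com",
    # keys not declared in fieldnames -> ValueError on writerow()
    "domain": "example.com",
    "id": 1,
    "rank": 1,
    "snippet": "...",
}

try:
    writer.writerow(row)
except ValueError as e:
    print(e)  # dict contains fields not in fieldnames: ...

# Two possible fixes: declare all keys in fieldnames, or tell the writer
# to silently drop extras with extrasaction="ignore".
lenient = csv.DictWriter(buf, fieldnames=["keyword", "title", "url"],
                         extrasaction="ignore")
lenient.writerow(row)  # no exception; extra keys are dropped
```

So the fix on the library side is to keep the `fieldnames` list in `output_converter.py` in sync with the keys of the SERP dict (or pass `extrasaction="ignore"` if the extra columns are intentional).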
NikolaiT commented 9 years ago

Working on this.