NikolaiT / GoogleScraper

A Python module to scrape several search engines (like Google, Yandex, Bing, Duckduckgo, ...). Including asynchronous networking support.
https://scrapeulous.com/
Apache License 2.0

Cannot extract web search results from Yahoo #134

Open nptdat opened 8 years ago

nptdat commented 8 years ago

Thank you for the awesome search engine scraping tool!

I'm trying to use GoogleScraper to extract URLs for given search terms. It works well with Google and Bing, but it cannot extract search results from Yahoo. Here is the command I ran in http mode:

GoogleScraper -m http --keyword "トヨタ" -p 3 -s yahoo

No results were returned.

I then tried scraping in selenium mode:

GoogleScraper -m selenium --keyword "トヨタ" -p 3 -s yahoo --sel-browser phantomjs

The following error occurs:

Exception in thread [yahoo]SelScrape:
Traceback (most recent call last):
  File "/Users/xyz/projects/googlesearch/.virtualenv/lib/python3.4/site-packages/GoogleScraper/selenium_mode.py", line 455, in _find_next_page_element
    WebDriverWait(self.webdriver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, selector)))
  File "/Users/xyz/projects/googlesearch/.virtualenv/lib/python3.4/site-packages/selenium/webdriver/support/wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
Screenshot: available via screen

Because I am searching with Japanese terms, I tried changing the Yahoo search URL in the scrape_config.py file as follows:

yahoo_search_url = 'http://search.yahoo.co.jp/search?'
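For illustration, here is a minimal sketch of what a request URL built on this base might look like for the Japanese keyword. The helper `build_search_url` is hypothetical, and the query parameter name `p` is an assumption; neither is taken from GoogleScraper's code.

```python
from urllib.parse import urlencode

# Base URL as set in scrape_config.py above.
YAHOO_JP_SEARCH_URL = 'http://search.yahoo.co.jp/search?'


def build_search_url(base_url, keyword):
    """Hypothetical helper: append the percent-encoded keyword to the base URL.

    Assumes the search term is passed in a query parameter named 'p'.
    """
    return base_url + urlencode({'p': keyword})


url = build_search_url(YAHOO_JP_SEARCH_URL, 'トヨタ')
print(url)
```

In this sketch, `urlencode` percent-encodes the non-ASCII keyword as UTF-8, so the keyword is encoded the same way whichever base URL is configured.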

With that change, I get a different error:

selenium.common.exceptions.NoSuchElementException: Message: {"errorMessage":"Unable to find element with css selector '.compPagination strong'","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Connection":"close","Content-Length":"113","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:50667","User-Agent":"Python-urllib/3.4"},"httpVersion":"1.1","method":"POST","post":"{\"value\": \".compPagination strong\", \"using\": \"css selector\", \"sessionId\": \"c8247ec0-a669-11e5-8029-e94abf73bc4c\"}","url":"/element","urlParsed":{"anchor":"","query":"","file":"element","directory":"/","path":"/element","relative":"/element","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/element","queryKey":{},"chunks":["element"]},"urlOriginal":"/session/c8247ec0-a669-11e5-8029-e94abf73bc4c/element"}}
Screenshot: available via screen
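The error says that no element matches the CSS selector `.compPagination strong` on the page that was loaded. One rough way to check this against a saved copy of the page, without a browser, is sketched below using only the standard library. The class `DescendantFinder`, the function `has_descendant`, and the sample markup are all made up for illustration; the real Yahoo Japan markup is not shown in this thread.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
             'input', 'link', 'meta', 'source', 'track', 'wbr'}


class DescendantFinder(HTMLParser):
    """Rough equivalent of the CSS selector '.<ancestor_class> <tag>'."""

    def __init__(self, ancestor_class, tag):
        super().__init__()
        self.ancestor_class = ancestor_class
        self.tag = tag
        self.stack = []      # (tag name, carries ancestor_class?) per open element
        self.found = False

    def handle_starttag(self, tag, attrs):
        # A match is the target tag nested inside any open element with the class.
        if tag == self.tag and any(flag for _, flag in self.stack):
            self.found = True
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get('class') or '').split()
        self.stack.append((tag, self.ancestor_class in classes))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, tolerating unclosed ones.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def has_descendant(html, ancestor_class, tag):
    finder = DescendantFinder(ancestor_class, tag)
    finder.feed(html)
    return finder.found


# Made-up markup: one page with the expected pagination block, one without.
with_block = '<div class="compPagination"><strong>1</strong><a href="#">2</a></div>'
without_block = '<div class="pagination"><strong>1</strong><a href="#">2</a></div>'
print(has_descendant(with_block, 'compPagination', 'strong'))
print(has_descendant(without_block, 'compPagination', 'strong'))
```

If a check like this returns False for the saved search.yahoo.co.jp page, the pagination markup on that domain presumably differs from what the selector expects, and the selector would need to be adapted.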

Is there anything wrong with my settings? I'm using Mac OS X 10.10.5.

Thank you!

imaxmin commented 8 years ago

Hi @hitheone, have you figured this out? I'm hitting the same exception when using selenium with Chrome on a Mac.

DivyanshC commented 8 years ago

change 'result_container': 'li' in parsing.py