NikolaiT / GoogleScraper

A Python module to scrape several search engines (like Google, Yandex, Bing, Duckduckgo, ...). Including asynchronous networking support.
https://scrapeulous.com/
Apache License 2.0

SelScrape error causes GoogleScraper to stop #58

Open leadscloud opened 9 years ago

leadscloud commented 9 years ago
Exception in thread Thread-2:
Traceback (most recent call last):
  File "C:\Python33\lib\threading.py", line 637, in _bootstrap_inner
    self.run()
  File "D:\workfiles\PythonScript\GoogleScraper-master\GoogleScraper\selenium.py", line 486, in run
    self.search()
  File "D:\workfiles\PythonScript\GoogleScraper-master\GoogleScraper\selenium.py", line 418, in search
    self.search_input.clear()
  File "C:\Python33\lib\site-packages\selenium\webdriver\remote\webelement.py", line 73, in clear
    self._execute(Command.CLEAR_ELEMENT)
  File "C:\Python33\lib\site-packages\selenium\webdriver\remote\webelement.py", line 385, in _execute
    return self._parent.execute(command, params)
  File "C:\Python33\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 173, in execute
    self.error_handler.check_response(response)
  File "C:\Python33\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 166, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.InvalidElementStateException: Message: invalid element state: Element is not currently interactable and may not be manipulated
  (Session info: chrome=39.0.2171.95)
  (Driver info: chromedriver=2.9.248315,platform=Windows NT 6.1 SP1 x86_64)

Process finished with exit code 0

Line 486 is self.search(); line 418 is self.search_input.clear().

leadscloud commented 9 years ago

That's because I changed your code. But there is another error to solve.

For Yahoo search: if you open Yahoo in an incognito window, search for any keyword, and then run document.getElementsByName("p") in the console, it returns three elements:

 <input type="text" id="vsyc-yschsp" name="p" class="qstext" autocomplete="off" value="">
 <input type="text" class="sbq" id="yschsp" name="p" value="sand making machine" autocomplete="off" tabindex="1" autocorrect="off" autocapitalize="off" aria-haspopup="true" style="-webkit-tap-highlight-color: transparent;">
 <input type="text" class="sbq" id="yschsp-bot" name="p" value="sand making machine" autocomplete="off">

The first element is not visible, so GoogleScraper always raises an error on Yahoo.

Below is my code:

# Module-level imports this method relies on:
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait

def _wait_until_search_input_field_appears(self, max_wait=5):
    """Waits until a visible search input field can be located for the current search engine.

    Args:
        max_wait: How long to wait maximally before returning False.

    Returns: False if no visible search input field could be located within the time,
            or the handle to the search input field.
    """
    def find_visible_search_input(driver):
        # Several elements can match the locator (Yahoo has three inputs named "p"),
        # so return the first one that is actually displayed.
        inputs = driver.find_elements(*self._get_search_input_field())
        for element in inputs:
            if element.is_displayed():
                return element
        return False
    try:
        search_input = WebDriverWait(self.webdriver, max_wait).until(find_visible_search_input)
        return search_input
    except TimeoutException as e:
        logger.error("TimeoutException waiting for search input field: {0}".format(e))
        return False

NikolaiT commented 9 years ago

I have some issues when going to the next page in Google:

nikolai@nikolai:~/Projects/private/GoogleScraper$ ./run.py -m selenium -s google -q hello -p 5
2015-01-11 15:49:55,772 - GoogleScraper - INFO - 0 cache files found in .scrapecache/
2015-01-11 15:49:55,772 - GoogleScraper - INFO - 0/1 keywords have been cached and are ready to get parsed. 1 remain to get scraped.
2015-01-11 15:49:55,822 - GoogleScraper - INFO - Going to scrape 1 keywords with 1 proxies by using 1 threads.
2015-01-11 15:49:55,825 - GoogleScraper - INFO - [+] SelScrape[localhost][search-type:normal][https://www.google.com/search?] using search engine "google". Num keywords =1, num pages for keyword=5
2015-01-11 15:50:04,929 - GoogleScraper - WARNING - Cannot locate next page element: Message: unknown error: Element is not clickable at point (338, 294). Other element would receive the click: <div id="flyr" class="flyr-o" style="width: 833px; height: 1502px; top: 106px;"></div>
  (Session info: chrome=39.0.2171.65)
  (Driver info: chromedriver=2.12.301324 (de8ab311bc9374d0ade71f7c167bad61848c7c48),platform=Linux 3.13.0-37-generic x86_64)

It works for Yahoo and Bing. I've taken your code.
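The warning above says the click on the next page element would land on the overlay div with id "flyr" instead. One possible way to handle this, sketched here under assumptions, is to retry the click for a short time until the overlay is gone. The helper retry_until and the FakeNextLink stand-in are hypothetical names made up for this illustration; neither is part of GoogleScraper or Selenium, and the sketch has not been tried against Google itself.

```python
import time


def retry_until(action, retry_on=(Exception,), timeout=5.0, poll=0.25):
    """Call action() until it succeeds or `timeout` seconds have elapsed.

    Hypothetical helper, not part of GoogleScraper or Selenium.
    Returns action()'s result; re-raises the last error once the deadline passes.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return action()
        except retry_on:
            if time.monotonic() >= deadline:
                raise
            time.sleep(poll)


class FakeNextLink:
    """Stand-in for the next page element: the first clicks fail the way a
    click intercepted by an overlay would, later clicks succeed."""

    def __init__(self, blocked_clicks):
        self.blocked_clicks = blocked_clicks
        self.attempts = 0

    def click(self):
        self.attempts += 1
        if self.attempts <= self.blocked_clicks:
            raise RuntimeError("Element is not clickable at point (338, 294)")
        return "clicked"


link = FakeNextLink(blocked_clicks=2)
result = retry_until(link.click, retry_on=(RuntimeError,), timeout=2.0, poll=0.01)
print(result, link.attempts)  # prints: clicked 3
```

With a real driver, the same helper could presumably be called with the next page element's click method and retry_on=(WebDriverException,) from selenium.common.exceptions. An alternative that avoids retrying is to wait for the overlay to disappear first, for example with WebDriverWait and expected_conditions.invisibility_of_element_located((By.ID, "flyr")), and only then click.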