clemfromspace / scrapy-selenium

Scrapy middleware to handle javascript pages using selenium
Do What The F*ck You Want To Public License
925 stars 354 forks

Handle timeout exception from selenium and still return the page #58

Open michelts opened 4 years ago

michelts commented 4 years ago

Hi @clemfromspace

I'm using wait_time and wait_until to wait for a page to render, but sometimes the page renders in a way I'm not expecting. If I don't use wait_time, I see the rendered content (if it rendered fast enough), but with wait_time, Selenium raises a timeout exception and Scrapy never parses the result.

I'm not sure whether the current behavior is useful in some way. I think the approach should be the opposite: we should handle the exception and still return whatever content was found to Scrapy, so I can at least see the snapshot or the HTML content.
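A minimal sketch of the fallback described above, not the actual scrapy-selenium code. The names `page_source_despite_timeout`, `wait_for`, and `timeout_exc` are hypothetical. In a real middleware, `wait_for` would presumably wrap the `WebDriverWait(...).until(request.wait_until)` call seen in the traceback, and `timeout_exc` would be `selenium.common.exceptions.TimeoutException`. The helper takes both as arguments so the idea can be shown without importing Selenium.

```python
from typing import Callable, Tuple, Type


def page_source_despite_timeout(
    driver,
    wait_for: Callable[[], object],
    timeout_exc: Type[BaseException],
) -> Tuple[str, bool]:
    """Run the wait; on timeout, still return whatever the driver rendered.

    Hypothetical helper. `driver` only needs a `page_source` attribute,
    which Selenium WebDriver objects provide.
    """
    timed_out = False
    try:
        wait_for()
    except timeout_exc:
        # Swallow only the timeout; any other error still propagates.
        timed_out = True
    # page_source holds the DOM as it is right now, complete or not.
    return driver.page_source, timed_out


# Assumed wiring inside a middleware's process_request (not executed here):
#
#   html, timed_out = page_source_despite_timeout(
#       self.driver,
#       lambda: WebDriverWait(self.driver, request.wait_time).until(request.wait_until),
#       TimeoutException,
#   )
#   request.meta["selenium_timed_out"] = timed_out  # hypothetical meta key
```

Returning the `timed_out` flag alongside the HTML would let the spider callback decide whether a partially rendered page is worth parsing.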

michelts commented 4 years ago

Just to note, the exception raised in Scrapy is:

Traceback (most recent call last):
  File ".../lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File ".../lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 38, in process_request
    response = yield method(request=request, spider=spider)
  File ".../lib/python3.6/site-packages/scrapy_selenium/middlewares.py", line 115, in process_request
    request.wait_until
  File ".../lib/python3.6/site-packages/selenium/webdriver/support/wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message: 

dustinmichels commented 3 years ago

I am also wondering how to correctly handle the TimeoutException, so I can still parse the page with scrapy even if the content doesn't load.

aivoric commented 3 years ago

I have the same issue. In my case, I want to retry the request that hit a selenium.common.exceptions.TimeoutException. However, that doesn't seem to work either, because Scrapy doesn't know there was a timeout, so it can't pass the response object to the retry middleware.
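One possible way to approach the retry case, sketched under assumptions rather than taken from scrapy-selenium. Scrapy's documentation says a downloader middleware's `process_exception` hook is called when a `process_request` raises, and that returning a Request from it reschedules that request. The helper name `retry_after_timeout`, the `timeout_exc` parameter, and the meta key `selenium_timeout_retries` are all hypothetical. The sketch reuses the same request object instead of copying it, on the assumption that a copy might not carry over Selenium-specific arguments such as `wait_until`.

```python
from typing import Optional, Type


def retry_after_timeout(
    request,
    exception: BaseException,
    timeout_exc: Type[BaseException],
    max_retries: int = 2,
) -> Optional[object]:
    """Return the request to reschedule it, or None to let the error through.

    Hypothetical helper. `request` only needs `meta` (a dict) and
    `dont_filter`, which Scrapy Request objects provide.
    """
    if not isinstance(exception, timeout_exc):
        return None  # not a timeout: leave it to other handlers
    retries = request.meta.get("selenium_timeout_retries", 0)
    if retries >= max_retries:
        return None  # give up after max_retries attempts
    request.meta["selenium_timeout_retries"] = retries + 1
    # The duplicate filter has already seen this request, so bypass it.
    request.dont_filter = True
    return request


# Assumed wiring in a custom downloader middleware (not executed here):
#
#   def process_exception(self, request, exception, spider):
#       return retry_after_timeout(request, exception, TimeoutException)
```

Such a middleware would also need to be enabled in `DOWNLOADER_MIDDLEWARES`; whether it interacts cleanly with Scrapy's built-in retry middleware is untested here.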