Open borys25ol opened 4 years ago
I am also curious about this since concurrency is really important for my app.
There is a closed ticket about this here I believe.
@borys25ol one possible way is by subclassing the middleware and statically configuring it. Not the prettiest solution, but it should be straightforward to implement.
from shutil import which

from scrapy import signals
from scrapy_selenium import SeleniumMiddleware


class MyMiddleware1(SeleniumMiddleware):
    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls(
            driver_name='firefox',
            driver_executable_path=which('geckodriver'),  # resolve the binary on PATH
            driver_arguments=[],
        )
        crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
        return middleware


class MyMiddleware2(SeleniumMiddleware):
    # ...
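As for wiring the subclasses in: each crawl would enable one of them through the DOWNLOADER_MIDDLEWARES setting (or a spider's custom_settings). A minimal sketch, where the module path 'myproject.middlewares' and the priority 800 are my own placeholders, not part of this project:

```python
# Hypothetical settings.py fragment; adjust paths and priority to your project.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': None,    # disable the stock middleware
    'myproject.middlewares.MyMiddleware1': 800,    # enable the configured subclass
}
```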
I've figured out a working solution to this issue that fits my needs, but it is a bit involved (it requires an async driver pool). If this project is still being maintained, I'd be down to submit a PR for it if I have some free time and there's still some interest.
Because scrapy uses twisted, I found the key to this is that the middleware's process_request() method can also return a twisted.internet.defer.Deferred with a response in its callback argument.
Hi @Flushot!! I'm facing the same problem and I'm interested in your solution. How did you do it?
Thanks a lot!
Andreu Jové
@AndreuJove
The gist of it is that because process_request() can either return a standard response object or a twisted deferred (and because scrapy is itself built on twisted), the handling of downloads can be done in an asynchronous way. This opens up an opportunity for the downloader middleware to manage a pool of drivers asynchronously (and allows for concurrent requests to be sent to that pool).
The code I wrote has deviated from the version in this repo quite a bit, so I may either fork or try to find time to re-integrate. Here's a high level overview:
- process_request() intercepts each request and ultimately handles it asynchronously by returning a deferred instead of a response object.
- A driver is allocated: either a webdriver.Remote session or a local webdriver.Firefox/webdriver.Chrome session, depending on whether Selenium Grid is being used (which I registered under a new config key called SELENIUM_HUB_URL).
- The driver is released back to the pool once the response or an error has been handled (in process_spider_output or process_spider_exception respectively) so that other pending requests can use it again.
- The spider can still access the request.meta['driver'] object reliably.
- On release the driver is either reset (navigated to about:blank) or the driver is quit() so that the next request will cause a new allocation.

Hopefully that clears things up.
Dear Flushot,
Thank you for your explanation. Do you have this code in a repository? I think it will be easier for me to understand.
Thanks a lot again,
Andreu
Unfortunately I don't yet. The code I have is private (and is coupled to private libraries). I'd be down to fork and integrate my changes when I get some free time.
How do you connect the new middleware classes with the spider?
Hello,
Are there any ideas on how to modify the SeleniumMiddleware to distribute ongoing requests across several browser windows?