clemfromspace / scrapy-selenium

Scrapy middleware to handle javascript pages using selenium
Do What The F*ck You Want To Public License

Run Scrapy with more than 1 browser. #76

Open borys25ol opened 4 years ago

borys25ol commented 4 years ago

Hello,

Are there any ideas on how to modify the Selenium middleware so that ongoing requests are distributed across several browser windows?

Tobeyforce commented 4 years ago

I am also curious about this since concurrency is really important for my app.

MapsGraphsCharts commented 3 years ago

There is a closed ticket about this here, I believe:

https://github.com/clemfromspace/scrapy-selenium/issues/13

0b11001111 commented 3 years ago

@borys25ol one possible way is to subclass the middleware and configure it statically. Not the prettiest solution, but it should be straightforward to implement.

from shutil import which

from scrapy import signals
from scrapy_selenium import SeleniumMiddleware


class MyMiddleware1(SeleniumMiddleware):

    @classmethod
    def from_crawler(cls, crawler):
        # Configure the driver statically instead of reading the SELENIUM_*
        # project settings, so each subclass owns its own browser.
        middleware = cls(
            driver_name='firefox',
            driver_executable_path=which('geckodriver'),
            driver_arguments=[],
        )

        crawler.signals.connect(middleware.spider_closed, signals.spider_closed)

        return middleware


class MyMiddleware2(SeleniumMiddleware):
    # ...
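
For completeness, subclassed middlewares like these would be enabled the usual way in the project's settings.py (the 'myproject.middlewares' path and the priority numbers below are just placeholders):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.MyMiddleware1': 800,
    'myproject.middlewares.MyMiddleware2': 801,
}

Note that each subclass would also need some way to decide which requests it handles (e.g. a flag in request.meta); otherwise the first one in priority order will handle every SeleniumRequest.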
Flushot commented 3 years ago

I've figured out a working solution to this issue that fits my needs, but it's a bit involved (it requires an async driver pool). If this project is still being maintained, I'd be down to submit a PR when I have some free time, if there's still interest.

Because scrapy uses twisted, I found the key to this is that the middleware's process_request() method can also return a twisted.internet.defer.Deferred with a response in its callback argument.
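
For example (a made-up minimal middleware, not scrapy-selenium's actual code), the blocking Selenium call can be pushed to a worker thread with deferToThread, which already returns a Deferred:

from scrapy.http import HtmlResponse
from selenium import webdriver
from twisted.internet.threads import deferToThread


class DeferredSeleniumMiddleware:
    """Made-up minimal example: process_request returns a Deferred."""

    def __init__(self):
        self.driver = webdriver.Firefox()

    def process_request(self, request, spider):
        def fetch():
            # The blocking Selenium work happens in a worker thread.
            self.driver.get(request.url)
            return HtmlResponse(
                request.url,
                body=self.driver.page_source.encode('utf-8'),
                encoding='utf-8',
                request=request,
            )

        # Scrapy waits on the Deferred and uses its result as the response.
        return deferToThread(fetch)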

AndreuJove commented 3 years ago

Hi @Flushot !! I'm facing the same problem and I'm interested in your solution. How did you do it?

Thanks a lot!

Andreu Jové

Flushot commented 3 years ago

@AndreuJove

The gist of it is that because process_request() can either return a standard response object or a twisted deferred (and because scrapy is itself built on twisted), the handling of downloads can be done in an asynchronous way. This opens up an opportunity for the downloader middleware to manage a pool of drivers asynchronously (and allows for concurrent requests to be sent to that pool).

The code I wrote has deviated from the version in this repo quite a bit, so I may either fork or try to find time to re-integrate. Here's a high-level overview:

Hopefully that clears things up.
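
As a rough sketch of that idea (illustrative only, not the private implementation; the class name and pool size are made up), the middleware could hand drivers out through a twisted DeferredQueue:

from scrapy.http import HtmlResponse
from selenium import webdriver
from twisted.internet.defer import DeferredQueue, inlineCallbacks
from twisted.internet.threads import deferToThread


class SeleniumPoolMiddleware:
    """Illustrative sketch: a fixed pool of drivers shared via a DeferredQueue."""

    def __init__(self, pool_size=4):
        self.queue = DeferredQueue()
        for _ in range(pool_size):
            self.queue.put(webdriver.Firefox())

    @inlineCallbacks
    def process_request(self, request, spider):
        # Wait (without blocking the reactor) until a driver is free.
        driver = yield self.queue.get()
        try:
            def fetch():
                driver.get(request.url)    # blocking Selenium calls run
                return driver.page_source  # inside a worker thread
            body = yield deferToThread(fetch)
        finally:
            self.queue.put(driver)  # hand the driver back to the pool
        return HtmlResponse(request.url, body=body.encode('utf-8'),
                            encoding='utf-8', request=request)

Requests then queue up for a free driver instead of sharing a single one, which is what allows several browsers to run concurrently.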

AndreuJove commented 3 years ago

Dear Flushot,

Thank you for your explanation. Do you have this code in a repository? I think it would be easier for me to understand that way.

Thanks a lot again,

Andreu

Flushot commented 3 years ago

Unfortunately I don't yet. The code I have is private (and is coupled to private libraries). I'd be down to fork and integrate my changes when I get some free time.

vionwinnie commented 2 years ago

(quoting 0b11001111's subclassing example above)

How do you connect the new middleware classes with the spider?