Open borys25ol opened 4 years ago
I am also curious about this since concurrency is really important for my app.
There is a closed ticket about this here I believe.
@borys25ol one possible way is by subclassing the middleware and statically configuring it. Not the prettiest solution, but it should be straightforward to implement.
from shutil import which

from scrapy import signals
from scrapy_selenium import SeleniumMiddleware


class MyMiddleware1(SeleniumMiddleware):
    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls(
            driver_name='firefox',
            driver_executable_path=which('geckodriver'),  # resolve the binary on PATH
            driver_arguments=[],
        )
        crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
        return middleware


class MyMiddleware2(SeleniumMiddleware):
    # ...
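As for wiring the subclasses in: each crawl would enable one of them through the DOWNLOADER_MIDDLEWARES setting (or a spider's custom_settings). A minimal sketch, where the module path 'myproject.middlewares' and the priority 800 are my own placeholders, not part of this project:

```python
# Hypothetical settings.py fragment; adjust paths and priority to your project.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': None,    # disable the stock middleware
    'myproject.middlewares.MyMiddleware1': 800,    # enable the configured subclass
}
```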
I've figured out a working solution to this issue that fits my needs, but it is a bit involved (it requires an async driver pool). If this project is still being maintained, I'd be down to submit a PR for it if I have some free time and there's still some interest.
Because scrapy uses twisted, I found the key to this is that the middleware's process_request() method can also return a twisted.internet.defer.Deferred with a response in its callback argument.
Hi @Flushot!! I'm facing the same problem and I'm interested in your solution. How did you do it?
Thanks a lot!
Andreu Jové
@AndreuJove
The gist of it is that because process_request() can either return a standard response object or a twisted deferred (and because scrapy is itself built on twisted), the handling of downloads can be done in an asynchronous way. This opens up an opportunity for the downloader middleware to manage a pool of drivers asynchronously (and allows for concurrent requests to be sent to that pool).
The code I wrote has deviated from the version in this repo quite a bit, so I may either fork or try to find time to re-integrate. Here's a high level overview:
- process_request() intercepts each request and ultimately handles it asynchronously by returning a deferred instead of a response object.
- A driver is allocated: either a webdriver.Remote session or a local webdriver.Firefox/webdriver.Chrome session, depending on whether Selenium Grid is being used (which I registered under a new config key called SELENIUM_HUB_URL).
- The driver is released back to the pool once the response or an error has been handled (in process_spider_output or process_spider_exception respectively) so that other pending requests can use it again.
- The spider can still access the request.meta['driver'] object reliably.
- On release the driver is either reset (navigated to about:blank) or the driver is quit() so that the next request will cause a new allocation.

Hopefully that clears things up.
Dear Flushot,
Thank you for your explanation. Do you have this code in a repository? I think it will be easier for me to understand.
Thanks a lot again,
Andreu
Unfortunately I don't yet. The code I have is private (and is coupled to private libraries). I'd be down to fork and integrate my changes when I get some free time.
How do you connect the new middleware classes with the spider?
Hello,
Are there any ideas on how to modify the SeleniumMiddleware to distribute ongoing requests across several browser windows?