brandicted / scrapy-webdriver

MIT License

OffsiteMiddleware not working #6

Open samos123 opened 11 years ago

samos123 commented 11 years ago

I saw that the request is replaced with one that has dont_filter=True. If I remove that, the spider just stops when it gets to the same URL.

I need to use the offsite middleware though, so any thoughts?
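For context, here is a rough, standalone model of why dont_filter=True gets in the way of the offsite middleware. It is a sketch of the documented behaviour (both the offsite check and the duplicate filter skip requests flagged dont_filter), not Scrapy's actual code. The `Request` class and the helper names are stand-ins invented for illustration.

```python
from urllib.parse import urlparse


class Request:
    """Tiny stand-in for scrapy.Request (hypothetical, illustration only)."""

    def __init__(self, url, dont_filter=False):
        self.url = url
        self.dont_filter = dont_filter


def is_onsite(url, allowed_domains):
    """True if the URL's host is an allowed domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


def offsite_filter(requests, allowed_domains):
    """Model of the offsite check: dont_filter requests always pass."""
    for req in requests:
        if req.dont_filter or is_onsite(req.url, allowed_domains):
            yield req


def dupe_filter(requests):
    """Model of the duplicate filter: dont_filter requests are never dropped."""
    seen = set()
    for req in requests:
        if not req.dont_filter:
            if req.url in seen:
                continue  # already scheduled once, drop it
            seen.add(req.url)
        yield req
```

In this model an offsite request with dont_filter=True slips past the offsite check, while a repeated URL without the flag is dropped by the duplicate filter. That would be consistent with the spider stopping once the flag is removed, though I have not confirmed that this is the cause here.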

I will do some hacking on a total rewrite where there is no need for the spider middleware, only a DownloaderMiddleware or a normal Downloader. Starting to understand this stuff a little, hehe.

ncadou commented 11 years ago

If I remember correctly, dont_filter=True comes from an earlier experiment where requests were not queued up in the spider middleware. They would be rescheduled in the Scrapy queue and then dropped by the offsite middleware. I'm not sure why it'd still be needed though. Do you have an idea where exactly the spider stops?

Another reason for needing WebdriverSpiderMiddleware is that we need to keep track of when a spider parse method finishes working with the webdriver instance it was assigned. Until the parsing is finished, the webdriver instance should not be changed by any other spider activity. We could have the spider parse method explicitly release the webdriver instance, but that looks error-prone and in general not very clean to me. My concern here is ease of use: making WebdriverRequest as much of a drop-in replacement for the stock Request as possible.

The spider middleware layer ended up being the best place to do the accounting and the future multiple instance management.
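To illustrate the accounting idea, here is a minimal sketch that assumes the parse callback returns an iterable. Wrapping that iterable in a generator lets the middleware release the webdriver once the callback's output is exhausted, without the spider doing it explicitly. `WebdriverSlot` and `track_parse_output` are hypothetical names, not the project's actual code. In Scrapy, a wrapper like this would typically be applied in a spider middleware's `process_spider_output(response, result, spider)`.

```python
class WebdriverSlot:
    """Hypothetical holder for a single shared webdriver instance."""

    def __init__(self):
        self.in_use = False

    def acquire(self):
        if self.in_use:
            raise RuntimeError("webdriver is still held by another callback")
        self.in_use = True

    def release(self):
        self.in_use = False


def track_parse_output(result, slot):
    """Pass through the parse callback's output, then free the webdriver.

    The finally clause runs when the iterable is exhausted, when the
    callback raises, or when the generator is closed early.
    """
    try:
        for item in result:
            yield item
    finally:
        slot.release()
```

The point of the design is that the spider code stays the same as with a stock Request: the release happens as a side effect of the framework consuming the callback's output.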

samos123 commented 11 years ago

Yeah, I noticed the same thing. No idea why yet. I've been looking at the related code without much success so far.

I see. Yeah, we need the webdriver if people still want to use it. Couldn't we just pass a deep copy? I guess not, because it would be interacting with the same remote webdriver.

You're right, I think: for using the webdriver in the spider, the spider middleware seems like a nice solution. I am mostly using this for rendering pages with JavaScript, so I didn't get to that part yet.

I hacked something together for my own use case last night, which uses the downloader only. The offsite middleware works fine there. I peeked at https://github.com/scrapinghub/scrapyjs/ for ideas. Here is the result of yesterday's hacking: https://github.com/samos123/scrapy-webdriver/tree/downloader-only