brandicted / scrapy-webdriver

MIT License
143 stars 63 forks source link

Stuck on Downloading for a long time #3

Open samos123 opened 11 years ago

samos123 commented 11 years ago

I'm currently seeing that it's stuck on "Downloading" for a long time. Could it be that the request timed out, so it won't continue? Are requests currently not concurrent because of the queues? Does it only take requests out of the queue one by one?

2013-05-14 13:46:23+0800 [scrapy] DEBUG: Downloading http://xxxxl.com/item.html with webdriver
2013-05-14 13:46:32+0800 [xxx] INFO: Crawled 23 pages (at 23 pages/min), scraped 9 items (at 9 items/min)
2013-05-14 13:47:32+0800 [xxx] INFO: Crawled 23 pages (at 0 pages/min), scraped 9 items (at 0 items/min)
2013-05-14 13:48:32+0800 [xxx] INFO: Crawled 23 pages (at 0 pages/min), scraped 9 items (at 0 items/min)
2013-05-14 13:49:32+0800 [xxx] INFO: Crawled 23 pages (at 0 pages/min), scraped 9 items (at 0 items/min)
2013-05-14 13:50:32+0800 [xx] INFO: Crawled 23 pages (at 0 pages/min), scraped 9 items (at 0 items/min)  

Feature description: add the ability to spawn multiple webdrivers so we can perform scrapy requests concurrently.

For this we need an extra option, a maximum number of webdrivers, as the pool shouldn't grow indefinitely.

The reason it got stuck on downloading is probably that PhantomJS crashed:

[DEBUG - 2013-05-18T04:28:00.536Z] Session [399aee20-bf06-11e2-a1b3-1ff9fbb8ef48] - _execFuncAndWaitForLoadDecorator - Page Loading in Session: true
[DEBUG - 2013-05-18T04:28:00.637Z] Session [399aee20-bf06-11e2-a1b3-1ff9fbb8ef48] - _execFuncAndWaitForLoadDecorator - Page Loading in Session: true
ExceptionHandler::GenerateDump waitpid failed:No child processes
PhantomJS has crashed. Please read the crash reporting guide at https://github.com/ariya/phantomjs/wiki/Crash-Reporting and file a bug report at https://github.com/ariya/phantomjs/issues/new with the crash dump file attached: /tmp/75f0d88c-1f16-3dd6-4a2892d0-687e48d0.dmp

So we may also need a way to check whether PhantomJS is still responding and, if not, automatically restart the webdriver/PhantomJS process.
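One possible shape for that restart logic is a liveness check that issues a cheap webdriver command and rebuilds the driver on failure. This is only a sketch: `make_driver` is a hypothetical factory (e.g. something that constructs a fresh PhantomJS driver), and in real code you would catch selenium's `WebDriverException` rather than the broad default used here to keep the sketch dependency-free.

```python
def ensure_alive(driver, make_driver, errors=Exception):
    """Return a responsive driver, replacing it if the process died.

    `make_driver` is an assumed factory that builds a new driver.
    In practice, pass errors=selenium.common.exceptions.WebDriverException.
    """
    try:
        driver.current_url  # any cheap command; raises if PhantomJS is gone
        return driver
    except errors:
        try:
            driver.quit()   # best-effort cleanup of the dead process
        except errors:
            pass
        return make_driver()
```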

ncadou commented 11 years ago

Requests run concurrently in scrapy in the sense that they won't block the main twisted event loop; stock scrapy requests will therefore go through concurrently even while an unfinished webdriver request is downloading something. However, all webdriver requests are attached to a specific webdriver instance (which itself needs to enforce sequential access, for obvious reasons), and I haven't gotten around to implementing support for multiple webdriver instances yet, so in practice only one webdriver request may be performed at a time.
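The sequential-access constraint described above can be sketched as a single driver guarded by a lock. The names here are illustrative, not the actual scrapy-webdriver internals (which use twisted Deferreds rather than threads), but the effect is the same: one in-flight webdriver request at a time, while non-webdriver requests proceed freely.

```python
import threading

class SequentialDriverGate:
    """Serializes access to a single webdriver instance (illustrative)."""

    def __init__(self, driver):
        self._driver = driver
        self._lock = threading.Lock()  # only one holder at a time

    def download(self, url, parse):
        with self._lock:                # the next webdriver request waits here
            self._driver.get(url)
            return parse(self._driver)  # page must stay loaded until parsing ends
```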

samos123 commented 11 years ago

Ah, I see, so we basically want multiple webdriver instances as a new feature? I'm probably being stupid, but what are the obvious reasons? Just wondering; I'm pretty new to the webdriver stuff.

Thanks again for your detailed reply. Helps me a lot!


ncadou commented 11 years ago

You got that exactly right: support for multiple webdriver instances would be a new feature for scrapy-webdriver. And no worries about being stupid; you have no idea how much head-banging my desk had to suffer when I was trying to make sense of twisted and scrapy. :)

As for the obvious reasons: a webdriver instance is basically like a browser with just one tab, so trying to download two things at the same time would not work at all. On top of that, the state of that browser and its currently loaded page need to be left untouched until the parser method in the scrapy spider has finished working with it.

samos123 commented 11 years ago

OK, I may give this feature a try if you don't mind; it gives me a reason to learn more about Twisted, Scrapy and Selenium. It may take some time, though, and I'm not sure I'll finish at all, as I've got a lot of other stuff going on too.

I'm amazed so few people are using this, btw.

ncadou commented 11 years ago

I would certainly not mind contributions. As for the low usage, this project is still very young, so I'm not surprised.

stringertheory commented 11 years ago

@ncadou Do you think it would be feasible to allow for parallel scrapy-webdriver requests using multiple tabs or windows in a single webdriver instance instead of extending to multiple webdriver instances (to avoid overhead)?

ncadou commented 11 years ago

There are ways with webdriver to create tabs and windows and to switch between them, so it should be possible to implement that support in scrapy-webdriver.
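As a sketch of that approach: Selenium exposes `window_handles` and `switch_to.window()`, and a new window can be spawned with a small script, so one driver instance could in principle host several in-flight pages. The helper below is illustrative, not part of scrapy-webdriver.

```python
def open_in_new_window(driver, url):
    """Open `url` in a fresh browser window and switch to it (sketch)."""
    before = set(driver.window_handles)
    driver.execute_script("window.open('about:blank');")   # spawn a window
    new_handle = (set(driver.window_handles) - before).pop()
    driver.switch_to.window(new_handle)                    # focus the new window
    driver.get(url)
    return new_handle
```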

IIIypuk09 commented 11 years ago

@ncadou Could you add a feature to use multiple webdrivers, driven by one of the following settings: 'CONCURRENT_REQUESTS', 'CONCURRENT_REQUESTS_PER_DOMAIN', 'CONCURRENT_REQUESTS_PER_IP'?

How long until we can expect this feature?

ncadou commented 11 years ago

@IIIypuk09 multiple webdriver instances are planned down the line, and your suggestion about using settings makes total sense, but unfortunately I don't know when I'll have the opportunity to implement that feature.
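The settings-driven approach suggested above could take the shape of a fixed-size driver pool. This is a hypothetical sketch, not the planned implementation: `make_driver` is an assumed factory, and the pool size would come from a scrapy setting such as CONCURRENT_REQUESTS.

```python
import queue

class WebdriverPool:
    """Fixed-size pool of webdriver instances (illustrative sketch)."""

    def __init__(self, make_driver, size):
        self._drivers = queue.Queue()
        for _ in range(size):           # e.g. size = settings CONCURRENT_REQUESTS
            self._drivers.put(make_driver())

    def acquire(self):
        return self._drivers.get()      # blocks when all drivers are busy

    def release(self, driver):
        self._drivers.put(driver)       # hand the driver back for reuse
```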