matthewpalmer closed this issue 9 years ago
Will find a queueing system so we can jump back in when the crawler inevitably crashes.
Maybe we should prevent crashes by only scraping short bursts of pages, restarting phantom for each burst (as well as handling crashes with the queue)? That assumes we're certain it won't crash within n scrapes, though.
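Rough sketch of the burst idea, assuming phantomjs is on the PATH and a made-up scrape.js that takes a batch of URLs as arguments (neither is in the repo yet):

```typescript
// Spawn a fresh phantom process per batch so leaks can't accumulate.
import { spawnSync } from "child_process";

const BATCH_SIZE = 25; // n is a guess; tune to however long phantom stays healthy

function crawlInBursts(urls: string[]): void {
  for (let i = 0; i < urls.length; i += BATCH_SIZE) {
    const batch = urls.slice(i, i + BATCH_SIZE);
    // One phantom process per burst; it exits after the batch finishes.
    const result = spawnSync("phantomjs", ["scrape.js", ...batch], {
      stdio: "inherit",
      timeout: 5 * 60_000, // hard stop if the whole burst hangs
    });
    if (result.status !== 0) {
      console.warn(`burst starting at index ${i} failed; batch needs a retry`);
    }
  }
}
```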
Yeah, that’s probably a good idea. A quick way to do it would be a timeout of, say, 30 seconds for each scrape; if we exceed that, just crash and restart the process. That way we don’t have to hard-code n scrapes.
Not sure which is easier to implement; I’m happy with whichever.
I guess the 30-second method would be easier? It's easy enough to change if it doesn't work out.
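Something like this, maybe, with scrape standing in for whatever function actually drives phantom (that name is a placeholder, not the repo's API):

```typescript
// Race each scrape against a 30-second deadline.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`scrape timed out after ${ms}ms`)),
      ms
    );
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); }
    );
  });
}

async function run(urls: string[], scrape: (url: string) => Promise<void>) {
  for (const url of urls) {
    try {
      await withTimeout(scrape(url), 30_000);
    } catch {
      // Treat a slow scrape as a wedged phantom: exit so a supervisor
      // (or the queue's redelivery) can restart the process and retry the job.
      process.exit(1);
    }
  }
}
```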
btw, what are you running this on? MacBook Air?
Yeah, base model 2011 MacBook Air.
Do you have any preference for the queue? Otherwise I'm probably going to go with beanstalkd.
Once I get internet, I might give my bigger laptop a go.
No preference, use whatever.
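For the record, the beanstalkd protocol is plain text and simple enough to poke at directly; a rough sketch of putting a crawl job into a tube (host, port, and the "crawl" tube name are guesses, and a real client library would handle response parsing):

```typescript
import * as net from "net";

const sock = net.connect(11300, "127.0.0.1", () => {
  sock.write("use crawl\r\n"); // pick the tube crawl jobs live in
  const job = Buffer.from("http://example.com/page/1");
  // put <priority> <delay> <time-to-run seconds> <bytes>\r\n<payload>\r\n
  sock.write(`put 0 0 60 ${job.length}\r\n`);
  sock.write(job);
  sock.write("\r\n");
});

sock.on("data", (chunk) => {
  // Expect "USING crawl" then "INSERTED <id>" on success.
  console.log(chunk.toString().trim());
  sock.end();
});
```

The time-to-run bit is what makes this fit the crash story: if a worker reserves a job and dies mid-scrape, beanstalkd hands the job out again once the TTR expires.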
Add incremental crawling so that if a crawling process drops out, we can specify a point or URL to start up again from.
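A bare-bones version could just checkpoint a cursor after each successful scrape and read it back on startup; the file name and the scrape parameter here are placeholders, not anything in the repo:

```typescript
import * as fs from "fs";

const CURSOR_FILE = ".crawl-cursor"; // assumed location for the checkpoint

function loadCursor(): number {
  try {
    return parseInt(fs.readFileSync(CURSOR_FILE, "utf8"), 10) || 0;
  } catch {
    return 0; // no cursor yet: start from the beginning
  }
}

function saveCursor(i: number): void {
  fs.writeFileSync(CURSOR_FILE, String(i));
}

async function crawl(urls: string[], scrape: (url: string) => Promise<void>) {
  // Resume from wherever the last run got up to.
  for (let i = loadCursor(); i < urls.length; i++) {
    await scrape(urls[i]);
    saveCursor(i + 1); // checkpoint after every successful scrape
  }
}
```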