fraserh / waldi


Incremental crawling #6

Closed: matthewpalmer closed this issue 9 years ago

matthewpalmer commented 9 years ago

Add incremental crawling so that if a crawling process drops out, we can specify a point or URL to start up again from
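
One way the resume point could work is a small checkpoint that records the last URL finished, so a restarted crawl skips everything already done. A minimal sketch in Python, assuming a file-based cursor; the names (`crawl`, `scrape`, `checkpoint.txt`) are illustrative, not from the repo:

```python
# Illustrative sketch only: a file-based checkpoint so a crawl can resume
# from the last URL it finished. All names here are hypothetical.
import os

CHECKPOINT = "checkpoint.txt"

def load_resume_index(urls):
    """Return the index in urls to resume from, based on the last saved URL."""
    if not os.path.exists(CHECKPOINT):
        return 0
    with open(CHECKPOINT) as f:
        last = f.read().strip()
    return urls.index(last) + 1 if last in urls else 0

def crawl(urls, scrape):
    start = load_resume_index(urls)
    for url in urls[start:]:
        scrape(url)                       # hypothetical per-page scrape function
        with open(CHECKPOINT, "w") as f:  # record progress after each page
            f.write(url)
```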

matthewpalmer commented 9 years ago

Will find a queueing system so we can jump back in when the crawler inevitably crashes

fraserhemp commented 9 years ago

Maybe we should prevent crashes by only scraping short bursts of pages, restarting phantom for each burst? That assumes we are certain it won't crash for n scrapes though.

fraserhemp commented 9 years ago

(as well as handling crashes with the queue)
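
A minimal sketch of the burst idea, assuming the scraper can be driven as a child process; the `phantomjs scrape.js` invocation and the burst size are assumptions, not the repo's actual interface:

```python
# Illustrative sketch: run the scraper in bursts of N pages, restarting the
# PhantomJS process between bursts so crashes and leaks don't accumulate.
import subprocess

BURST_SIZE = 50  # pages per PhantomJS process; tune to whatever stays stable

def crawl_in_bursts(urls):
    for i in range(0, len(urls), BURST_SIZE):
        burst = urls[i:i + BURST_SIZE]
        # One fresh PhantomJS process per burst (command is an assumption)
        subprocess.run(["phantomjs", "scrape.js", *burst], check=False)
```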

matthewpalmer commented 9 years ago

Yeah that’s probably a good idea. A quick way to do it would be to have a timeout of like 30 seconds for each scrape, and if we exceed that then just crash and restart the process. Will mean that we don’t have to hard-code n scrapes.

Not sure which is easier to implement, I’m happy with whichever
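
A minimal sketch of the 30-second timeout approach, again assuming each scrape runs as a child process; `subprocess.run` kills the child and raises `TimeoutExpired` when the limit is hit, so there is no need to hard-code n scrapes (the `phantomjs scrape.js <url>` command is an assumption):

```python
# Illustrative sketch of the per-scrape timeout: if one page takes longer
# than 30 seconds, kill the process and move on instead of hanging.
import subprocess

SCRAPE_TIMEOUT = 30  # seconds allowed per page

def scrape_with_timeout(url):
    """Scrape one page in a child process; kill it and report failure if it hangs."""
    try:
        subprocess.run(["phantomjs", "scrape.js", url],
                       timeout=SCRAPE_TIMEOUT, check=True)
        return True
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return False  # caller can re-queue the URL and keep going

def crawl(urls):
    for url in urls:
        if not scrape_with_timeout(url):
            print(f"{url} timed out or crashed; continuing with the next page")
```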

fraserhemp commented 9 years ago

I guess the 30 second method would be easier? Easy enough to change if it doesn't work out

fraserhemp commented 9 years ago

btw, what are you running this on? MacBook Air?


matthewpalmer commented 9 years ago

Yeah, base model 2011 MacBook Air

matthewpalmer commented 9 years ago

Do you have any preference for the queue? Probably going to go with beanstalkd otherwise
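
A minimal sketch of the beanstalkd approach, assuming the `greenstalk` Python client (the choice of client library is an assumption): URLs are enqueued, and a worker deletes a job only after a successful scrape, so a crash leaves the job in the queue for the next run.

```python
# Illustrative sketch: beanstalkd as the crawl queue, via the greenstalk client.
import greenstalk

def enqueue(urls, host="127.0.0.1", port=11300):
    """Producer: push every URL onto the beanstalkd queue."""
    with greenstalk.Client((host, port)) as client:
        for url in urls:
            client.put(url)

def work(scrape, host="127.0.0.1", port=11300):
    """Worker: reserve one URL at a time, delete it only after success."""
    with greenstalk.Client((host, port)) as client:
        while True:
            job = client.reserve()
            try:
                scrape(job.body)     # hypothetical per-page scrape function
                client.delete(job)   # done: remove the job from the queue
            except Exception:
                client.release(job)  # failed: return it to the queue for retry
```

Releasing a failed job back onto the queue is what gives the "jump back in" behaviour discussed above: the worker process can crash and restart without losing work.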

fraserhemp commented 9 years ago

Once I get internet, I might give my bigger laptop a go.

fraserhemp commented 9 years ago

No preference, use whatever.