BIDS-projects / scraper

Collects data from websites of data science institutions

Solve hogging problem #29

Open don-han opened 8 years ago

don-han commented 8 years ago

Suggested by @alvinwan:

Use the priority argument of Scrapy requests: http://doc.scrapy.org/en/latest/topics/request-response.html

alvinwan commented 8 years ago

Just for future reference, we could use the priority kwarg to take advantage of the priority queue (PQ) that Scrapy has built in for requests. Here is what I posted on Slack:

[2:17]
...since higher priority values correspond to, well, higher priority, just
 take the difference between max_depth and the depth of the current
 page and pass that in as the priority. We take the difference because
 we want higher priority to correspond to lower depth, effecting a BFS
 by page depth. I don't remember if this is the case, but we'd have to
 enqueue all domains first, so that it doesn't start the BFS on just one
 domain.
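
To make the idea concrete, here is a minimal sketch of the priority rule in plain Python, not the scraper's actual code. It uses a toy scheduler built on heapq as a stand-in for Scrapy's request queue. MAX_DEPTH, bfs_priority, PriorityScheduler, and crawl_order are hypothetical names chosen for illustration, and the URLs are made up.

```python
import heapq
from itertools import count

MAX_DEPTH = 3  # hypothetical crawl limit, not taken from the project


def bfs_priority(depth, max_depth=MAX_DEPTH):
    """Return a priority that is larger for shallower pages."""
    return max_depth - depth


class PriorityScheduler:
    """Toy stand-in for a scheduler that pops the highest priority first."""

    def __init__(self):
        self._heap = []
        self._tie = count()  # keeps insertion order among equal priorities

    def enqueue(self, url, depth):
        # heapq is a min-heap, so the priority is negated before pushing.
        entry = (-bfs_priority(depth), next(self._tie), url, depth)
        heapq.heappush(self._heap, entry)

    def pop(self):
        _, _, url, depth = heapq.heappop(self._heap)
        return url, depth

    def __len__(self):
        return len(self._heap)


def crawl_order(requests):
    """Enqueue (url, depth) pairs, then drain them in priority order."""
    scheduler = PriorityScheduler()
    for url, depth in requests:
        scheduler.enqueue(url, depth)
    order = []
    while len(scheduler):
        order.append(scheduler.pop())
    return order
```

With both start pages enqueued up front, the depth-0 pages of every domain are popped before any depth-1 page, which is the breadth-first behavior described above. In Scrapy itself the same value would presumably be passed as scrapy.Request(url, priority=bfs_priority(depth)), with the depth read from response.meta["depth"] when the depth middleware is enabled.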