Closed cldellow closed 1 year ago
E.g., a naive crawl of cldellow.blogspot.com occasionally gets 368 URLs, not 396.
This is because some URLs are discovered via longer crawl chains that end up exceeding the max depth.
On the one hand, this is non-deterministic, and so kind of unfortunate.
On the other hand, I think a proper fix would take more CPU (e.g. ensuring we do a breadth-first scan, so sorting every time).
Actually, maybe that's OK. Let's try that.
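A breadth-first dequeue could be sketched roughly like this: key a priority queue on (depth, url) so shallower pages are always expanded first and ties break alphabetically. With BFS ordering, every page is discovered at its minimum depth, so the set of URLs within max depth no longer depends on dequeue luck. This is only an illustrative single-threaded sketch; `get_links` is a hypothetical callback standing in for the actual fetch-and-parse step.

```python
import heapq

def crawl_order(seed_urls, get_links, max_depth):
    """Visit URLs breadth-first up to max_depth.

    get_links(url) is a hypothetical callback returning the links found
    on that page. Because the heap is keyed on (depth, url), each page is
    reached at its minimum possible depth, so the visited set is stable
    across runs (in a single thread).
    """
    heap = [(0, url) for url in sorted(set(seed_urls))]
    heapq.heapify(heap)
    seen = {url for _, url in heap}
    order = []
    while heap:
        depth, url = heapq.heappop(heap)
        order.append(url)
        if depth == max_depth:
            continue  # don't expand links beyond the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(heap, (depth + 1, link))
    return order
```

The sorting cost is O(log n) per push/pop rather than a full re-sort each time, which may be what "sorting every time" amounts to in practice.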
Improving the determinism of the order in which we dequeue items improves things, but doesn't ensure determinism of the overall crawl due to parallelism.
I think that's an OK tradeoff for now.