cldellow / datasette-scraper

Add website scraping abilities to Datasette

consider: re-insert (or update) a crawl item if it has a lower depth #36

Closed: cldellow closed this issue 1 year ago

cldellow commented 1 year ago

E.g. a naive crawl of cldellow.blogspot.com occasionally gets 368 URLs, not 396.

This is because some URLs are discovered via longer crawl chains that end up exceeding the max depth.
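The title's proposal would look roughly like the upsert below. This is a minimal sketch only; the `crawl_queue` table, its columns, and the example URL are assumptions for illustration, not the plugin's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE crawl_queue (url TEXT PRIMARY KEY, depth INTEGER NOT NULL)"
)

def enqueue(conn, url, depth):
    # Insert the URL, or lower its recorded depth if it was already queued
    # via a longer chain, so it isn't later dropped for exceeding max depth.
    conn.execute(
        """
        INSERT INTO crawl_queue (url, depth) VALUES (?, ?)
        ON CONFLICT(url) DO UPDATE SET depth = excluded.depth
        WHERE excluded.depth < crawl_queue.depth
        """,
        (url, depth),
    )

enqueue(conn, "https://cldellow.blogspot.com/p/a.html", 3)
enqueue(conn, "https://cldellow.blogspot.com/p/a.html", 1)  # rediscovered via a shorter chain
print(conn.execute("SELECT depth FROM crawl_queue").fetchone())  # -> (1,)
```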

On the one hand, the set of missed URLs is non-deterministic, which is unfortunate.

On the other hand, I think a proper fix would take more CPU (e.g. ensuring we do a breadth-first scan, which means sorting the queue on every dequeue).

Actually, maybe that's OK. Let's try that.
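A breadth-first-ish dequeue could be approximated by always taking the shallowest pending item. Continuing the assumed schema from the sketch above (again, not the plugin's actual implementation):

```python
def dequeue_next(conn):
    # Take the shallowest pending item first; tie-break on URL so the
    # dequeue order is deterministic for a given queue state.
    return conn.execute(
        """
        SELECT url, depth FROM crawl_queue
        ORDER BY depth, url
        LIMIT 1
        """
    ).fetchone()
```

Without an index on `depth`, this sorts the queue on every dequeue, which is the extra CPU cost mentioned above.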

cldellow commented 1 year ago

Improving the determinism of the order in which we dequeue items helps, but it doesn't make the overall crawl deterministic, because items are fetched in parallel.

I think that's an OK tradeoff for now.