As we crawl, we should provision an empty row for each future URL to scrape and then fill it in once we have the content. That way, if the crawler fails partway through, it knows exactly where to pick back up.
Pseudo-code:
def startup(siteId):
    # Resume from rows that were provisioned but never filled;
    # otherwise fall back to the site's configured start URL.
    startUrls = scanForEmptyRows(siteId) or [getStartUrl(siteId)]
    for url in startUrls:
        addLinkToQueue(url)

def parse(response):
    nextLinks = getNextLinks(response)
    for link in nextLinks:
        # Provision the empty row before queueing, so a crash here
        # still leaves a record of what remains to be scraped.
        provisionRow(link, siteId)
        addLinkToQueue(link)
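For concreteness, here is one way the row-provisioning helpers could be backed by SQLite. This is a minimal sketch: the pages table, its columns, and the fillRow helper are assumptions made for illustration, and any store that can distinguish "provisioned" from "filled" rows would work just as well.

Sketch (Python + SQLite):
import sqlite3

conn = sqlite3.connect("crawl.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS pages (
        url     TEXT PRIMARY KEY,
        site_id INTEGER NOT NULL,
        content TEXT              -- NULL until the page is scraped
    )
""")

def provisionRow(link, siteId):
    # Placeholder row; duplicate links are ignored, so re-crawls are safe.
    conn.execute("INSERT OR IGNORE INTO pages (url, site_id) VALUES (?, ?)",
                 (link, siteId))
    conn.commit()

def scanForEmptyRows(siteId):
    # Empty rows = provisioned but never filled: the leftover frontier
    # from an interrupted run. Returns None when there is nothing to resume.
    rows = conn.execute(
        "SELECT url FROM pages WHERE site_id = ? AND content IS NULL",
        (siteId,)).fetchall()
    return [url for (url,) in rows] or None

def fillRow(link, content):
    # Hypothetical helper: called once the page has actually been scraped.
    conn.execute("UPDATE pages SET content = ? WHERE url = ?",
                 (content, link))
    conn.commit()

Using the URL as the primary key keeps provisioning idempotent: revisiting a page simply hits the INSERT OR IGNORE and moves on.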