As we crawl, we should provision an empty row for each future URL to scrape and then fill it in once we have the content. That way, if the crawler fails partway through, it knows exactly where to pick back up.
Pseudo-code:
def startup(siteId):
    # Resume from rows that were provisioned but never filled;
    # otherwise fall back to the site's configured start URL.
    startUrls = scanForEmptyRows(siteId) or [getStartUrl(siteId)]
    for url in startUrls:
        addLinkToQueue(url)

def parse(response):
    nextLinks = getNextLinks(response)
    for link in nextLinks:
        # Provision the empty row before queueing, so a crash here
        # still leaves a record of what remains to be scraped.
        provisionRow(link, siteId)
        addLinkToQueue(link)
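For concreteness, here is one way the row-provisioning helpers could be backed by SQLite. This is a minimal sketch: the pages table, its columns, and the fillRow helper are assumptions made for illustration, and any store that can distinguish "provisioned" from "filled" rows would work just as well.

Sketch (Python + SQLite):
import sqlite3

conn = sqlite3.connect("crawl.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS pages (
        url     TEXT PRIMARY KEY,
        site_id INTEGER NOT NULL,
        content TEXT              -- NULL until the page is scraped
    )
""")

def provisionRow(link, siteId):
    # Placeholder row; duplicate links are ignored, so re-crawls are safe.
    conn.execute("INSERT OR IGNORE INTO pages (url, site_id) VALUES (?, ?)",
                 (link, siteId))
    conn.commit()

def scanForEmptyRows(siteId):
    # Empty rows = provisioned but never filled: the leftover frontier
    # from an interrupted run. Returns None when there is nothing to resume.
    rows = conn.execute(
        "SELECT url FROM pages WHERE site_id = ? AND content IS NULL",
        (siteId,)).fetchall()
    return [url for (url,) in rows] or None

def fillRow(link, content):
    # Hypothetical helper: called once the page has actually been scraped.
    conn.execute("UPDATE pages SET content = ? WHERE url = ?",
                 (content, link))
    conn.commit()

Using the URL as the primary key keeps provisioning idempotent: revisiting a page simply hits the INSERT OR IGNORE and moves on.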