andykais / scrape-pages

generalized scraper using a single instruction set for any site that can be statically scraped
https://scrape-pages.js.org
MIT License
6 stars 2 forks

add `preventDuplicates` flag to both `parse` and `download` #11

Open andykais opened 5 years ago

andykais commented 5 years ago

Unfortunately there is no easy atomic way to prevent duplicates at the database level (e.g. 10 downloads scheduled at the same time, all for the same url), so we need to keep this state in memory.

Note that this means memory usage grows linearly with the number of downloads/parses that happen. It is important to leave a note about this for the user.

```ts
class ScrapeStep {
  // values are downloads.url
  private inMemoryDownloadDeduper: Set<string> = new Set()
  // values are parsedTree.parsedValue
  private inMemoryParseDeduper: Set<string> = new Set()
}

class AbstractDownloader {
  checkIfDuplicate: (downloadData: DownloadData) => boolean
}

class AbstractParser {
  run = (...) => {
    ...
    // keep only values we have not seen before
    return parsedValues.filter(v => !this.inMemoryParseDeduper.has(v))
  }
}
```
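To make the idea concrete, here is a minimal runnable sketch of the in-memory dedup step. The class and method names (`InMemoryDeduper`, `filterNew`) are illustrative placeholders, not the library's actual API; the point is that within-batch duplicates must be handled too, so each value is recorded as it is accepted rather than filtering against the set in one pass.

```javascript
// Illustrative sketch of a `preventDuplicates`-style deduper (hypothetical names).
class InMemoryDeduper {
  constructor() {
    // seen values; grows linearly with the number of downloads/parses
    this.seen = new Set()
  }

  // Return only values not seen before, recording each accepted value
  // immediately so duplicates inside the same batch are also dropped.
  filterNew(values) {
    const fresh = []
    for (const v of values) {
      if (!this.seen.has(v)) {
        this.seen.add(v)
        fresh.push(v)
      }
    }
    return fresh
  }
}

const deduper = new InMemoryDeduper()
const first = deduper.filterNew(['a.com/img.png', 'b.com/img.png', 'a.com/img.png'])
const second = deduper.filterNew(['b.com/img.png', 'c.com/img.png'])
```

Here `first` keeps only the two distinct urls and `second` keeps only the previously unseen one, which mirrors how `inMemoryDownloadDeduper`/`inMemoryParseDeduper` would be consulted before scheduling work.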