Unfortunately, the database offers no easy atomic way to prevent duplicates (e.g. 10 downloads scheduled at the same time, all for the same URL), so we deduplicate with in-memory state instead.
Note that memory use therefore grows linearly with the number of distinct download URLs and parsed values that are recorded. It is important to leave a note about this for the user.
class ScrapeStep {
  // values are downloads.url
  private inMemoryDownloadDeduper: Set<string> = new Set()
  // values are parsedTree.parsedValue
  private inMemoryParseDeduper: Set<string> = new Set()
}
class AbstractDownloader {
  checkIfDuplicate: (downloadData: DownloadData) => boolean
}
class AbstractParser {
  run = (...) => {
    ...
    // keep only the values that have not been seen before
    return parsedValues.filter(v => !inMemoryDeduper.has(v))
  }
}
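The in-memory approach can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: `InMemoryDeduper`, `filterNew`, `size`, and the example URL are hypothetical names introduced here, and it assumes all scheduling happens on a single JavaScript thread, where a synchronous check-then-insert cannot be interleaved with other callbacks.

```typescript
// Hypothetical helper: class and method names are illustrative only.
class InMemoryDeduper {
  // One entry per distinct key, so memory grows linearly with the number of unique keys.
  private readonly seen = new Set<string>()

  // Returns true if the key was already recorded; otherwise records it and returns false.
  // There is no await between the check and the insert, so the pair behaves atomically
  // on a single JavaScript thread.
  checkIfDuplicate(key: string): boolean {
    if (this.seen.has(key)) return true
    this.seen.add(key)
    return false
  }

  // Keeps only values not seen before, including repeats within the same batch.
  filterNew(values: string[]): string[] {
    return values.filter((v) => !this.checkIfDuplicate(v))
  }

  get size(): number {
    return this.seen.size
  }
}

// Ten downloads scheduled at once for the same URL: only the first is not a duplicate.
const downloads = new InMemoryDeduper()
const sameUrl = "https://example.com/page"
const results = Array.from({ length: 10 }, () => downloads.checkIfDuplicate(sameUrl))
const scheduled = results.filter((isDuplicate) => !isDuplicate).length

// Parsed values are filtered against everything seen in earlier batches.
const parses = new InMemoryDeduper()
const firstBatch = parses.filterNew(["a", "b", "a"])
const secondBatch = parses.filterNew(["b", "c"])
```

Combining the check and the insert in one synchronous method is a design choice of this sketch. If the two steps were separated by an asynchronous call, two concurrent tasks could both pass the check before either records its key.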