andykais / scrape-pages

generalized scraper using a single instruction set for any site that can be statically scraped
https://scrape-pages.js.org
MIT License

add option to ignore download errors #14

Open andykais opened 5 years ago

andykais commented 5 years ago

incrementUntil can stop an increment at a failed download, but outside of that there is no way to passively allow download failures (e.g. a site where half of the page links are broken but the other half are still good). This option will fix that. With the allowRequestErrors flag (defaults to false), download failures will not bubble up; instead, they will be logged as warnings and emitted with the '<scraper>:failure' event. A usage sketch follows the config example below.

# config
scrapers:
  index:
    download: ...
    parse:
      selector: '.image'
      attribute: 'src'
  image:
    download:
      urlTemplate: '{{ value }}'
structure:
  scraper: index
  forEach:
    scraper: image

# options
folder: ...
optionsEach:
  image:
    allowRequestErrors: true
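For consumers, the failure event might be handled like this. This is a rough TypeScript sketch, not the library's confirmed API: the scrape entry point, its return shape, and the placeholder URL/folder values are assumptions; only the config shape, the allowRequestErrors flag, and the '<scraper>:failure' event name come from this issue.

import { scrape } from 'scrape-pages' // assumed entry point; the actual export may differ

// the YAML config from above, expressed as a JS object
const config = {
  scrapers: {
    index: {
      download: 'https://example.com/gallery', // placeholder url
      parse: { selector: '.image', attribute: 'src' },
    },
    image: {
      download: { urlTemplate: '{{ value }}' },
    },
  },
  structure: { scraper: 'index', forEach: { scraper: 'image' } },
}

const options = {
  folder: '/tmp/scrape-example', // placeholder folder
  optionsEach: { image: { allowRequestErrors: true } },
}

async function main() {
  // assumed return shape: an `on` function for subscribing to scraper events
  const { on } = await scrape(config, options)

  // '<scraper>:failure' is the event proposed in this issue; 'image' is the
  // scraper configured with allowRequestErrors: true above
  on('image:failure', (error: Error) => {
    // the failed download is logged and skipped instead of aborting the run
    console.warn(`image download failed, continuing: ${error.message}`)
  })
}

main()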

Something to consider: there are also plans to add a retry download option (#22). When used in tandem (e.g. the options contain both allowRequestErrors: true and retry: { limit: 5 }), the scraper will keep retrying until the limit is reached (5 in this case) and then log a warning. Without allowRequestErrors: true, an error will be thrown once the retry limit is reached.
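For illustration, the combined options might look like the sketch below. The retry shape follows the #22 proposal and is not final; the folder value is a placeholder.

// hypothetical combined options, assuming the retry option from #22 lands
const optionsWithRetry = {
  folder: '/tmp/scrape-example', // placeholder folder
  optionsEach: {
    image: {
      allowRequestErrors: true, // downgrade failures to warnings + ':failure' events
      retry: { limit: 5 },      // proposed in #22: retry up to 5 times before giving up
    },
  },
}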