Fixed URL crawl/save filtering. If multiple filters are supplied, a URL that matches any one of them is treated as a match. Explicit rejection of URLs is still possible using the full UrlFilter syntax, e.g., `crawl: { property: 'hostname', glob: '*foo.com', reject: true }`.
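A minimal sketch of how several filters might sit side by side; the filter-object syntax is taken from the changelog entry above, while the surrounding shape (a `crawl` key holding an array) is an assumption for illustration:

```typescript
// Hedged sketch: the individual filter objects use the UrlFilter syntax
// from the changelog; the surrounding structure is assumed.
const urlFilters = {
  crawl: [
    // Any match counts: URLs on either hostname are treated as matches...
    { property: 'hostname', glob: '*.example.com' },
    { property: 'hostname', glob: '*.example.org' },
    // ...and this filter explicitly rejects matching URLs.
    { property: 'pathname', glob: '/private/*', reject: true },
  ],
};
```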
Added a `collapseSearchParams` normalizer option, so malformed URL search param values like `page=1?page=2?page=3` can be collapsed to the last value in the list. The config value should be a glob pattern matching search param keys, e.g., `'name'` or `'{name,id,search}'`.
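The collapsing behavior can be illustrated with a small standalone sketch. This is not Spidergram's implementation (the real normalizer takes a glob of keys, as described above); it just shows the described result for a single, already-parsed key/value pair:

```typescript
// Illustrative sketch only: given a mangled query value such as the
// "page" param of "?page=1?page=2?page=3", keep only the last value.
function collapseToLastValue(key: string, rawValue: string): string {
  // The mangled value looks like "1?page=2?page=3"; split on the
  // embedded "?key=" fragments and keep the final segment.
  const parts = rawValue.split(`?${key}=`);
  return parts[parts.length - 1];
}

console.log(collapseToLastValue('page', '1?page=2?page=3')); // "3"
```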
Added support for stealth crawling; setting `spider.stealth` to `true` in the Spidergram config will use the `playwright-extras` plugin to mask the crawler's identity. This feature is experimental and off by default; some pages currently crash the spider when it is enabled, requiring repeated restarts of the crawler to finish a site.
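A hedged sketch of enabling the option; the `spider.stealth` key comes from the changelog entry above, while the config file layout (a default-exported object) is an assumption about your project setup:

```typescript
// spidergram.config.ts (sketch; file layout assumed)
export default {
  spider: {
    // Experimental: masks the crawler's identity via playwright-extras.
    // Off by default; some pages may crash the spider while enabled.
    stealth: true,
  },
};
```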
Added a `delete` CLI command that removes crawl records and their dependent relationships. It uses the same filtering syntax as the `query` CLI command, but is obviously much more dangerous. Running `query` first, then `delete`-ing once you're sure of the results, is strongly recommended. This is particularly useful when you'd like to 'forget' and re-crawl a set of pages. In the future we'll add support for explicit recrawling without this dangerous step, but for now it's quite handy.
Updated `@axe-core/playwright` to version 4.7.1.