codelibs / elasticsearch-river-web

Web Crawler for Elasticsearch
Apache License 2.0

URL with Parameters #99

Closed: marvink closed this issue 9 years ago

marvink commented 9 years ago

Hi, I'm trying to ignore all URLs that carry certain query parameters used for pagination, e.g. sort=, q=, page=, pageSize=. All of these URLs contain a query string, so maybe I can simply ignore every URL that has one.

How can I handle this? Which wildcards can I use in includeFilter and excludeFilter?

"includeFilter" : ["http://www.url.ch/outdoor/."], "excludeFilter" : ["http://www.url.ch/outdoor/.?.*"],

marevol commented 9 years ago

includeFilter and excludeFilter treat each value as a Java regular expression, so a literal ? needs to be escaped.
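
To illustrate the point about escaping, here is a minimal Java sketch of how such patterns behave under java.util.regex. It is not the river's actual filter code: the class FilterSketch, the accepted helper, the sample URLs, and the use of a whole-string match (matches()) are assumptions made for the example. The dots in the host name are escaped as well, which the thread does not ask for.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of elasticsearch-river-web.
class FilterSketch {
    // Include everything below /outdoor/.
    static final Pattern INCLUDE =
            Pattern.compile("http://www\\.url\\.ch/outdoor/.*");

    // Escaped: \? matches a literal question mark, so only URLs
    // that contain a query string match this pattern.
    static final Pattern EXCLUDE_ESCAPED =
            Pattern.compile("http://www\\.url\\.ch/outdoor/.*\\?.*");

    // As posted in the question: the bare ? is a quantifier that makes
    // the preceding "." optional, so this matches every URL below /outdoor/.
    static final Pattern EXCLUDE_UNESCAPED =
            Pattern.compile("http://www.url.ch/outdoor/.?.*");

    // Assumed semantics: a URL is crawled if it matches the include
    // pattern and does not match the exclude pattern (whole-string match).
    static boolean accepted(String url, Pattern exclude) {
        return INCLUDE.matcher(url).matches()
                && !exclude.matcher(url).matches();
    }

    public static void main(String[] args) {
        String plain = "http://www.url.ch/outdoor/jackets";
        String paged = "http://www.url.ch/outdoor/jackets?page=2";

        System.out.println(accepted(plain, EXCLUDE_ESCAPED));   // true
        System.out.println(accepted(paged, EXCLUDE_ESCAPED));   // false
        System.out.println(accepted(plain, EXCLUDE_UNESCAPED)); // false
    }
}
```

In a JSON configuration the backslash itself has to be escaped, so the escaped question mark is written as \\? inside the string value, the same way it appears in the Java string literals above.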