codelibs / elasticsearch-river-web

Web Crawler for Elasticsearch
Apache License 2.0

include_urls doesn't work #130


viktor-svirsky commented 7 years ago

Hi guys,

I have the following config for the crawler:

```json
{
  "_index": ".river_web",
  "_type": "config",
  "_id": "http-fesscodelibsorg_web",
  "_version": 1,
  "found": true,
  "_source": {
    "index": "http-fesscodelibsorg",
    "type": "http-fesscodelibsorg_web",
    "urls": ["http://fess.codelibs.org/"],
    "include_urls": ["http://fess.codelibs.org/11.2/install/.*"],
    "max_depth": 10,
    "max_access_count": 10,
    "num_of_thread": 5,
    "interval": 1000,
    "robots_txt": true,
    "target": [
      {
        "pattern": {
          "url": "http://fess.codelibs.org/.*",
          "mimeType": "text/html"
        },
        "properties": {
          "title": { "text": "title" },
          "body": { "text": "body" }
        }
      }
    ]
  }
}
```

where urls is ["http://fess.codelibs.org/"] and include_urls is ["http://fess.codelibs.org/11.2/install/.*"]. In my understanding, the crawler should start its work from http://fess.codelibs.org/ and index the results that match the http://fess.codelibs.org/11.2/install/.* pattern. However, I get no results.

I checked the site; the pages are present there.

Please advise what I am doing wrong.

marevol commented 7 years ago

http://fess.codelibs.org/11.2/install/.* does not match http://fess.codelibs.org/, so the crawler skips the start URL and never discovers links to the install pages. include_urls needs to contain http://fess.codelibs.org/ as well.
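To illustrate, here is a minimal sketch using java.util.regex directly. The class name is hypothetical, the patterns are copied from the config above, and how the broader pattern interacts with the target section is an assumption based on that config.

```java
import java.util.regex.Pattern;

public class IncludeUrlCheck {
    public static void main(String[] args) {
        // include_urls entries are Java regexes matched against each URL
        // the crawler encounters, including the start URL itself.
        Pattern install = Pattern.compile("http://fess.codelibs.org/11.2/install/.*");

        // The start URL does not match, so the crawler never fetches it
        // and therefore never reaches the install pages.
        System.out.println(install.matcher("http://fess.codelibs.org/").matches()); // false

        // A pattern that covers the start URL (and the intermediate pages
        // leading down to /11.2/install/) lets the crawl proceed.
        Pattern site = Pattern.compile("http://fess.codelibs.org/.*");
        System.out.println(site.matcher("http://fess.codelibs.org/").matches()); // true
    }
}
```

With the broader include pattern the crawler can traverse the whole site; if the goal is to index only the install pages, narrowing the url under target to http://fess.codelibs.org/11.2/install/.* should restrict what gets indexed, assuming the target pattern controls indexing as the config above suggests.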

viktor-svirsky commented 7 years ago

Thanks for the clarification. However, I have an additional question:

Do include_urls and exclude_urls use regexp format, or some special syntax?

I have a case where I need to avoid URLs like https://hostname.com/?printer=1, which is a duplicate of the page https://hostname.com/. Every page has this special argument (printer=1) for printing, and I want to avoid indexing pages of this type. Another example is https://hostname.com/ versus https://hostname.com/index.html: we prefer to exclude the index.html and index.php pages.

Thanks

marevol commented 7 years ago

Java regex format.
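For example, the two cases above could be written as Java regexes like the following; the patterns and class name are illustrative sketches, not taken from the project documentation.

```java
import java.util.regex.Pattern;

public class ExcludeUrlCheck {
    public static void main(String[] args) {
        // Hypothetical exclude_urls entries for the two cases above.
        // A URL carrying a printer=1 query argument anywhere:
        Pattern printer = Pattern.compile("https://hostname\\.com/.*[?&]printer=1.*");
        // A URL ending in index.html or index.php:
        Pattern index = Pattern.compile("https://hostname\\.com/.*index\\.(html|php)$");

        System.out.println(printer.matcher("https://hostname.com/?printer=1").matches()); // true
        System.out.println(printer.matcher("https://hostname.com/").matches());           // false
        System.out.println(index.matcher("https://hostname.com/index.html").matches());   // true
        System.out.println(index.matcher("https://hostname.com/page.html").matches());    // false
    }
}
```

In the river config these strings would go into the exclude_urls array; note that each backslash has to be doubled in the JSON source as well, so the JSON string "https://hostname\\.com/.*" yields the regex https://hostname\.com/.*.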