codelibs / elasticsearch-river-web

Web Crawler for Elasticsearch
Apache License 2.0

ExcludeFilters are sometimes ignored #96

Open LeNightHawk opened 9 years ago

LeNightHawk commented 9 years ago

Hi,

I am writing a crawler that has to index a few thousand documents. To exclude some URL patterns, I use regular expressions (see the configuration below). My problem is that sometimes, while the crawler is running, some of these filters are ignored and unwanted URLs are indexed. It can be a single filter, sometimes two or three; which ones fail seems random, while the others work perfectly. I just want to make my crawling process faster, so I would appreciate any information you can give me about this.

Here is my crawler configuration:

Mapping:

"page" : {
  "dynamic_templates" : [
    { "url" : { "match" : "url", "mapping" : { "type" : "string", "store" : "yes", "index" : "not_analyzed" } } },
    { "method" : { "match" : "method", "mapping" : { "type" : "string", "store" : "yes", "index" : "not_analyzed" } } },
    { "charSet" : { "match" : "charSet", "mapping" : { "type" : "string", "store" : "yes", "index" : "not_analyzed" } } },
    { "mimeType" : { "match" : "mimeType", "mapping" : { "type" : "string", "store" : "yes", "index" : "not_analyzed" } } }
  ]
}

Crawler:

"type" : "web",
"crawl" : {
  "index" : "my_index",
  "type" : "page",
  "url" : ["http://my_website"],
  "includeFilter" : ["http://my_website._"],
  "excludeFilter" : [".do=export.",".do=recent.",".do=backlink.",".do=diff.",".do=media.",".do=login."],
  "overwrite":true,
  "maxDepth" : 5,
  "maxAccessCount" : 50000,
  "numOfThread" : 5,
  "interval" : 100,
  "target" : [
    {
      "pattern" : { "url" : "http://my_website._", "mimeType" : "text/html" },
      "properties" : {
        "title" : { "text" : "title" },
        "body" : { "text" : "body" },
        "bodyAsHtml" : { "html" : "body" }
      }
    }
  ]
}
}

marevol commented 9 years ago

Try the following one:

"excludeFilter" : [".*do=export.*",".*do=recent.*",".*do=backlink.*",".*do=diff.*",".*do=media.*",".*do=login.*"],
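
For readers comparing the two forms of the pattern, here is a minimal sketch of why the surrounding `.*` matters. It assumes the crawler tests each filter against the whole URL (as Java's `Matcher.matches()` does); that assumption is not confirmed anywhere in this thread. The URL and the helper `is_excluded` are hypothetical.

```python
import re

# Assumption: each filter must match the ENTIRE URL, like Java's
# Matcher.matches(). re.fullmatch emulates that behaviour here.
def is_excluded(url, patterns):
    """Return True if any pattern matches the whole URL."""
    return any(re.fullmatch(p, url) for p in patterns)

# Hypothetical URL; only the host "my_website" comes from the thread.
url = "http://my_website/doku.php?id=start&do=export_pdf"

short_patterns = [".do=export."]    # one character on each side only
wide_patterns = [".*do=export.*"]   # any prefix and any suffix

print(is_excluded(url, short_patterns))  # False
print(is_excluded(url, wide_patterns))   # True
```

Under whole-URL matching, `.do=export.` only matches an 11-character string, so it cannot exclude a real URL, whereas `.*do=export.*` matches any URL containing `do=export`.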

LeNightHawk commented 9 years ago

Sorry, it seems I made an error when copying my configuration. The excludeFilter is already:

[screenshot: capture]

The same goes for:

"includeFilter" : ["http://my_website.*"]

and

"url" : "http://my_website.*"

marevol commented 9 years ago

How about changing:

.*do=export.*

to

http://my_website.*do=export.*
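
As a rough check of what this change can and cannot do, the sketch below compares the two patterns on a few sample URLs. It assumes whole-URL matching (an assumption about the crawler, not something stated in this thread); the URLs and the helper `excluded_by` are hypothetical.

```python
import re

# Hypothetical URLs; only the host "my_website" comes from the thread.
urls = [
    "http://my_website/doku.php?id=start",
    "http://my_website/doku.php?id=start&do=export_pdf",
    "http://my_website/doku.php?id=news&do=export_raw",
]

def excluded_by(pattern, candidates):
    """Return the URLs that the pattern matches in full."""
    # Whole-URL matching is an assumption about how filters are applied.
    return [u for u in candidates if re.fullmatch(pattern, u)]

wide = excluded_by(".*do=export.*", urls)
anchored = excluded_by("http://my_website.*do=export.*", urls)

# Both patterns select the same URLs on this host.
print(wide == anchored)  # True
```

If that assumption holds, the two patterns exclude the same pages for any URL that already starts with `http://my_website`, so switching between them would not be expected to change which pages get indexed.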

LeNightHawk commented 9 years ago

I have already tried this, but the problem stays the same. I have found another solution: thanks to the Elasticsearch DELETE API, I can clean my index. Thank you for your plugin and for the time you spent on this problem.
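
The thread does not say how the cleanup was done. One possible approach, sketched here under stated assumptions, is to re-apply the exclude patterns to the indexed URLs and build one document DELETE path per unwanted page. The helper `delete_paths`, the document ids, and the sample URLs are hypothetical; the `/{index}/{type}/{id}` path shape is the document delete endpoint of Elasticsearch versions that still use mapping types. The sketch only builds the paths and sends nothing.

```python
import re

# Exclude patterns as given in the thread.
EXCLUDE = [".*do=export.*", ".*do=recent.*", ".*do=backlink.*",
           ".*do=diff.*", ".*do=media.*", ".*do=login.*"]

def delete_paths(index, doc_type, docs):
    """Return DELETE paths for documents whose URL matches an exclude pattern.

    docs is an iterable of (doc_id, url) pairs (hypothetical input shape).
    """
    compiled = [re.compile(p) for p in EXCLUDE]
    return [
        f"/{index}/{doc_type}/{doc_id}"
        for doc_id, url in docs
        if any(rx.fullmatch(url) for rx in compiled)
    ]

# Hypothetical indexed documents: (id, url).
docs = [
    ("1", "http://my_website/doku.php?id=start"),
    ("2", "http://my_website/doku.php?id=start&do=diff"),
]

print(delete_paths("my_index", "page", docs))  # ['/my_index/page/2']
```

Each returned path would then be sent as an HTTP `DELETE` request to the Elasticsearch node, after checking the list by hand.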