Open LeNightHawk opened 9 years ago
Try the following one:
"excludeFilter" : [".*do=export.*",".*do=recent.*",".*do=backlink.*",".*do=diff.*",".*do=media.*",".*do=login.*"],
Sorry, it seems I made an error when I copied my configuration. excludeFilter is already:
as for:
"includeFilter" : ["http://my_website.*"]
and
"url" : "http://my_website.*"
How about changing:
.*do=export.*
to
http://my_website.*do=export.*
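One way to compare these pattern forms offline, before re-running the crawl, is to test them against a sample URL. The sketch below is in Python and assumes the crawler applies each filter as a full-string match (as Java's `String.matches` does). That is an assumption about the plugin, not something confirmed in this thread. The sample URL and the `is_excluded` helper are hypothetical.

```python
import re

# Three forms of the same exclude filter discussed in this thread.
PATTERNS = {
    "as posted in the config": ".do=export.",
    "unanchored": ".*do=export.*",
    "with site prefix": "http://my_website.*do=export.*",
}


def is_excluded(url, pattern):
    """Return True if the pattern matches the whole URL.

    Assumption: the crawler uses full-string matching, so a pattern
    without leading and trailing `.*` only matches very short strings.
    """
    return re.fullmatch(pattern, url) is not None


# Hypothetical URL of the kind the filters are meant to exclude.
sample_url = "http://my_website/page?do=export_pdf"

for label, pattern in PATTERNS.items():
    print(label, "->", is_excluded(sample_url, pattern))
```

Under that assumption, the pattern as posted in the config does not match the sample URL, while the other two forms both match it, so adding the site prefix should not change which URLs are excluded.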
I have already tried this, but the problem stays the same. I have found another solution: thanks to the Elasticsearch DELETE API, I can clean my index. Thank you for your plugin and for the time you gave me on this problem.
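For readers who want to try the same cleanup, here is a minimal Python sketch of the idea. It assumes the single-document delete endpoint `DELETE /{index}/{type}/{id}` used by Elasticsearch versions that still have mapping types. The thread does not say which form of the DELETE API was actually used. The host, the document IDs, and the `build_delete_requests` helper are hypothetical. The requests are only built, not sent.

```python
import re
import urllib.request
from urllib.parse import quote

ES_HOST = "http://localhost:9200"  # assumption: local Elasticsearch node
INDEX = "my_index"                 # index name from the crawler config
DOC_TYPE = "page"                  # type name from the crawler config

# The same actions the exclude filters are meant to skip.
UNWANTED = re.compile(r"do=(export|recent|backlink|diff|media|login)")


def build_delete_requests(docs):
    """Build one DELETE request per indexed document with an unwanted URL.

    `docs` is an iterable of (doc_id, url) pairs, for example collected
    from a search over the index.
    """
    requests = []
    for doc_id, url in docs:
        if UNWANTED.search(url):
            target = f"{ES_HOST}/{INDEX}/{DOC_TYPE}/{quote(doc_id, safe='')}"
            requests.append(urllib.request.Request(target, method="DELETE"))
    return requests


# Hypothetical documents: only the second one should be removed.
docs = [
    ("1", "http://my_website/page"),
    ("2", "http://my_website/page?do=export_pdf"),
]
reqs = build_delete_requests(docs)
for req in reqs:
    print(req.get_method(), req.full_url)
    # urllib.request.urlopen(req)  # uncomment to actually send the request
```

Keeping the send step commented out makes it possible to review the list of documents before anything is deleted.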
Hi,
I am writing a crawler that has to index a few thousand documents. To exclude some patterns, I use regular expressions (see the configuration below). My problem is that sometimes, while the crawler is running, some of these filters are ignored and unwanted URLs are indexed (it can be only one filter, sometimes two or three; it seems to be random, while the others work perfectly). I just want to make my crawling process faster, so I would appreciate any information you can give me about this.
Here is my crawler configuration:
Mapping:
"page" : {
  "dynamic_templates" : [
    { "url" : { "match" : "url", "mapping" : { "type" : "string", "store" : "yes", "index" : "not_analyzed" } } },
    { "method" : { "match" : "method", "mapping" : { "type" : "string", "store" : "yes", "index" : "not_analyzed" } } },
    { "charSet" : { "match" : "charSet", "mapping" : { "type" : "string", "store" : "yes", "index" : "not_analyzed" } } },
    { "mimeType" : { "match" : "mimeType", "mapping" : { "type" : "string", "store" : "yes", "index" : "not_analyzed" } } }
  ]
}
Crawler:
{
  "type" : "web",
  "crawl" : {
    "index" : "my_index",
    "type" : "page",
    "url" : ["http://my_website"],
    "includeFilter" : ["http://my_website._"],
    "excludeFilter" : [".do=export.",".do=recent.",".do=backlink.",".do=diff.",".do=media.",".do=login."],
    "overwrite" : true,
    "maxDepth" : 5,
    "maxAccessCount" : 50000,
    "numOfThread" : 5,
    "interval" : 100,
    "target" : [
      {
        "pattern" : { "url" : "http://my_website._", "mimeType" : "text/html" },
        "properties" : {
          "title" : { "text" : "title" },
          "body" : { "text" : "body" },
          "bodyAsHtml" : { "html" : "body" }
        }
      }
    ]
  }
}