codelibs / elasticsearch-river-web

Web Crawler for Elasticsearch
Apache License 2.0

Riverweb stops after indexing < 200 pages #94

Closed neilneyman closed 9 years ago

neilneyman commented 9 years ago

Hi:

I'm using River Web 1.5, Elasticsearch 1.5.2, and Java 1.8.0_45. The river runs for a few seconds and then shuts itself down. I have tried adjusting the values of maxDepth and maxAccessCount (and omitting them entirely), as well as numOfThread, with the same results. The index usually ends up with around 150-200 documents for two sites. It looks like it crawls only about 100 documents from each base URL.

The River Web log file shows it was indexing normally, then stopped with no warnings or errors. The Elasticsearch log does not show any warnings or errors during crawling/indexing either.

Here are my index settings and river config.

curl -XPUT 'localhost:9200/my_custom_index' -d '
{  
  "settings":{  
    "index":{  
      "refresh_interval":"1s",
      "number_of_shards":"10",
      "number_of_replicas" : "0"
    }
  },
  "mappings":{  
    "web":{  
      "properties":{  
        "url":{  
          "type":"string",
          "index":"not_analyzed"
        },
        "method":{  
          "type":"string",
          "index":"not_analyzed"
        },
        "charSet":{  
          "type":"string",
          "index":"not_analyzed"
        },
        "mimeType":{  
          "type":"string",
          "index":"not_analyzed"
        }
      }
    }
  }
}'

curl -XPUT 'localhost:9200/.river_web/config/web' -d '{
    "index" : "my_custom_index",
    "type" : "web",
    "url" : ["http://mysite.com/", "http://subdomain.mysite.com/wiki/"],
    "includeFilter" : ["http://mysite.com/.*", "http://subdomain.mysite.com/wiki/.*"],
    "maxDepth" : 500,
    "maxAccessCount" : 300000,
    "numOfThread" : 1,
    "interval" : 1,
    "incremental" : true,
    "overwrite" : true,
    "target" : [
        {
            "pattern" : {
                "url" : "http://mysite.com/.*",
                "mimeType" : "text/html"
            },
            "properties" : {
                "title" : {
                    "text" : "title"
                },
                "body" : {
                    "text" : "body"
                },
                "bodyAsHtml" : {
                    "html" : "body"
                }
            }
        },
        {
            "pattern" : {
                "url" : "http://subdomain.mysite.com/wiki/.*",
                "mimeType" : "text/html"
            },
            "properties" : {
                "title" : {
                    "text" : "title"
                },
                "body" : {
                    "text" : "body"
                },
                "bodyAsHtml" : {
                    "html" : "body"
                }
            }
        }
    ]
}'
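
To confirm the "about 100 documents from each base URL" observation, one option is to count the indexed documents per crawl root. The following is a minimal sketch, not part of River Web: it assumes Elasticsearch is reachable at localhost:9200 as in the curl commands above, and it relies on the `url` field being `not_analyzed` so that a prefix query matches the raw URL. The helper names `build_count_request` and `count_documents` are hypothetical.

```python
import json
import urllib.request

ES_BASE = "http://localhost:9200"  # host/port as used in the curl commands above
INDEX = "my_custom_index"          # index name from the settings above
DOC_TYPE = "web"                   # mapping type from the settings above


def build_count_request(url_prefix, es_base=ES_BASE, index=INDEX, doc_type=DOC_TYPE):
    """Hypothetical helper: return (endpoint, JSON body) for a _count request
    that matches documents whose `url` field starts with `url_prefix`."""
    endpoint = "{}/{}/{}/_count".format(es_base, index, doc_type)
    body = {"query": {"prefix": {"url": url_prefix}}}
    return endpoint, json.dumps(body)


def count_documents(url_prefix):
    """Hypothetical helper: send the _count request and return the count.
    Requires a running Elasticsearch node at ES_BASE."""
    endpoint, body = build_count_request(url_prefix)
    request = urllib.request.Request(
        endpoint,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))["count"]
```

Calling `count_documents("http://mysite.com/")` and `count_documents("http://subdomain.mysite.com/wiki/")` after the river stops would show how many pages were indexed for each site.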

marevol commented 9 years ago

Could you replace it with River Web 1.5.1?

c4mden commented 9 years ago

I was facing a similar problem this week, but it appears that version 1.5.1 has resolved it. Thank you!

silvsinaga commented 9 years ago

Why doesn't it collect some of the documents?

marevol commented 9 years ago

It's just a bug... 1.5.1 is a bug-fix release.

neilneyman commented 9 years ago

Thanks! I haven't had a chance to try it because, for the moment, we've downgraded to ES 1.4.x for other reasons, but thanks for fixing it! I'll let you know whether it works when we upgrade again.

neilneyman commented 9 years ago

Looks like this is working with 1.5.1 now, thanks.