codelibs / elasticsearch-river-web

Web Crawler for Elasticsearch
Apache License 2.0

Incremental crawl error #78

Open oneshot-nc opened 9 years ago

oneshot-nc commented 9 years ago

Hi.

After several successful crawls without incremental crawling enabled, I decided to try the "incremental crawling" option.

I have crawled both with and without a mapping instruction for the lastModified field, and every time I get this error:

[2014-12-18 22:42:13,193][DEBUG][action.search.type       ] [Crowfunding / Node / 001] [webcrawler_index][0], node[mJztTpxTQCeR5mrk8rMNww], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@656f1cec] lastShard [true]
org.elasticsearch.search.SearchParseException: [webcrawler_index][0]: query[url:http://www.[...].nc/discover/69/],from[0],size[1]: Parse Failure [Failed to parse source [{"from":0,"size":1,"query":{"term":{"url":"http://www.[...].nc//discover/69/"}},"sort":[{"lastModified":{"order":"desc"}}]}]]
        at org.elasticsearch.search.SearchService.parseSource(SearchService.java:681)
        at org.elasticsearch.search.SearchService.createContext(SearchService.java:537)
        at org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:509)
        at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:264)
        at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:231)
        at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:228)
        at org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:559)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.search.SearchParseException: [webcrawler_index][0]: query[url:http://www.[...].nc/discover/69/],from[0],size[1]: Parse Failure [No mapping found for [lastModified] in order to sort on]
        at org.elasticsearch.search.sort.SortParseElement.addSortField(SortParseElement.java:210)
        at org.elasticsearch.search.sort.SortParseElement.addCompoundSortField(SortParseElement.java:184)
        at org.elasticsearch.search.sort.SortParseElement.parse(SortParseElement.java:86)
        at org.elasticsearch.search.SearchService.parseSource(SearchService.java:665)
        ... 9 more

Any ideas?

Thanks from New Caledonia

marevol commented 9 years ago

Please check the mapping for the index.

oneshot-nc commented 9 years ago

With the lastModified field:

{
    "type" : "web",
    "crawl" : {
        "index" : "webcrawler_index",
        "url" : [
            "http://www.[...].nc/discover/"
        ],
        "includeFilter" : [
            "http://www.[...].nc/discover/[0-9]+/",
            "http://www.[...].nc/[^/\\?]+/"
        ],
        "maxDepth" : 4,
        "maxAccessCount" : 1000000,
        "numOfThread" : 30,
        "incremental" : true,
        "overwrite" : true,
        "userAgent" : "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36",
        "interval" : 5000,
        "target" : [
            {
                "pattern" : {
                    "url" : "http://www.[...].nc/[^(discover/)][^/\\?]+/",
                    "mimeType" : "text/html"
                },
                "properties" : {
                    "platform" : {
                        "type" : "string",
                        "index" : "not_analyzed",
                        "value" : "ulule"
                    },
                    "title" : {
                        "type" : "string",
                        "text" : "#project-header h1"
                    },
                    "init" : {
                        "type" : "float",
                        "index" : "not_analyzed",
                        "text" : "#status",
                        "script" : "value = value.replaceAll(\"\\\\D+\", \"\");"
                    },
                    "progress" : {
                        "type" : "float",
                        "index" : "not_analyzed",
                        "text" : "#status .progress",
                        "script" : "value = value.replaceAll(\"\\\\D+\", \"\");"
                    },
                    "description" : {
                        "type" : "string",
                        "text" : "#description"
                    },
                    "lastModified" : {
                        "type" : "date",
                        "format" : "dateOptionalTime"
                    }
                }
            }
        ]
    },
    "schedule" : {
        "cron" : "* */20 * * * ?"
    }
}

Without:

{
    "type" : "web",
    "crawl" : {
        "index" : "webcrawler_index",
        "url" : [
            "http://www.[...].nc/discover/"
        ],
        "includeFilter" : [
            "http://www.[...].nc/discover/[0-9]+/",
            "http://www.[...].nc/[^/\\?]+/"
        ],
        "maxDepth" : 4,
        "maxAccessCount" : 1000000,
        "numOfThread" : 30,
        "incremental" : true,
        "overwrite" : true,
        "userAgent" : "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36",
        "interval" : 5000,
        "target" : [
            {
                "pattern" : {
                    "url" : "http://www.[...].nc/[^(discover/)][^/\\?]+/",
                    "mimeType" : "text/html"
                },
                "properties" : {
                    "platform" : {
                        "type" : "string",
                        "index" : "not_analyzed",
                        "value" : "ulule"
                    },
                    "title" : {
                        "type" : "string",
                        "text" : "#project-header h1"
                    },
                    "init" : {
                        "type" : "float",
                        "index" : "not_analyzed",
                        "text" : "#status",
                        "script" : "value = value.replaceAll(\"\\\\D+\", \"\");"
                    },
                    "progress" : {
                        "type" : "float",
                        "index" : "not_analyzed",
                        "text" : "#status .progress",
                        "script" : "value = value.replaceAll(\"\\\\D+\", \"\");"
                    },
                    "description" : {
                        "type" : "string",
                        "text" : "#description"
                    }
                }
            }
        ]
    },
    "schedule" : {
        "cron" : "* */20 * * * ?"
    }
}

marevol commented 9 years ago

That is NOT the mapping for the index; that is the river configuration. See http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-get-mapping.html
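
For example, to dump the actual index mapping (assuming Elasticsearch on localhost:9200 and the index name from the config above):

curl -XGET 'localhost:9200/webcrawler_index/_mapping?pretty'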

oneshot-nc commented 9 years ago

I have a dynamic template:

{
  "website" : {
    "dynamic_templates" : [
      {
        "url" : {
          "match" : "url",
          "mapping" : {
            "type" : "string",
            "store" : "yes",
            "index" : "not_analyzed"
          }
        }
      },
      {
        "method" : {
          "match" : "method",
          "mapping" : {
            "type" : "string",
            "store" : "yes",
            "index" : "not_analyzed"
          }
        }
      },
      {
        "charSet" : {
          "match" : "charSet",
          "mapping" : {
            "type" : "string",
            "store" : "yes",
            "index" : "not_analyzed"
          }
        }
      },
      {
        "mimeType" : {
          "match" : "mimeType",
          "mapping" : {
            "type" : "string",
            "store" : "yes",
            "index" : "not_analyzed"
          }
        }
      },
      {
        "lastModified" : {
          "match" : "lastModified",
          "mapping" : {
            "type" : "date",
        "dateOptionalTime",
            "store" : "yes"
          }
        }
      }
    ]
  }
}
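
Note that a dynamic template only produces a concrete lastModified mapping once a document containing that field has actually been indexed. Whether one exists can be checked with the field mapping API (default host and port assumed):

curl -XGET 'localhost:9200/webcrawler_index/_mapping/field/lastModified?pretty'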

anfinil commented 9 years ago

I have the same error, and my mapping has a lastModified field:

$ curl -XGET "0.0.0.0:9200/webindex/my_web/_mapping?pretty"
{
  "webindex" : {
    "mappings" : {
      "my_web" : {
        "dynamic_templates" : [ {
          "url" : {
            "mapping" : {
              "type" : "string",
              "store" : "yes",
              "index" : "not_analyzed"
            },
            "match" : "url"
          }
        }, {
          "method" : {
            "mapping" : {
              "type" : "string",
              "store" : "yes",
              "index" : "not_analyzed"
            },
            "match" : "method"
          }
        }, {
          "charSet" : {
            "mapping" : {
              "type" : "string",
              "store" : "yes",
              "index" : "not_analyzed"
            },
            "match" : "charSet"
          }
        }, {
          "mimeType" : {
            "mapping" : {
              "type" : "string",
              "store" : "yes",
              "index" : "not_analyzed"
            },
            "match" : "mimeType"
          }
        }, {
          "lastModified" : {
            "mapping" : {
              "type" : "date",
              "store" : "yes",
              "index" : "not_analyzed"
            },
            "match" : "lastModified"
          }
        } ],
        "properties" : {
          "@timestamp" : {
            "type" : "date",
            "format" : "dateOptionalTime"
          },
          "body" : {
            "type" : "string"
          },
          "bodyAsHtml" : {
            "type" : "string"
          },
          "charSet" : {
            "type" : "string",
            "index" : "not_analyzed",
            "store" : true
          },
          "contentLength" : {
            "type" : "long"
          },
          "executionTime" : {
            "type" : "long"
          },
          "httpStatusCode" : {
            "type" : "long"
          },
          "hubs" : {
            "type" : "string"
          },
          "method" : {
            "type" : "string",
            "index" : "not_analyzed",
            "store" : true
          },
          "mimeType" : {
            "type" : "string",
            "index" : "not_analyzed",
            "store" : true
          },
          "parentUrl" : {
            "type" : "string"
          },
          "title" : {
            "type" : "string"
          },
          "url" : {
            "type" : "string",
            "index" : "not_analyzed",
            "store" : true
          }
        }
      }
    }
  }
}

anfinil commented 9 years ago

Also, I looked at the robot index. It also has a lastModified mapping, but the property is always blank.
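
For reference, the crawler's internal index can be dumped the same way (assuming it really is named robot and lives on the same node):

$ curl -XGET "0.0.0.0:9200/robot/_mapping?pretty"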

ducktype commented 9 years ago

For me, the cause of [No mapping found for [lastModified] in order to sort on] was that I had added lastModified to the "dynamic_templates" section of the mappings (following the tutorial), when it seems to be needed in the "properties" section: a dynamic template only creates a concrete field mapping once a document containing the field is indexed, and since the crawler never sets lastModified, there is nothing to sort on.

Setting the following mapping, I get no errors:

curl -XPUT '/index1' -d '{
    "mappings": {
        "type1": {
            "properties": {
                "url": {
                    "type": "string",
                    "store": "yes",
                    "index": "not_analyzed"
                },
                "method": {
                    "type": "string",
                    "store": "yes",
                    "index": "not_analyzed"
                },
                "charSet": {
                    "type": "string",
                    "store": "yes",
                    "index": "not_analyzed"
                },
                "mimeType": {
                    "type": "string",
                    "store": "yes",
                    "index": "not_analyzed"
                },
                "lastModified": {
                    "type": "date",
                    "store": "yes",
                    "index": "analyzed"
                }
            }
        }
    }
}'

You can check the actual lastModified field mapping with: curl -XGET '/index1/_mapping/field/lastModified'

But the actual problem remains: the crawler never sets the value of that field in the documents and does not incrementally crawl the source domains (overwriting works, and URLs are not duplicated).

Can someone explain how incremental crawling works?
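
Judging from the stack trace at the top of this issue, the incremental check seems to run a query like the following for each URL before fetching it, which is why a sortable lastModified mapping is required (this is a reconstruction from the log, not from the plugin source):

curl -XGET '/index1/_search?pretty' -d '{
    "from": 0,
    "size": 1,
    "query": {
        "term": {
            "url": "https://www.4chan.org/"
        }
    },
    "sort": [
        {
            "lastModified": {
                "order": "desc"
            }
        }
    ]
}'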

For example, with the following crawling config:

curl -XPUT '/_river/crawler1/_meta' -d '{
    "schedule": {
        "cron": "* *\/5 * * * ?"
    },
    "type": "web",
    "crawl": {
        "index": "index1",
        "type": "type1",
        "url": [
            "https:\/\/www.4chan.org\/"
        ],
        "includeFilter": [
            "https:\/\/www.4chan.org\/.*"
        ],
        "maxDepth": 3,
        "maxAccessCount": 10,
        "numOfThread": 1,
        "interval": 1000,
        "incremental": true,
        "overwrite": true,
        "target": [
            {
                "pattern": {
                    "url": ".*",
                    "mimeType": "text\/html"
                },
                "properties": {
                    "title": {
                        "text": "title"
                    }
                }
            }
        ]
    }
}'

I expected that every 5 minutes ("cron": "* *\/5 * * * ?") 10 new documents ("maxAccessCount": 10) would be indexed, but instead it always re-indexes the same first 10 URLs, starting from the first domain ("url": [ "https:\/\/www.4chan.org\/" ]).

Does anyone know how this incremental mode is intended to work?

Without some kind of "incremental" mode, the uses of this plugin are very limited!

brianvoss commented 9 years ago

+1 to @ducktype's comment. Incremental crawling is crucial for crawling any medium to large site. I see a lot of guesswork and no clear instructions.

@marevol - can you provide guidance on this feature? The current instructions mention only the 'url' property being set to index: not_analyzed, which is clearly not sufficient.

jtandalaiLSID commented 9 years ago

+1 to @ducktype. We need more details on getting incremental crawling to work.