apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm
https://stormcrawler.apache.org/
Apache License 2.0

ESSeedInjector topology does not index seeds into Elasticsearch 7.0.1 #730

Closed pgg-are-my-initials closed 5 years ago

pgg-are-my-initials commented 5 years ago

The used environment:

The error:

7183 [Thread-22-metricscom.digitalpebble.stormcrawler.elasticsearch.metrics.MetricsConsumer-executor[2 2]] INFO o.a.s.d.executor - Prepared bolt metricscom.digitalpebble.stormcrawler.elasticsearch.metrics.MetricsConsumer:(2)
12261 [I/O dispatcher 2] WARN o.e.c.RestClient - request [POST http://localhost:9200/_bulk?timeout=1m] returned 1 warnings: [299 Elasticsearch-7.0.1-e4efcb5 "[types removal] Specifying types in bulk requests is deprecated."]
12343 [I/O dispatcher 2] ERROR c.d.s.e.p.StatusUpdaterBolt - update ID 9005907c6abf5883088107da21dd92f62273c2d8ee584099ac335a9da723d28d, failure: {"index":"status","type":"status","id":"9005907c6abf5883088107da21dd92f62273c2d8ee584099ac335a9da723d28d","cause":{"type":"exception","reason":"Elasticsearch exception [type=illegal_argument_exception, reason=Rejecting mapping update to [status] as the final mapping would have more than 1 type: [_doc, status]]"},"status":400}
12348 [I/O dispatcher 2] INFO c.d.s.e.p.StatusUpdaterBolt - Bulk response [1] : items 1, waitAck 0, acked 0, failed 1

{"type": "server", "timestamp": "2019-05-22T10:23:44,500+0000", "level": "DEBUG", "component": "o.e.a.b.TransportShardBulkAction", "cluster.name": "docker-cluster", "node.name": "es01", "cluster.uuid": "doc0CjGtQnOQTBJeGavYyQ", "node.id": "xP1kHvj1QNClECd6LMAUeQ", "message": "[status][0] failed to execute bulk item (create) index {[status][status][9005907c6abf5883088107da21dd92f62273c2d8ee584099ac335a9da723d28d], source[{\"url\":\"http://www.theguardian.com/newssitemap.xml\",\"status\":\"DISCOVERED\",\"metadata\":{\"isSitemap\":[\"true\"],\"hostname\":\"www.theguardian.com\"},\"nextFetchDate\":\"2019-05-22T10:23:39.000Z\"}]}" ,
{"type": "server", "timestamp": "2019-05-22T10:22:52,378+0000", "level": "DEBUG", "component": "o.e.a.b.TransportShardBulkAction", "cluster.name": "docker-cluster", "node.name": "es01", "cluster.uuid": "doc0CjGtQnOQTBJeGavYyQ", "node.id": "xP1kHvj1QNClECd6LMAUeQ", "message": "[metrics][0] failed to execute bulk item (index) index {[metrics][datapoint][9WMQ32oBD9QyCgkUqeDN], source[{\"srcComponentId\":\"enqueue\",\"srcTaskId\":3,\"srcWorkerHost\":\"rocco\",\"srcWorkerPort\":1027,\"name\":\"waitAck\",\"value\":0.0,\"timestamp\":\"2019-05-22T10:22:47.720Z\"}]}" , ,
"at java.lang.Thread.run(Thread.java:835) [?:?]"] },
"stacktrace": ["java.lang.IllegalArgumentException: Rejecting mapping update to [status] as the final mapping would have more than 1 type: [_doc, status]",,
"at org.elasticsearch.index.mapper.MapperService.internalMerge(MapperService.java:449) ~[elasticsearch-7.0.1.jar:7.0.1]",,

jnioche commented 5 years ago

Which version of StormCrawler are you using? You need 1.14 in order to use ES7.

sebastian-nagel commented 5 years ago

"Rejecting mapping update to [status] as the final mapping would have more than 1 type: [_doc, status]"

Looks like there is a mixture of StormCrawler 1.14 (doc type "_doc") and 1.13 or below (doc type "status"). @aswencio22222, has the index been upgraded from an older version?
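
For anyone triaging the same failure, here is a minimal sketch (not part of StormCrawler) that pulls the index name and the conflicting doc types out of a bulk failure item like the one StatusUpdaterBolt logged above. The function name `find_type_conflict` and the regular expression are assumptions made for illustration; the sample input is copied from the log in this issue.

```python
import json
import re

# Matches the Elasticsearch 7 rejection message quoted in this thread.
# The pattern is written for illustration and only covers this message shape.
_CONFLICT = re.compile(
    r"Rejecting mapping update to \[(?P<index>[^\]]+)\] as the final mapping "
    r"would have more than 1 type: \[(?P<types>[^\]]+)\]"
)


def find_type_conflict(failure_json):
    """Return (index, [types]) for a mapping-type conflict, else None.

    `failure_json` is the JSON object logged after "failure:" by the bolt.
    """
    item = json.loads(failure_json)
    reason = item.get("cause", {}).get("reason", "")
    match = _CONFLICT.search(reason)
    if match is None:
        return None
    types = [t.strip() for t in match.group("types").split(",")]
    return match.group("index"), types


# Failure item copied from the StatusUpdaterBolt log line in this issue.
failure = (
    '{"index":"status","type":"status",'
    '"id":"9005907c6abf5883088107da21dd92f62273c2d8ee584099ac335a9da723d28d",'
    '"cause":{"type":"exception","reason":"Elasticsearch exception '
    "[type=illegal_argument_exception, reason=Rejecting mapping update to "
    "[status] as the final mapping would have more than 1 type: "
    '[_doc, status]]"},"status":400}'
)

conflict = find_type_conflict(failure)
if conflict is not None:
    index, types = conflict
    print(f"index {index!r} is being written with doc types {types}")
```

Seeing both "_doc" and "status" for the same index is the sign that the index and the client disagree on the doc type, which matches the version mix suggested above.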

pgg-are-my-initials commented 5 years ago

OK, it was a version problem, as you pointed out. Thanks to both of you, keep it up!