apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm
https://stormcrawler.apache.org/
Apache License 2.0
887 stars 262 forks

[Elasticsearch] index with parent child relationship #465

Closed: faustro closed this issue 7 years ago

faustro commented 7 years ago

Wish: index documents with parent_child schema: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-parent-field.html

I tried to implement this using 1.5-SNAPSHOT. I created the index with the required mappings for parent_child.

Here is my modified IndexerBolt.java:

        IndexRequestBuilder request = connection.getClient()
                .prepareIndex(indexName, docType).setSource(builder)
                .setId(sha256hex)
                .setParent("1")
                .setRouting("1");

.setRouting works and documents get indexed. If I add .setParent, documents are not indexed.

Can storm send to ES parent and routing parameters?

It might also be useful to have configuration file options for this, e.g. es.indexer.routing and es.indexer.parent.
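
A rough sketch of how such options might be read, assuming the configuration is available as a plain Map. The keys es.indexer.routing and es.indexer.parent are only the names proposed above, not existing StormCrawler settings, and the class and helper names are made up for illustration. The Elasticsearch calls appear only as comments, so the snippet runs without the ES client on the classpath.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class IndexerOptionsSketch {
    // Hypothetical keys, taken from the suggestion above (not real StormCrawler settings)
    static final String ROUTING_KEY = "es.indexer.routing";
    static final String PARENT_KEY = "es.indexer.parent";

    // Returns the trimmed value for the key, or empty if it is missing or blank
    static Optional<String> optionalString(Map<String, Object> conf, String key) {
        Object value = conf.get(key);
        if (value == null) {
            return Optional.empty();
        }
        String s = value.toString().trim();
        return s.isEmpty() ? Optional.empty() : Optional.of(s);
    }

    public static void main(String[] args) {
        Map<String, Object> conf = new HashMap<>();
        conf.put(ROUTING_KEY, "1");

        // In the bolt, the values would only be applied when present, e.g.:
        // optionalString(conf, ROUTING_KEY).ifPresent(request::setRouting);
        // optionalString(conf, PARENT_KEY).ifPresent(request::setParent);
        System.out.println("routing=" + optionalString(conf, ROUTING_KEY).orElse("(unset)"));
        System.out.println("parent=" + optionalString(conf, PARENT_KEY).orElse("(unset)"));
    }
}
```

Treating a missing or blank value as "unset" keeps the default behaviour unchanged for anyone who does not configure these keys.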

jnioche commented 7 years ago

Do the Elasticsearch logs contain any information as to why the documents do not get indexed?

In your snippet above, "1" is probably not a valid document ID; it should be the sha256hex of the parent URL.
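
To illustrate, here is a minimal sketch of deriving such an ID with the JDK only. The parent URL is a made-up example, and the class and method names are hypothetical; the point is just that the value passed to .setParent would be the same SHA-256 hex digest that the parent document received as its own ID.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class ParentIdSketch {
    // Hex-encoded SHA-256 digest of the UTF-8 bytes of the input
    static String sha256Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to an unsigned value so each byte yields exactly two hex digits
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical parent URL, for illustration only
        String parentUrl = "https://example.com/parent-page";
        String parentId = sha256Hex(parentUrl);
        // The child request would then use: .setParent(parentId)
        System.out.println(parentId);
    }
}
```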

> Can storm send to ES parent and routing parameters?

You probably mean StormCrawler rather than Storm, but this is more likely to be a problem with ES itself.

From https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-parent-child.html:

However, if a parent ID is specified, it is used as the routing value instead of the _id. In other words, both the parent and the child use the same routing value—the _id of the parent—and so they are both stored on the same shard.
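
The effect described in that passage can be sketched as follows. Elasticsearch picks a shard as hash(routing) modulo the number of primary shards; the code below uses String.hashCode as a stand-in for the real hash function (an assumption made to keep the example self-contained), and the class and method names are hypothetical.

```java
final class RoutingSketch {
    // Stand-in for the shard selection: hash(routing) % number_of_primary_shards.
    // Elasticsearch uses its own hash function, not String.hashCode.
    static int shardFor(String routingValue, int numPrimaryShards) {
        return Math.floorMod(routingValue.hashCode(), numPrimaryShards);
    }

    // Without a parent the document's own _id is the routing value;
    // with a parent, the parent's _id is used instead.
    static String effectiveRouting(String id, String parentId) {
        return parentId != null ? parentId : id;
    }

    public static void main(String[] args) {
        int shards = 5;
        String parentId = "parent-doc";
        String childId = "child-doc";

        int parentShard = shardFor(effectiveRouting(parentId, null), shards);
        int childShard = shardFor(effectiveRouting(childId, parentId), shards);

        // Both use the parent's _id as routing value, so the shards match
        System.out.println("same shard: " + (parentShard == childShard));
    }
}
```

This is also why a child document has to be indexed (and later fetched) with its parent ID: the ID alone is not enough to locate the shard it lives on.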

If you can do it with curl then there is no reason why you shouldn't be able to do it with SC.

faustro commented 7 years ago

ES has nothing in the log; it is as if the request is not even sent to ES (when I enable .setParent).

“1” in the snippet is just an example and works with setId or setRouting. Adding .setParent is what causes the issue. Could you please confirm you can replicate it? How can I log what is sent from SC to ES (in local mode)?

Thanks for the quick reply!

faustro commented 7 years ago

I had an error in the ES mapping definition for "_parent" that led to strange results: the index was created but the documents were not, hence no logs. Now it is working!

StormCrawler rocks!

jnioche commented 7 years ago

> StormCrawler rocks!

Thanks, glad you like it! Feel free to tell the world about it ;-)