JanusGraph / janusgraph

JanusGraph: an open-source, distributed graph database
https://janusgraph.org

Add ability to batch ElasticSearch updates when reindexing #483

Open jrust opened 7 years ago

jrust commented 7 years ago

After upgrading to ES5 I found that re-indexing a relatively small graph got significantly slower. I traced it down to needing to set index.translog.durability to async. That speeds it up, but it does mean some loss in reliability if there's a problem while reindexing. The ES recommendations on speeding up indexing suggest indexing several documents at once. Currently the ElasticSearch reindex code does use the _bulk endpoint, but it sends a separate update request for each vertex. Has there been any discussion of, or are there plans to, cut down on the number of requests by batching documents this way during reindexing?
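
For anyone wanting to try the workaround, here is a rough sketch of toggling the setting around a reindex through the index settings API. It only builds the requests and does not send them. The host and the index name `janusgraph_example` are placeholders I made up, and `translog_durability_request` is just an illustrative helper, not anything in JanusGraph.

```python
import json
import urllib.request


def translog_durability_request(base_url, index_name, durability):
    """Build (but do not send) a PUT <index>/_settings request.

    "request" is the Elasticsearch default (fsync the translog after every
    request); "async" fsyncs in the background on an interval instead.
    """
    if durability not in ("request", "async"):
        raise ValueError("durability must be 'request' or 'async'")
    body = json.dumps({"index": {"translog": {"durability": durability}}})
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index_name}/_settings",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical host and index name, for illustration only.
relax = translog_durability_request(
    "http://localhost:9200", "janusgraph_example", "async")
restore = translog_durability_request(
    "http://localhost:9200", "janusgraph_example", "request")

# To apply: urllib.request.urlopen(relax) before reindexing,
# then urllib.request.urlopen(restore) once it has finished.
```

Restoring "request" afterwards limits the reduced durability to the reindex window itself.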

davidclement90 commented 7 years ago

I have the same issue. I think we need to change two things to speed this up: 1) drop the index before reindexing, both to delete ghost elements and because ES and Solr otherwise try to update each document, which is slower than adding a new one; 2) add a flush method, as you suggest.

I think we have the same issue in Solr.

davidclement90 commented 7 years ago

I don't know this part of the code well, but you could try enabling batch loading.

jrust commented 7 years ago

A couple things I've noticed:

  1. Re-indexing a non-mixed index that does not use ES is plenty fast. The graph only has about 70k vertices and takes only a few seconds to reindex a regular index, but 10+ minutes for an ES-backed mixed index. The slowness is most evident on Windows, which seems to struggle more with the ES default of fsyncing the translog on every update.
  2. The difficulty I anticipate in adding bulk indexing is the size of the bulk request. The ES docs suggest scaling up exponentially until you hit the maximum throughput, but that seems difficult to do dynamically. So maybe it would just be a setting of how many documents to index at once.
  3. In my case I'm starting with an empty ES index, so I'm not hitting the ghost element issue, but I agree that the ability to drop an index completely would be useful.
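
To make point 2 concrete, here is a minimal sketch of the fixed-batch-size idea: group documents into chunks of a configurable size and build one newline-delimited `_bulk` body per chunk. This is not the JanusGraph code. The function names and the `batch_size` parameter are hypothetical, and it uses the `index` action (which fits my empty-index case) rather than the update requests the reindex code currently sends. On ES 5.x the action line would also need a `_type`, or the type would have to be given in the URL.

```python
import json
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def bulk_body(index_name, docs):
    """Build one NDJSON _bulk body from (doc_id, source) pairs.

    Each document contributes an action line and a source line; the body
    must end with a trailing newline.
    """
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"


def bulk_bodies(index_name, docs, batch_size):
    """Yield one _bulk body per batch of `batch_size` documents."""
    for batch in chunked(docs, batch_size):
        yield bulk_body(index_name, batch)


# Illustrative data: five documents sent in batches of two.
docs = [(str(i), {"name": f"v{i}"}) for i in range(5)]
bodies = list(bulk_bodies("janusgraph_example", docs, batch_size=2))
```

With roughly 70k vertices and a batch size of 1000, this would mean 70 bulk requests carrying many documents each instead of one request per vertex.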