HazyResearch / mindbender

Tools for iterative knowledge base development with DeepDive

Single-threaded jq can be a bottleneck for ES indexing #55

Open alldefector opened 8 years ago

alldefector commented 8 years ago

We could backport this change to parallelize indexing: https://github.com/HazyResearch/mindbender/commit/bc869e855b62104928506d15611fb2329c786b12

It simply uses GNU parallel instead of split. These improvements would be great to backport:

alldefector commented 8 years ago

Here is another indexing speed optimization: https://github.com/HazyResearch/mindbender/commit/8d4169ab6784236f21e3caf7c794830f54b66357

netj commented 8 years ago

Thanks for the suggestions! Yeah, I was anticipating we'd need parallel indexing pretty soon. I've had bad experiences with GNU parallel (it was unstable and bloated, and its CLI changed too much across versions), but I'll backport these soon, maybe using the more familiar xargs or embedding an exact version of parallel.
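To make the idea concrete, here is a minimal Python sketch of the same pattern: cut the input into chunks and convert each chunk in a separate worker process. It is not the code from the linked commits, which use jq with parallel. It assumes the input is JSON lines and that the transformation produces the two-line Elasticsearch bulk format. The index name `my_index`, the `id` field, and the helper names are all hypothetical, and the exact action metadata depends on the ES version.

```python
import json
from concurrent.futures import ProcessPoolExecutor
from itertools import islice


def chunked(lines, size):
    """Yield successive lists of at most `size` items from an iterable."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def to_bulk(chunk, index="my_index"):
    """Turn JSON-lines records into one Elasticsearch bulk request body.

    Each record becomes an action line followed by the document itself.
    The index name and the `id` field are assumptions for illustration.
    """
    out = []
    for line in chunk:
        doc = json.loads(line)
        out.append(json.dumps({"index": {"_index": index, "_id": doc["id"]}}))
        out.append(json.dumps(doc))
    return "\n".join(out) + "\n"


def bulk_bodies(lines, chunk_size=1000, workers=4):
    """Convert chunks in parallel worker processes, preserving input order."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # pool.map hands each chunk to a worker and yields results in order.
        yield from pool.map(to_bulk, chunked(lines, chunk_size))


if __name__ == "__main__":
    records = [json.dumps({"id": i, "text": f"doc {i}"}) for i in range(10)]
    for body in bulk_bodies(records, chunk_size=4, workers=2):
        print(body, end="")
```

Each returned body could then be sent to the bulk endpoint as one request. Chunk size and worker count are the two knobs to tune against the number of available cores.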

Side question: after parallelizing, is there any sign of ES being the new bottleneck? Would adding more nodes to the ES cluster help? keep-elasticsearch-during currently launches an isolated single-node ES server, but we could enhance it and introduce a subcommand like mindbender search join-cluster to make it easy to scale out.

alldefector commented 8 years ago

No, ES seems to have a very flexible thread pool scheme in one node and can saturate all cores. I suspect that even if there is only one shard, it's still able to saturate all cores. If hardware is the bottleneck, then yeah, we could add new node support.

netj commented 8 years ago

I see. Sounds like the cluster size should be decided by the query-time latency requirements.

alldefector commented 8 years ago

Another key performance knob is ES_HEAP_SIZE: https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html

But ES's default heap size is 0.25-1G. We may want to use a different default.
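As a rough illustration of how a different default could be chosen, the sketch below applies the rule of thumb from the linked guide as I read it: give about half of the machine's RAM to the heap and stay below roughly 32 GB. The function names and the 31 GB cap are my own assumptions, not anything mindbender currently does.

```python
def suggested_heap_mb(total_ram_mb, cap_mb=31 * 1024):
    """Half of RAM, capped just under 32 GB (the cap value is an assumption)."""
    return min(total_ram_mb // 2, cap_mb)


def heap_env(total_ram_mb):
    """Build the environment entry, formatted as a JVM-style size such as '4096m'."""
    return {"ES_HEAP_SIZE": f"{suggested_heap_mb(total_ram_mb)}m"}


if __name__ == "__main__":
    # Example: a machine with 16 GB of RAM.
    print(heap_env(16 * 1024))
```

The resulting value could be exported before launching the ES server, with an explicit user-provided ES_HEAP_SIZE taking precedence.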