elastic / stream2es

Stream data into ES (Wikipedia, Twitter, stdin, or other ESes)

High cpu usage #49

Open diadistis opened 9 years ago

diadistis commented 9 years ago

Setup

  1. Latest stream2es (20150720170522978252e) on a server (6 cores / 64GB RAM) separate from the ES cluster
  2. A big (~65GB) file containing one large JSON object per line. There are about 15 million lines/documents, and the average line size is ~4.3k characters

Problem

I'm running:

cat bigfile | stream2es stdin --target http://server:9200/index/type --log debug -w 12

I have tried several different values for --bulk-bytes, -w, -d and -q, but the result is always the same: a constant indexing speed of ~5MB/s, which translates to about 4 hours to import the file. While indexing, the Elasticsearch cluster is heavily under-utilized and the stream2es server has a single core at 100%. I have done extensive testing to rule out network or Elasticsearch performance issues.

Workaround

In the end I ran several stream2es processes in parallel (using parallel rather than -w) to see if that would help:

cat bigfile | parallel -j12 -L5000 --pipe "stream2es stdin --target http://server:9200/index/type"

That helped a lot. Now all 6 cores (12 threads) are at 100% and the indexing time fell from 4 hours to 35 minutes, but the Elasticsearch cluster is still pretty much idle. It seems to me that something in stream2es uses far more CPU than it should.
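
As a rough sanity check on the figures in this report, the arithmetic can be sketched in Python. The inputs are the approximate numbers quoted above; decimal units (1 GB = 1000 MB) and all constant and function names are assumptions made for illustration.

```python
# Approximate figures taken from the report; decimal units are an assumption.
FILE_MB = 65_000          # ~65 GB input file
DOCS = 15_000_000         # ~15 million lines/documents
AVG_LINE_CHARS = 4_300    # ~4.3k characters per line
SINGLE_MB_PER_S = 5       # observed rate with one stream2es process
PARALLEL_MINUTES = 35     # observed wall time with 12 parallel processes


def summarize():
    """Derive durations, rates, and speedup from the reported figures."""
    single_seconds = FILE_MB / SINGLE_MB_PER_S
    parallel_seconds = PARALLEL_MINUTES * 60
    return {
        # Cross-check: line count times average line size vs. reported file size.
        "estimated_file_gb": DOCS * AVG_LINE_CHARS / 1e9,
        "single_hours": single_seconds / 3600,
        "single_docs_per_s": DOCS / single_seconds,
        "parallel_mb_per_s": FILE_MB / parallel_seconds,
        "parallel_docs_per_s": DOCS / parallel_seconds,
        "speedup": single_seconds / parallel_seconds,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the single-process run works out to about 3.6 hours (consistent with the "4 hours" estimate), the parallel run to roughly 31 MB/s, and the speedup to about 6x.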

drewr commented 8 years ago

Thanks for reporting this @diadistis, and sorry for the terrible response time. I've noticed similar behavior and used similar workarounds. I haven't had a chance to profile the internal design to isolate the bottleneck, but I suspect that, at the very least, the single LinkedBlockingQueue that feeds the pipeline is part of it.

I did just push a fix for some extraneous string copying, but it won't speed anything up 8x. If you still have this environment available I'd love to know its effect.
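
To illustrate the kind of design suspected above, here is a minimal Python sketch of a pipeline in which one producer feeds every document through a single shared queue to several bulk workers. It is not stream2es code (stream2es is written in Clojure); `run_pipeline`, `bulk_size`, and `send_bulk` are hypothetical names, and the default `send_bulk` only records batches instead of talking to Elasticsearch.

```python
import queue
import threading

SENTINEL = None  # tells one worker that the input is exhausted


def run_pipeline(lines, workers=4, bulk_size=3, send_bulk=None):
    """Push every line through ONE shared queue to a pool of bulk workers."""
    q = queue.Queue(maxsize=100)   # the single hand-off point for all documents
    sent = []                      # batches recorded by the default sender
    lock = threading.Lock()

    def default_send(batch):
        # Stand-in for a bulk request; just records the batch.
        with lock:
            sent.append(list(batch))

    send = send_bulk or default_send

    def worker():
        batch = []
        while True:
            item = q.get()
            if item is SENTINEL:
                break
            batch.append(item)
            if len(batch) >= bulk_size:
                send(batch)
                batch = []
        if batch:                  # flush the final partial batch
            send(batch)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for line in lines:             # single producer: this loop is serial
        q.put(line)
    for _ in threads:              # one sentinel per worker
        q.put(SENTINEL)
    for t in threads:
        t.join()
    return sent
```

In a layout like this, adding workers only helps while the single producer and the shared queue can keep them fed; once that serial stage saturates one core, extra workers sit idle. Running separate processes, as in the workaround, gives each process its own reader and queue.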