elastic / stream2es

Stream data into ES (Wikipedia, Twitter, stdin, or other ESes)

Really slow import of large batches? #52

Closed Phyks closed 8 years ago

Phyks commented 8 years ago

Hi,

I am trying to use stream2es to import a Wikipedia dump into an Elasticsearch cluster. I am trying to import the subset of Wikipedia articles in the "biology" category (about 300K, bzip2-compressed).

Importing a really small subset (--max-docs 5 or 10) works fine. However, when importing the full batch or a larger subset (--max-docs 1000, for instance), it runs seemingly forever and appears to be stuck (running for more than 30 minutes, with no significant CPU or memory usage).

Do you know if this is normal, or what is happening?

Thanks

drewr commented 8 years ago

--max-docs 1000 shouldn't cause a problem. 500k will probably cause an issue if you're streaming over HTTP. Can you torrent the archive and try a --source /path/to/enwiki-20151002-pages-articles-xml.bz2?

Phyks commented 8 years ago

@drewr Sorry, should have specified that I am already using a downloaded archive, on my disk. The dump I am using can be found at https://pub.phyks.me/tmp/biology.xml.bz2.

drewr commented 8 years ago

Ah, sorry I misunderstood your first post. I think only the enwiki-YYYYMMDD-pages-articles-xml.bz2 archive is supported by the dependency we're using to parse. You may have to write your own parser and output jsonlines to stream2es (or logstash).
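
For anyone taking the "write your own parser" route, here is a minimal sketch in Python. It assumes the subset dump follows the usual MediaWiki export layout (page elements containing a title and a revision/text). The names iter_pages and dump_jsonl and the output field names are made up for illustration; they are not something stream2es requires.

```python
import bz2
import json
import sys
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(fileobj):
    """Yield one dict per <page> element, parsing the dump incrementally."""
    for _event, elem in ET.iterparse(fileobj, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        doc = {"title": None, "text": None}
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                doc["title"] = child.text
            elif name == "text":
                doc["text"] = child.text or ""
        yield doc
        elem.clear()  # release the page's text once it has been emitted


def dump_jsonl(path, out=sys.stdout):
    """Write one JSON object per line for every page in a .xml.bz2 dump."""
    with bz2.open(path, "rb") as dump:
        for page in iter_pages(dump):
            out.write(json.dumps(page) + "\n")
```

Calling dump_jsonl("biology.xml.bz2") would print one JSON document per line, which could then be piped to stream2es reading from stdin (or to logstash).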

Phyks commented 8 years ago

ah… Having a poor internet connection here, I did not want to download the full dump. I will try with it then.

Phyks commented 8 years ago

Seems to be working, closing the issue. Thanks!

Phyks commented 8 years ago

I downloaded the full dataset (latest dump) and ran stream2es on it. It has now been running for 3 or 4 days and everything seems stable (in particular, the document count no longer changes), but the stream2es command is still running and has not returned yet (~/stream2es wiki --source /bireme/enwiki-20150901-pages-articles.xml.bz2).

FWIW, here are the imported documents:

health status index pri rep docs.count docs.deleted store.size pri.store.size
green  open   wiki    2   0   15900279            7     25.1gb         25.1gb

Not sure if it is expected, but it looks like all the articles have been imported.

drewr commented 8 years ago

That's about the right number of docs. Could be a bug trying to shut down threads. If you run it again, try adding --log debug so you can see the bulks as they're indexed.

Phyks commented 8 years ago

Ok, ran it again, and got 15900279 imported documents (6 deleted) as before.

Last line in the debug log is

2015-10-31T00:17:05.684+0100 DEBUG 782:20,246 338,7d/s 662,7K/s 15900292 2256 3166981 0 47679278

and nothing more since.
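
One way to read that last debug line, as a hedged sketch: the field meanings below are a guess from the numbers, not taken from stream2es documentation. The guess is that the third field is elapsed minutes:seconds, the fourth is documents per second, and the sixth is the running document total, with commas as decimal separators. Under that assumption the fields agree with each other, which the snippet checks.

```python
def parse_progress(line):
    """Split a debug progress line into the fields guessed above (assumption)."""
    fields = line.split()
    # fields[0] is the timestamp, fields[1] the log level.
    elapsed, docs_per_s, _kbytes_per_s, total_docs = fields[2:6]
    minutes, seconds = elapsed.split(":")
    elapsed_s = int(minutes) * 60 + float(seconds.replace(",", "."))
    # Drop the trailing "d/s" unit and normalise the decimal comma.
    reported_rate = float(docs_per_s[:-3].replace(",", "."))
    return elapsed_s, reported_rate, int(total_docs)


line = ("2015-10-31T00:17:05.684+0100 DEBUG 782:20,246 338,7d/s 662,7K/s "
        "15900292 2256 3166981 0 47679278")
elapsed_s, reported_rate, total_docs = parse_progress(line)
computed_rate = total_docs / elapsed_s

print(f"elapsed: {elapsed_s / 3600:.1f} h")
print(f"reported {reported_rate} docs/s, computed {computed_rate:.1f} docs/s")
```

If that reading is right, the reported and computed rates match (about 338.7 docs/s) and the indexing itself took roughly 13 hours, with the process sitting idle afterwards.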

funnydevnull commented 8 years ago

Is the expected import time a few days? I've been running it for a while now and am only getting a rate of 6k articles/minute which is consistent with 3-4 days for the full 15m articles. This setup is in no way taxing my system so is there no way to increase the number of threads to improve performance? I tried running with -w 6 but this did not speed things up. Maybe the problem is the number of threads elasticsearch is using to index? I'm unfortunately new to elasticsearch so can't be sure but if someone can comment on how to speed this up I'd appreciate it.
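
A quick back-of-the-envelope check of that estimate, taking the 6k articles/minute and roughly 15m articles from the comment at face value (the variable names are just for illustration):

```python
articles_total = 15_000_000   # "the full 15m articles"
articles_per_minute = 6_000   # rate quoted in the comment

minutes = articles_total / articles_per_minute
hours = minutes / 60
days = hours / 24

print(f"{articles_per_minute / 60:.0f} docs/s -> {hours:.1f} hours ({days:.1f} days)")
```

At a steady 6k/minute this comes out at a bit under two days, so a 3-4 day run would imply a lower average rate than the one quoted.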

This guy (http://blog.trifork.com/2013/09/26/maximum-shard-size-in-elasticsearch/) claimed to index the whole dump in a few hours on a laptop (but not using stream2es)!

BrunoBonacci commented 8 years ago

I haven't tried the Wikipedia dataset, but I'm using stream2es to index large datasets (hundreds of millions up to 2 billion) and it easily does 4K-6K per second per node. Maybe the problem isn't with stream2es but with your Elasticsearch configuration (storage speed, available memory, number of shards, indexing threads, etc.). Did you check these parameters? Some plugins, such as HQ, KOPF and Marvel, give some insight into performance tuning.

Here are some tips for improving indexing performance: https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html

funnydevnull commented 8 years ago

Thanks for the help. I'll go through the performance page. I have to restart indexing because I'm actually getting errors about a missing node, which might be related to me trying to run stream2es with 6 workers while only having one shard? Again, I must admit my complete ignorance of how Elasticsearch works. Are you sure your records are comparable to the Wikipedia articles? It looks to me like stream2es is parsing the anchor links out of the Wikipedia text, which presumably uses a regex, so maybe the slowdown is in the preprocessing, not in Elasticsearch?

BrunoBonacci commented 8 years ago

Hi, the missing node error might simply be because your index configuration expects one or more replica nodes while you are running a single node (or fewer nodes than configured). Document size is certainly a big factor during indexing (mine are 1.3K on average), but many other factors influence overall performance: for example the use of document IDs, the use of the _all field, the number of fields, the size of the fields, and whether they are analyzed or not. That is just for the document itself; there are also many cluster- and node-level properties. I suggest you go through the article to get a better idea.
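
To make the replica point concrete, here is a sketch of the kind of settings change often suggested for a one-off bulk load on a single node: no replicas, and refresh disabled while indexing. index.number_of_replicas and index.refresh_interval are real Elasticsearch index settings; the host, the index name wiki (taken from the _cat output earlier in the thread) and the helper name are assumptions. The request is only built here, not sent.

```python
import json
import urllib.request


def bulk_load_settings_request(host="http://localhost:9200", index="wiki"):
    """Build (but do not send) a PUT /<index>/_settings request.

    Host and index name are placeholders; adjust them to your cluster.
    """
    body = {
        "index": {
            "number_of_replicas": 0,   # a single node cannot host replicas
            "refresh_interval": "-1",  # disable refresh during the import
        }
    }
    return urllib.request.Request(
        url=f"{host}/{index}/_settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = bulk_load_settings_request()
print(req.get_method(), req.full_url)
```

Passing req to urllib.request.urlopen would apply the change; the refresh interval should be restored (for example to "1s") once the import has finished.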
