Hi all,
I have recently replaced my bulk import mechanism (PHP and the bulk API) with the CSV river. So far I have noticed a strange behavior that shows up once the index reaches a certain size (around 10,000,000 docs and ~1.5 GB on disk). While the index is small everything works fine; I have set bulk_size=1000, concurrent_requests=4 and bulk_threshold=10. After a couple of hours, as the index grows, the whole process slows down and the import of .csv files becomes really slow. I checked the Elasticsearch log files and found that the execution cycle (polling time) of the import is interrupted. For instance, here is what I get from the logs:
logs
[2015-02-23 20:08:55,135][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] File has been processed 54eb7eed2bbe9.csv.processing
[2015-02-23 20:08:55,136][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] File 54eb7eed2bbe9.csv.processing, processed lines 2300
[2015-02-23 20:08:55,137][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Processing file 54eb81de7e37f.csv
[2015-02-23 20:08:55,146][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Going to execute new bulk composed of 1000 actions
[2015-02-23 20:09:52,079][ERROR][marvel.agent.exporter ] [Domina] error sending data to [http://[0:0:0:0:0:0:0:0]:9200/.marvel-2015.02.23/_bulk]: [SocketTimeoutException[Read timed out]]
[2015-02-23 20:09:54,170][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Executed bulk composed of 1000 actions
[2015-02-23 20:09:54,286][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Going to execute new bulk composed of 1000 actions
[2015-02-23 20:10:41,762][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Executed bulk composed of 1000 actions
[2015-02-23 20:10:41,911][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Going to execute new bulk composed of 1000 actions
[2015-02-23 20:10:52,411][ERROR][marvel.agent.exporter ] [Domina] error sending data to [http://[0:0:0:0:0:0:0:0]:9200/.marvel-2015.02.23/_bulk]: SocketTimeoutException[Read timed out]
[2015-02-23 20:11:37,582][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Executed bulk composed of 1000 actions
[2015-02-23 20:11:37,758][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] File has been processed 54eb81de7e37f.csv.processing
[2015-02-23 20:11:37,759][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] File 54eb81de7e37f.csv.processing, processed lines 2985
[2015-02-23 20:11:37,759][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Processing file 54eb807bf351c.csv
[2015-02-23 20:11:37,765][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Going to execute new bulk composed of 1000 actions
[2015-02-23 20:12:02,830][ERROR][marvel.agent.exporter ] [Domina] error sending data to [http://[0:0:0:0:0:0:0:0]:9200/.marvel-2015.02.23/_bulk]: SocketTimeoutException[Read timed out]
[2015-02-23 20:12:30,479][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Executed bulk composed of 1000 actions
[2015-02-23 20:12:30,536][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Going to execute new bulk composed of 1000 actions
[2015-02-23 20:13:03,132][ERROR][marvel.agent.exporter ] [Domina] error sending data to [http://[0:0:0:0:0:0:0:0]:9200/.marvel-2015.02.23/_bulk]: [SocketTimeoutException[Read timed out]]
[2015-02-23 20:13:24,458][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Executed bulk composed of 1000 actions
[2015-02-23 20:13:24,581][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Going to execute new bulk composed of 1000 actions
[2015-02-23 20:14:03,423][ERROR][marvel.agent.exporter ] [Domina] error sending data to [http://[0:0:0:0:0:0:0:0]:9200/.marvel-2015.02.23/_bulk]: SocketTimeoutException[Read timed out]
[2015-02-23 20:14:12,914][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Executed bulk composed of 1000 actions
[2015-02-23 20:14:13,010][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] File has been processed 54eb807bf351c.csv.processing
[2015-02-23 20:14:13,010][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] File 54eb807bf351c.csv.processing, processed lines 2924
[2015-02-23 20:14:13,011][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Processing file 54eb7eb509a30.csv
[2015-02-23 20:14:13,032][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Going to execute new bulk composed of 1000 actions
[2015-02-23 20:15:11,204][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Executed bulk composed of 1000 actions
[2015-02-23 20:15:11,311][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Going to execute new bulk composed of 1000 actions
[2015-02-23 20:15:13,741][ERROR][marvel.agent.exporter ] [Domina] error sending data to [http://[0:0:0:0:0:0:0:0]:9200/.marvel-2015.02.23/_bulk]: SocketTimeoutException[Read timed out]
As you can see, the time periods are not consistent, and a single bulk of 1000 actions takes close to a minute. For example, one bulk finishes at 2015-02-23 20:13:24 and the next one only completes at 2015-02-23 20:14:12. Below you can find the CSV river and index settings.
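To put numbers on these gaps, the log can be reduced to one duration per bulk request. The following is only a rough sketch: it assumes the "Going to execute new bulk" and "Executed bulk" lines alternate strictly, as they do in the excerpt above, which would not hold if several bulks were really in flight at once. The function name `bulk_durations` and the `SAMPLE` list are made up for illustration; the log lines themselves are copied from the excerpt.

```python
import re
from datetime import datetime

# Capture the leading timestamp and the river event we care about.
# Other lines (for example the Marvel exporter errors) do not match.
EVENT_RE = re.compile(
    r"^\[([^\]]+)\].*?(Going to execute new bulk|Executed bulk)"
)


def bulk_durations(lines):
    """Return the seconds between each bulk start line and the next
    bulk finish line. Assumes start/finish lines alternate strictly."""
    durations = []
    started = None
    for line in lines:
        match = EVENT_RE.search(line)
        if match is None:
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if match.group(2).startswith("Going"):
            started = stamp
        elif started is not None:
            durations.append((stamp - started).total_seconds())
            started = None
    return durations


# A few lines taken from the log excerpt in this post.
SAMPLE = [
    "[2015-02-23 20:08:55,146][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Going to execute new bulk composed of 1000 actions",
    "[2015-02-23 20:09:52,079][ERROR][marvel.agent.exporter ] [Domina] error sending data to [http://[0:0:0:0:0:0:0:0]:9200/.marvel-2015.02.23/_bulk]: [SocketTimeoutException[Read timed out]]",
    "[2015-02-23 20:09:54,170][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Executed bulk composed of 1000 actions",
    "[2015-02-23 20:09:54,286][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Going to execute new bulk composed of 1000 actions",
    "[2015-02-23 20:10:41,762][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Executed bulk composed of 1000 actions",
]

if __name__ == "__main__":
    for seconds in bulk_durations(SAMPLE):
        print(f"bulk took {seconds:.3f} s")
```

Running the same function over the whole log file (for example `bulk_durations(open(path))`, with `path` pointing at the node's log) would show how the per-bulk time drifts as the index grows.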
CSV River
Index mapping
elasticsearch.yml
index.refresh_interval: 30s
index.translog.flush_threshold_ops: 50000
index.translog.flush_threshold_size: 512mb
indices.fielddata.cache.size: 20%
indices.cache.filter.size: 20%
indices.memory.index_buffer_size: 40%
index.merge.scheduler.max_thread_count: 1
bootstrap.mlockall: true
/etc/sysconfig/elasticsearch
MAX_LOCKED_MEMORY=unlimited
MAX_OPEN_FILES=65535
ES_JAVA_OPTS=-server
ES_HEAP_SIZE=512m
index status
index stats
* Is store.throttle_time_in_millis: 669 considered an important factor? I am asking because I use doc_values in my mapping, so maybe I am pushing my little VM too hard :)
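One way to judge whether a throttle figure like this matters is to compare it with the total time spent indexing, which is reported in the same stats output. The snippet below is a minimal sketch, assuming the index stats response has the usual `indices.<name>.total.store.throttle_time_in_millis` and `indices.<name>.total.indexing.index_time_in_millis` fields. The helper name `throttle_share`, the index name `my_index` and the indexing time are invented for illustration; only the 669 comes from the stats above.

```python
def throttle_share(stats, index):
    """Return store throttle time as a fraction of indexing time for
    one index, given a parsed index stats response (a dict)."""
    total = stats["indices"][index]["total"]
    throttle_ms = total["store"]["throttle_time_in_millis"]
    index_ms = total["indexing"]["index_time_in_millis"]
    # Avoid dividing by zero on an index that has not indexed anything.
    return throttle_ms / index_ms if index_ms else 0.0


# Hypothetical response fragment: 669 is the value from this post,
# the index name and the indexing time are placeholders.
SAMPLE_STATS = {
    "indices": {
        "my_index": {
            "total": {
                "store": {"throttle_time_in_millis": 669},
                "indexing": {"index_time_in_millis": 3_600_000},
            }
        }
    }
}

if __name__ == "__main__":
    share = throttle_share(SAMPLE_STATS, "my_index")
    print(f"throttled for {share:.4%} of indexing time")
```

Both counters are cumulative, so the ratio is only a rough indicator; sampling the stats twice and comparing the deltas would give a better picture of what is happening during the slow period.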
Finally, I did notice some high I/O traffic with iotop.
Here is the sys info
Vagrant
OS: CentOS release 6.6
RAM: 2GB
CPU: Intel(R) Core(TM) i5-2430M CPU @ 2.40GHz (2 cores)
Thanks a lot for your time
Regards, Alex
The proxylab team | http://www.proxylab.io/