Closed: kelmahrsi closed this issue 10 years ago
Hi,
Any errors in the log file? Can you add a unique ID column to the CSV file and to the river configuration?
The id for each row is generated by UUID.randomUUID().toString() by default.
What's your environment? Hardware, OS, JDK?
Bedy
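A side effect of the default random ids worth noting: because a fresh UUID is generated per row on every run, re-importing the same file adds new documents instead of overwriting existing ones. A minimal Python sketch of that behaviour (a dict standing in for the index; this models the idea, not the plugin's actual code):

```python
import uuid

# Hypothetical model: an index as a dict keyed by document id.
index = {}
rows = [{"rental_id": 1}, {"rental_id": 2}]

# Default behaviour: a fresh random UUID per row, per run, so
# re-running the import adds documents rather than overwriting.
for _ in range(2):  # two import runs
    for row in rows:
        index[str(uuid.uuid4())] = row
print(len(index))  # 4 documents after two runs

# With a stable id taken from the data, re-runs overwrite in place.
index.clear()
for _ in range(2):
    for row in rows:
        index[str(row["rental_id"])] = row
print(len(index))  # 2 documents after two runs
```

With a stable id such as a column value, repeated imports become idempotent, which also makes count discrepancies easier to diagnose.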
First of all, thank you for the very useful plugin.
I installed version 2.0.0 and I'm using it to import a CSV file of 495089 entries into Elasticsearch 1.0.0. The curl command I'm issuing to import the data is the following:
curl -XPUT localhost:9200/_river/my_csv_river/_meta -d '
{
  "type" : "csv",
  "csv_file" : {
    "folder" : "/tmp",
    "filename_pattern" : ".*.csv$",
    "poll" : "5m",
    "first_line_is_header" : "true",
    "field_separator" : ";",
    "escape_character" : "\n",
    "quote_character" : "\""
  },
  "index" : {
    "index" : "my_index",
    "type" : "item",
    "bulk_size" : 100,
    "bulk_threshold" : 10
  }
}'
After executing the curl, river-csv indicates that it processed the whole file (all 495089 records). However, Elasticsearch contains only a portion of the data, and that portion varies slightly each time I redo the whole import from scratch. For instance, after my last attempt, Elasticsearch contains only 114237 records out of the original 495089. Is there something wrong that I'm doing that I'm not aware of?
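As a rough sanity check on the parsing side, this Python sketch (an inline sample stands in for the real CSV; it is not part of the plugin) counts records using the same separator and quote settings as the river configuration above. The result can be compared with the document count Elasticsearch reports for the index:

```python
import csv
import io

# Small inline sample with the thread's settings: ';' separator,
# '"' quote character, and a header line. One field contains an
# embedded ';' to show that quoting is respected.
sample = 'rental_id;bike_id;station_name\n1;100;"Hyde Park; North"\n2;101;Soho\n'

reader = csv.reader(io.StringIO(sample), delimiter=';', quotechar='"')
header = next(reader)            # skip the header line
records = sum(1 for _ in reader)
print(records)  # 2
```

Running the same count over the real file and comparing it with the index's document count (e.g. via the _count endpoint) shows how many records went missing.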
Initially I was running Elasticsearch 1.0.0 on an Ubuntu 13.10 VM with 2 GB RAM and JVM 1.7.0_51; then I switched to testing on Mac OS X 10.9.1 with 8 GB RAM and JVM 1.6.0_65 (in both cases the processor is an Intel® Core™ i7-3540M CPU @ 3.00 GHz × 2).
The log file does not mention any errors:
[2014-02-26 18:51:50,804][INFO ][river.csv ] [Stiltzkin] [csv][csv_river] Going to process files /Users/Khalil/Documents/london_bss/data/londonbss.csv
[2014-02-26 18:51:50,804][INFO ][river.csv ] [Stiltzkin] [csv][csv_river] Processing file londonbss.csv
[2014-02-26 18:51:51,044][INFO ][cluster.metadata ] [Stiltzkin] [london_bss] creating index, cause [auto(bulk api)], shards [5]/[1], mappings []
[2014-02-26 18:51:51,967][INFO ][cluster.metadata ] [Stiltzkin] [london_bss] update_mapping [journey_nomapping] (dynamic)
[2014-02-26 18:51:52,297][WARN ][river.csv ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:51:52,482][WARN ][river.csv ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:51:52,698][WARN ][river.csv ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:51:52,999][WARN ][river.csv ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:51:53,099][WARN ][river.csv ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:51:53,309][WARN ][river.csv ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:51:53,404][WARN ][river.csv ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:51:53,483][WARN ][river.csv ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:51:53,641][WARN ][river.csv ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:51:53,747][WARN ][river.csv ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:51:53,941][WARN ][river.csv ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:51:54,092][WARN ][river.csv ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:51:54,195][WARN ][river.csv ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:51:54,313][WARN ][river.csv ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:51:54,415][WARN ][river.csv ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:52:00,126][WARN ][river.csv ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:52:25,187][WARN ][river.csv ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:52:25,283][WARN ][river.csv ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:52:33,634][WARN ][river.csv ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:53:07,025][WARN ][river.csv ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:53:07,058][WARN ][river.csv ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:53:07,092][WARN ][river.csv ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:53:16,043][WARN ][river.csv ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:53:16,056][WARN ][river.csv ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 19:04:58,212][INFO ][cluster.metadata ] [Stiltzkin] [kibana-int] creating index, cause [auto(index api)], shards [5]/[1], mappings []
[2014-02-26 19:04:58,871][INFO ][cluster.metadata ] [Stiltzkin] [kibana-int] update_mapping [dashboard] (dynamic)
[2014-02-26 19:05:43,411][INFO ][river.csv ] [Stiltzkin] [csv][csv_river] File londonbss.csv.processing, processed lines 5061972
[2014-02-26 19:05:43,426][INFO ][river.csv ] [Stiltzkin] [csv][csv_river] next run waiting for 5m
[2014-02-26 19:10:43,416][INFO ][river.csv ] [Stiltzkin] [csv][csv_river] Using configuration: org.elasticsearch.river.csv.Configuration(/Users/Khalil/Documents/london_bss/data, .*\.csv$, true, [rental_id, billable_duration, duration, unique_id_customer_record_number, subscription_id, bike_id, end_date, endstation_id, endstation_logical_terminal, endstation_name, endstationpriority_id, start_date, startstation_id, startstation_logical_terminal, startstation_name, startstationpriority_id, endhourcategory_id, starthourcategory_id, bikeusertype_id, weekday, hourofday, day, hourandday, isweek, hourofdayret, month], 5m, london_bss, journey_nomapping, 100, \, ", ;, -24, 10, id)
[2014-02-26 19:10:43,416][INFO ][river.csv ] [Stiltzkin] [csv][csv_river] Going to process files {}
[2014-02-26 19:10:43,416][INFO ][river.csv ] [Stiltzkin] [csv][csv_river] next run waiting for 5m
[2014-02-26 19:15:43,405][INFO ][river.csv ] [Stiltzkin] [csv][csv_river] Using configuration: org.elasticsearch.river.csv.Configuration(/Users/Khalil/Documents/london_bss/data, .*\.csv$, true, [rental_id, billable_duration, duration, unique_id_customer_record_number, subscription_id, bike_id, end_date, endstation_id, endstation_logical_terminal, endstation_name, endstationpriority_id, start_date, startstation_id, startstation_logical_terminal, startstation_name, startstationpriority_id, endhourcategory_id, starthourcategory_id, bikeusertype_id, weekday, hourofday, day, hourandday, isweek, hourofdayret, month], 5m, london_bss, journey_nomapping, 100, \, ", ;, -24, 10, id)
[2014-02-26 19:15:43,405][INFO ][river.csv ] [Stiltzkin] [csv][csv_river] Going to process files {}
[2014-02-26 19:15:43,405][INFO ][river.csv ] [Stiltzkin] [csv][csv_river] next run waiting for 5m
(this log is from the complete CSV file, which contains more than 5 million records)
I do have a column that can play the role of an id; it's called rental_id (which I map to an integer). Sorry for the stupid question, but how can I indicate that it's the id field in the river configuration?
I updated the documentation.
You must use "field_id" : "rental_id", and either use a header line in the CSV file (set "first_line_is_header" : "true") or fill in the "fields" : [ "column1", "column2", "column3", "column4" ] property.
Bedy
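Putting the answer above together with the settings already used in this thread, the river _meta body would look roughly like this (a sketch based on the maintainer's answer; rental_id is the user's id column, and the other values are taken from the original command):

```json
{
  "type" : "csv",
  "csv_file" : {
    "folder" : "/tmp",
    "filename_pattern" : ".*.csv$",
    "poll" : "5m",
    "first_line_is_header" : "true",
    "field_separator" : ";",
    "field_id" : "rental_id"
  },
  "index" : {
    "index" : "my_index",
    "type" : "item",
    "bulk_size" : 100,
    "bulk_threshold" : 10
  }
}
```

This body is PUT to localhost:9200/_river/my_csv_river/_meta with curl, exactly as in the original command.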
I just did that and it didn't solve the problem (and the log file still doesn't mention the slightest error).
@mahrsi could you provide us with a test CSV file that causes the issue?
@tajzivit you can download a test CSV file (containing the first 250000 records of the original) from here: https://dl.dropboxusercontent.com/u/8280898/data.csv
I tested with the test file and the problem still occurs (only 27462 records imported into Elasticsearch).
Thanks for the example data file.
Please try building the sources from master and running it again. It should be OK now; when an error occurs it will skip the line or file, and you will see it in the log.
Thank you very much! I tried the new version and indeed the file imported correctly!
I got the same issue, but my CSV has 4,600,000 rows and the import stops at around 750,000. I have the latest plugin. I played around with bulk_size: the smaller it is, the more I can import, but the difference is only about ±2000 rows either way. Nevertheless, 750,000 out of 4.2 million is a lot of missing data ;(
Can you provide us with a sample file? Please create a new issue.