AgileWorksOrg / elasticsearch-river-csv

CSV river for ElasticSearch
Apache License 2.0

Incomplete Data #9

Closed kelmahrsi closed 10 years ago

kelmahrsi commented 10 years ago

First of all, thank you for the very useful plugin.

I installed version 2.0.0 and I'm using it to import a CSV file composed of 495089 entries into Elasticsearch 1.0.0. The curl command I'm issuing to import the data is the following:

curl -XPUT localhost:9200/_river/my_csv_river/_meta -d '
{
    "type" : "csv",
    "csv_file" : {
        "folder" : "/tmp",
        "filename_pattern" : ".*\\.csv$",
        "poll":"5m",
        "first_line_is_header":"true",
        "field_separator" : ";",
        "escape_character" : "\n",
        "quote_character" : "\""
    },
    "index" : {
        "index" : "my_index",
        "type" : "item",
        "bulk_size" : 100,
        "bulk_threshold" : 10
    }
}'

After executing the curl command, river-csv indicates that it processed the whole file (all 495089 records). However, Elasticsearch contains only a portion of the data, and that portion varies slightly each time I redo the whole import process from scratch. For instance, after my last attempt to import the data, Elasticsearch contained only 114237 records out of the original 495089. Is there something wrong that I'm doing and that I'm not aware of?
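
(For reference, here is roughly how I'm comparing the counts. This is just a quick sketch, where my_index and item are the names from the river configuration above and the file path is only a placeholder:)

# number of data rows in the CSV (minus 1 for the header line)
wc -l /tmp/your_file.csv

# number of documents actually present in the index
curl -XGET 'localhost:9200/my_index/item/_count?pretty'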

xxBedy commented 10 years ago

Hi,

Are there any errors in the log file? Can you add a unique ID column to the CSV file and to the river configuration?

By default, the id for each row is generated by UUID.randomUUID().toString().

What's your environment? Hardware, OS, JDK?

Bedy

kelmahrsi commented 10 years ago

Initially I was running Elasticsearch 1.0.0 on an Ubuntu 13.10 VM with 2 GB of RAM and JVM 1.7.0_51, then I switched to testing on Mac OS X 10.9.1 with 8 GB of RAM and JVM 1.6.0_65 (in both cases, the processor is an Intel® Core™ i7-3540M CPU @ 3.00 GHz × 2).

The log file does not mention any errors:

[2014-02-26 18:51:50,804][INFO ][river.csv                ] [Stiltzkin] [csv][csv_river] Going to process files /Users/Khalil/Documents/london_bss/data/londonbss.csv
[2014-02-26 18:51:50,804][INFO ][river.csv                ] [Stiltzkin] [csv][csv_river] Processing file londonbss.csv
[2014-02-26 18:51:51,044][INFO ][cluster.metadata         ] [Stiltzkin] [london_bss] creating index, cause [auto(bulk api)], shards [5]/[1], mappings []
[2014-02-26 18:51:51,967][INFO ][cluster.metadata         ] [Stiltzkin] [london_bss] update_mapping [journey_nomapping] (dynamic)
[2014-02-26 18:51:52,297][WARN ][river.csv                ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:51:52,482][WARN ][river.csv                ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:51:52,698][WARN ][river.csv                ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:51:52,999][WARN ][river.csv                ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:51:53,099][WARN ][river.csv                ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:51:53,309][WARN ][river.csv                ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:51:53,404][WARN ][river.csv                ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:51:53,483][WARN ][river.csv                ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:51:53,641][WARN ][river.csv                ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:51:53,747][WARN ][river.csv                ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:51:53,941][WARN ][river.csv                ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:51:54,092][WARN ][river.csv                ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:51:54,195][WARN ][river.csv                ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:51:54,313][WARN ][river.csv                ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:51:54,415][WARN ][river.csv                ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:52:00,126][WARN ][river.csv                ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:52:25,187][WARN ][river.csv                ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:52:25,283][WARN ][river.csv                ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:52:33,634][WARN ][river.csv                ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:53:07,025][WARN ][river.csv                ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:53:07,058][WARN ][river.csv                ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:53:07,092][WARN ][river.csv                ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:53:16,043][WARN ][river.csv                ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 18:53:16,056][WARN ][river.csv                ] [Stiltzkin] [csv][csv_river] ongoing bulk, 10 crossed threshold 10, waiting
[2014-02-26 19:04:58,212][INFO ][cluster.metadata         ] [Stiltzkin] [kibana-int] creating index, cause [auto(index api)], shards [5]/[1], mappings []
[2014-02-26 19:04:58,871][INFO ][cluster.metadata         ] [Stiltzkin] [kibana-int] update_mapping [dashboard] (dynamic)
[2014-02-26 19:05:43,411][INFO ][river.csv                ] [Stiltzkin] [csv][csv_river] File londonbss.csv.processing, processed lines 5061972
[2014-02-26 19:05:43,426][INFO ][river.csv                ] [Stiltzkin] [csv][csv_river] next run waiting for 5m
[2014-02-26 19:10:43,416][INFO ][river.csv                ] [Stiltzkin] [csv][csv_river] Using configuration: org.elasticsearch.river.csv.Configuration(/Users/Khalil/Documents/london_bss/data, .*\.csv$, true, [rental_id, billable_duration, duration, unique_id_customer_record_number, subscription_id, bike_id, end_date, endstation_id, endstation_logical_terminal, endstation_name, endstationpriority_id, start_date, startstation_id, startstation_logical_terminal, startstation_name, startstationpriority_id, endhourcategory_id, starthourcategory_id, bikeusertype_id, weekday, hourofday, day, hourandday, isweek, hourofdayret, month], 5m, london_bss, journey_nomapping, 100, \, ", ;, -24, 10, id)
[2014-02-26 19:10:43,416][INFO ][river.csv                ] [Stiltzkin] [csv][csv_river] Going to process files {}
[2014-02-26 19:10:43,416][INFO ][river.csv                ] [Stiltzkin] [csv][csv_river] next run waiting for 5m
[2014-02-26 19:15:43,405][INFO ][river.csv                ] [Stiltzkin] [csv][csv_river] Using configuration: org.elasticsearch.river.csv.Configuration(/Users/Khalil/Documents/london_bss/data, .*\.csv$, true, [rental_id, billable_duration, duration, unique_id_customer_record_number, subscription_id, bike_id, end_date, endstation_id, endstation_logical_terminal, endstation_name, endstationpriority_id, start_date, startstation_id, startstation_logical_terminal, startstation_name, startstationpriority_id, endhourcategory_id, starthourcategory_id, bikeusertype_id, weekday, hourofday, day, hourandday, isweek, hourofdayret, month], 5m, london_bss, journey_nomapping, 100, \, ", ;, -24, 10, id)
[2014-02-26 19:15:43,405][INFO ][river.csv                ] [Stiltzkin] [csv][csv_river] Going to process files {}
[2014-02-26 19:15:43,405][INFO ][river.csv                ] [Stiltzkin] [csv][csv_river] next run waiting for 5m

(this log is from the complete CSV file, which contains more than 5 million records)

I do have a column that can play the role of an id: it's called rental_id (which I map to an integer). Sorry for the stupid question, but how can I indicate that it's the id field in the river configuration?

xxBedy commented 10 years ago

I updated the documentation.

You must use "field_id" : "rental_id", and either use a header line in the CSV file with "first_line_is_header" : "true", or fill in the "fields" : [ "column1", "column2", "column3", "column4" ] property.
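
For example, reusing the configuration from your first comment, it could look like this (just a sketch, with field_id added to the csv_file section; adjust the folder, index and type to your setup):

curl -XPUT localhost:9200/_river/my_csv_river/_meta -d '
{
    "type" : "csv",
    "csv_file" : {
        "folder" : "/tmp",
        "filename_pattern" : ".*\\.csv$",
        "poll" : "5m",
        "first_line_is_header" : "true",
        "field_separator" : ";",
        "escape_character" : "\n",
        "quote_character" : "\"",
        "field_id" : "rental_id"
    },
    "index" : {
        "index" : "my_index",
        "type" : "item",
        "bulk_size" : 100,
        "bulk_threshold" : 10
    }
}'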

Bedy

kelmahrsi commented 10 years ago

I just did that and it didn't solve the problem (and the log file still doesn't mention the slightest error).

vtajzich commented 10 years ago

@mahrsi could you provide us with a test CSV file that causes the issue?

kelmahrsi commented 10 years ago

@tajzivit you can download a test CSV file (containing the first 250000 records of the original) from here: https://dl.dropboxusercontent.com/u/8280898/data.csv

I tested with this file and the problem still occurs (only 27462 records were imported into Elasticsearch).

vtajzich commented 10 years ago

Thanks for the example data file.

vtajzich commented 10 years ago

Please try building the sources from master and running it again. It should be OK now; when an error occurs, the river will skip the line or the file, and you will see it in the log.
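
Something along these lines should work. This is a rough sketch only: the actual build command depends on the repository's tooling, and the path to the built zip is a placeholder:

git clone https://github.com/AgileWorksOrg/elasticsearch-river-csv.git
cd elasticsearch-river-csv
mvn clean package    # or whatever build command the repository uses

# re-install the freshly built plugin into Elasticsearch 1.x
bin/plugin --remove river-csv
bin/plugin --install river-csv --url file:///path/to/the/built/plugin.zip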

kelmahrsi commented 10 years ago

Thank you very much! I tried the new version and indeed the file imported correctly!

2bedom commented 10 years ago

I got the same issue, but my CSV has 4,600,000 rows and the import stops at around 750,000. I have the latest plugin. I played around with bulk_size: the smaller I make it, the more rows I can import, but the difference is only about ±2,000 either way. Nevertheless, 750,000 versus 4.2 million is a big gap ;(
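
For reference, the "index" part of my river config currently looks roughly like this (index and type names are placeholders, and the values are just what I'm experimenting with):

    "index" : {
        "index" : "my_index",
        "type" : "item",
        "bulk_size" : 50,
        "bulk_threshold" : 10
    }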

vtajzich commented 10 years ago

Can you provide us with a sample file? Please create a new issue.