AgileWorksOrg / elasticsearch-river-csv

CSV river for ElasticSearch
Apache License 2.0

Getting java.lang.ArrayIndexOutOfBoundsException Error after a while #7

Closed clemsos closed 10 years ago

clemsos commented 10 years ago

I am processing large CSV files with the river to index them into ES. After a few bulks, I got the following error:

[2014-01-28 08:08:15,712][WARN ][river.csv                ] [Stratosfire] [csv][csv] ongoing bulk, [10] crossed threshold [10], waiting
[2014-01-28 08:08:16,171][WARN ][river.csv                ] [Stratosfire] [csv][csv] ongoing bulk, [10] crossed threshold [10], waiting
[2014-01-28 08:08:16,455][WARN ][river.csv                ] [Stratosfire] [csv][csv] ongoing bulk, [10] crossed threshold [10], waiting
[2014-01-28 08:08:17,061][WARN ][river.csv                ] [Stratosfire] [csv][csv] ongoing bulk, [10] crossed threshold [10], waiting
[2014-01-28 08:08:21,019][WARN ][river.csv                ] [Stratosfire] [csv][csv] ongoing bulk, [10] crossed threshold [10], waiting
[2014-01-28 08:08:22,108][ERROR][river.csv                ] 
java.lang.ArrayIndexOutOfBoundsException

This may be related to my CSV, which is mostly Chinese text, and there could be a few characters that won't be UTF-8 compliant. However, is there some logic that could be implemented to prevent this from making everything fail?

Thanks!

xxBedy commented 10 years ago

Hi,

Can you attach the full stack trace? I think this error occurred because some line has a different number of columns than the others (e.g. a row with an unescaped delimiter that yields an extra field, or a truncated row that yields too few).
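
For illustration, here's a minimal, self-contained sketch of that failure mode (hypothetical column names, not the river's actual code): indexing into a row that is shorter than the header throws exactly this exception.

    public class ColumnMismatchDemo {
        public static void main(String[] args) {
            String[] header = "mid,uid,text".split(",");      // 3 expected columns
            String[] row = "mZwXiXBkvi,uII5OUMAO".split(","); // malformed row: only 2 fields

            int position = 0;
            for (String fieldName : header) {
                // On the third header name this reads row[2] and throws
                // java.lang.ArrayIndexOutOfBoundsException: 2
                System.out.println(fieldName + " = " + row[position++]);
            }
        }
    }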

Thank you,
Bedy


clemsos commented 10 years ago

Thanks for the quick answer.

How do I log the full stacktrace?

I was using the build of elasticsearch-river-csv downloaded from this thread (the first build from the GitHub repo didn't work): https://groups.google.com/d/msg/elasticsearch/bvHoZvvTjNY/sgzzsm39ZnAJ

Now I have rebuilt it from the GitHub repo and made it work. I get this log:

[2014-01-28 12:58:08,952][INFO ][cluster.metadata         ] [Magnum, Moses] [weiboscope] deleting index
[2014-01-28 12:58:15,786][INFO ][cluster.metadata         ] [Magnum, Moses] [[_river]] remove_mapping [csv]
[2014-01-28 12:58:15,789][INFO ][river.csv                ] [Magnum, Moses] [csv][csv] closing csv stream river
[2014-01-28 12:58:15,789][ERROR][river.csv                ] Error during waiting.
java.lang.InterruptedException: sleep interrupted
    at java.lang.Thread.sleep(Native Method)
    at org.elasticsearch.river.csv.CSVRiver.delay(CSVRiver.java:125)
    at org.elasticsearch.river.csv.CSVRiver.access$700(CSVRiver.java:51)
    at org.elasticsearch.river.csv.CSVRiver$CSVConnector.run(CSVRiver.java:199)
    at java.lang.Thread.run(Thread.java:722)
[2014-01-28 12:58:32,083][INFO ][cluster.metadata         ] [Magnum, Moses] [_river] update_mapping [csv] (dynamic)
[2014-01-28 12:58:32,084][INFO ][river.routing            ] [Magnum, Moses] no river _meta document found, retrying in 1000 ms
[2014-01-28 12:58:33,088][INFO ][river.csv                ] [Magnum, Moses] [csv][csv] creating csv stream river for [/home/clemsos/Dev/mitras/data/] with pattern [.*\.csv$]
[2014-01-28 13:09:54,729][INFO ][river.csv                ] [Magnum, Moses] [csv][csv] starting csv stream
[2014-01-28 13:09:54,738][INFO ][river.csv                ] [Magnum, Moses] [csv][csv] Processing file week11.csv
[2014-01-28 13:09:54,751][INFO ][cluster.metadata         ] [Magnum, Moses] [_river] update_mapping [csv] (dynamic)
[2014-01-28 13:09:58,538][WARN ][river.csv                ] [Magnum, Moses] [csv][csv] ongoing bulk, [10] crossed threshold [10], waiting
[2014-01-28 13:09:58,897][WARN ][river.csv                ] [Magnum, Moses] [csv][csv] ongoing bulk, [10] crossed threshold [10], waiting
[2014-01-28 13:09:59,295][WARN ][river.csv                ] [Magnum, Moses] [csv][csv] ongoing bulk, [10] crossed threshold [10], waiting
[2014-01-28 13:09:59,574][WARN ][river.csv                ] [Magnum, Moses] [csv][csv] ongoing bulk, [10] crossed threshold [10], waiting
[2014-01-28 13:09:59,989][WARN ][river.csv                ] [Magnum, Moses] [csv][csv] ongoing bulk, [10] crossed threshold [10], waiting
[2014-01-28 13:10:00,418][WARN ][river.csv                ] [Magnum, Moses] [csv][csv] ongoing bulk, [10] crossed threshold [10], waiting
[2014-01-28 13:10:01,207][WARN ][river.csv                ] [Magnum, Moses] [csv][csv] ongoing bulk, [10] crossed threshold [10], waiting
[2014-01-28 13:10:06,635][WARN ][river.csv                ] [Magnum, Moses] [csv][csv] ongoing bulk, [10] crossed threshold [10], waiting
[2014-01-28 13:10:07,031][WARN ][river.csv                ] [Magnum, Moses] [csv][csv] ongoing bulk, [10] crossed threshold [10], waiting
[2014-01-28 13:10:11,445][ERROR][river.csv                ] 10
java.lang.ArrayIndexOutOfBoundsException: 10
    at org.elasticsearch.river.csv.CSVRiver$CSVConnector.processFile(CSVRiver.java:241)
    at org.elasticsearch.river.csv.CSVRiver$CSVConnector.run(CSVRiver.java:193)
    at java.lang.Thread.run(Thread.java:722)

Also, I just looked into the data imported by the river and it seems like Chinese characters aren't supported. It is all UTF-8, so this is quite weird...

http://localhost:9200/_search?q=*&pretty

{
  "_index": "weiboscope",
  "_type": "tweet",
  "_id": "ddee43df-0b2d-4b15-9548-bbb6f4ac6942",
  "_score": 1,
  "_source": {
    "mid": "mZwXiXBkvi",
    "retweeted_status_mid": "mnzjbNU2E8",
    "uid": "uII5OUMAO",
    "retweeted_uid": "",
    "source": "iPad���������",
    "image": "0",
    "text": "������������������������������������������ //@uMLLBRATV��� ������ //@uW0EVQ1OR��� ������������������������������������������������������������",
    "geo": "",
    "created_at": "2012-03-05 00:23:24",
    "deleted_last_seen": "",
    "permission_denied": ""
  }
},
xxBedy commented 10 years ago

Hi,

What's your OS, locale, etc.? Try setting JAVA_OPTS="-Dfile.encoding=UTF8" as an environment variable or in the elasticsearch startup script.
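
The reason this can matter: anything built on a plain FileReader decodes with the JVM's default charset (file.encoding), so on a non-UTF-8 locale, multi-byte Chinese text turns into replacement characters. A minimal sketch of the difference (assuming the plugin reads files via the default charset, which I haven't verified; the file name is taken from your log):

    import java.io.*;
    import java.nio.charset.StandardCharsets;

    public class EncodingDemo {
        public static void main(String[] args) throws IOException {
            File csv = new File("week11.csv");

            // Decodes with the JVM default charset (file.encoding); on a
            // non-UTF-8 locale, Chinese text becomes replacement characters.
            BufferedReader platformDefault = new BufferedReader(new FileReader(csv));
            System.out.println(platformDefault.readLine());
            platformDefault.close();

            // Decodes explicitly as UTF-8, independent of the platform locale.
            BufferedReader utf8 = new BufferedReader(
                    new InputStreamReader(new FileInputStream(csv), StandardCharsets.UTF_8));
            System.out.println(utf8.readLine());
            utf8.close();
        }
    }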

The exception is thrown on this line: builder.field((String) fieldName, nextLine[position++]);

The problem is that the processed line has the wrong number of columns, as I said in my previous email.
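
To your earlier question about not failing the whole import: a defensive check could skip (and log) rows whose field count doesn't match the header instead of throwing. This is only a sketch of the idea, with hypothetical names, not the river's actual code:

    import java.util.Arrays;
    import java.util.List;

    public class RowGuard {
        // Hypothetical pre-check that could run before the
        // builder.field(...) loop in processFile().
        static boolean hasExpectedColumns(String[] row, List<String> fieldNames) {
            if (row.length != fieldNames.size()) {
                System.err.println("Skipping row with " + row.length
                        + " fields, expected " + fieldNames.size()
                        + ": " + Arrays.toString(row));
                return false;
            }
            return true;
        }

        public static void main(String[] args) {
            List<String> fieldNames = Arrays.asList("mid", "uid", "text");
            System.out.println(hasExpectedColumns(
                    new String[]{"m1", "u1", "hello"}, fieldNames)); // true
            System.out.println(hasExpectedColumns(
                    new String[]{"m2", "u2"}, fieldNames));          // false, logged
        }
    }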

Bedy


vtajzich commented 10 years ago

@clemsos is it solved?

clemsos commented 10 years ago

Yes, thanks for the help!