AgileWorksOrg / elasticsearch-river-csv

CSV river for ElasticSearch
Apache License 2.0

upload hangs after about 470000 rows #34

Closed ghost closed 10 years ago

ghost commented 10 years ago

Hi,

I am using elasticsearch-1.3.2-1.noarch on a 2-node cluster with the ALL.zip from http://fec.gov/disclosurep/PDownload.do, and the following curl statement to upload:

    curl -XPUT localhost:9200/_river/my_csv_river/_meta -d '
    {
        "type" : "csv",
        "csv_file" : {
            "folder" : "/u01/app/div/temp",
            "first_line_is_header" : "true"
        },
        "index" : {
            "index" : "contributions",
            "bulk_size" : 100000,
            "bulk_threshold" : 10,
            "type" : "csv_type"
        }
    }'

The unzipped file has about 5M rows. After about 470,000 rows the upload stops and seems to hang, but the java process is using 1 CPU:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20180 elastics 20 0 2044m 1.0g 22m S 99.6 34.4 15:40.89 java

Is this because of the analysis of the columns? How can I improve this?

Regards Hans-Peter

vtajzich commented 10 years ago

You are right. It hangs at about 500k records. I will take a look.

vtajzich commented 10 years ago

On line 477456 it gets caught in this method (from the OpenCSV library). It keeps reading lines and never returns:


public String[] readNext() throws IOException {

        String[] result = null;
        do {
            String nextLine = getNextLine();
            if (!hasNext) {
                // EOF reached; if the parser is still pending (an open
                // quoted field), the partial record is returned as-is.
                return result; // should throw if still pending?
            }
            String[] r = parser.parseLineMulti(nextLine);
            if (r.length > 0) {
                if (result == null) {
                    result = r;
                } else {
                    // Merge fields of a record spanning multiple physical lines.
                    String[] t = new String[result.length + r.length];
                    System.arraycopy(result, 0, t, 0, result.length);
                    System.arraycopy(r, 0, t, result.length, r.length);
                    result = t;
                }
            }
        } while (parser.isPending()); // pending while a quoted field is still open
        return result;
    }
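For context, the parser stays "pending" as long as a quoted field remains open across line boundaries, so an unterminated quote makes `readNext()` keep consuming lines. This is a minimal sketch of that multi-line accumulation logic, not the actual OpenCSV implementation (`MultiLineCsv` and `readRecord` are illustrative names):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Sketch of multi-line CSV record reading, assuming OpenCSV-like semantics:
// a record is complete only when its quote characters balance. A record with
// an unbalanced quote stays "pending" and swallows every following line --
// which is the hang observed in this issue.
public class MultiLineCsv {

    // Returns the next complete record, or null at EOF / on a record whose
    // quote never closes before EOF.
    static String readRecord(BufferedReader reader) throws IOException {
        StringBuilder record = new StringBuilder();
        boolean pending = false;
        String line;
        while ((line = reader.readLine()) != null) {
            if (pending) {
                record.append('\n');
            }
            record.append(line);
            // The record is complete once the quote count is even.
            pending = countQuotes(record) % 2 != 0;
            if (!pending) {
                return record.length() > 0 ? record.toString() : null;
            }
        }
        // EOF reached while a quote is still open: record never completed.
        return null;
    }

    static int countQuotes(CharSequence s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '"') n++;
        }
        return n;
    }

    public static void main(String[] args) throws IOException {
        // One record spanning two lines, then a record with an unbalanced quote.
        String data = "a,\"multi\nline\",c\nb,\"broken,c\nx,y,z\n";
        BufferedReader r = new BufferedReader(new StringReader(data));
        System.out.println(readRecord(r) != null); // first record completes: true
        System.out.println(readRecord(r) == null); // broken quote swallows the rest: true
    }
}
```

With a 5M-row file this also explains the pegged CPU: the reader keeps appending the rest of the file into the pending record instead of returning.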
vtajzich commented 10 years ago

The example file has records spanning multiple lines. After 477456 records are processed (note: records, not lines), there is a line with an unbalanced number of quote characters (").

Please check your file.
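A quick way to locate the offending record is to count quote characters per record; a rough sketch (`firstBrokenRecordLine` is an illustrative name, and it assumes standard CSV quoting where embedded quotes are doubled, so every well-formed record has an even number of `"` characters):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Sketch: scan CSV input and report where the first record with
// unbalanced quotes starts. A record may span several physical lines;
// it is considered complete once its quote count is even.
public class FindUnbalancedQuotes {

    // Returns the 1-based line where the broken record starts, or -1 if clean.
    static long firstBrokenRecordLine(BufferedReader in) throws IOException {
        long lineNo = 0, recordStart = 1, quotes = 0;
        String line;
        while ((line = in.readLine()) != null) {
            lineNo++;
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) == '"') quotes++;
            }
            if (quotes % 2 == 0) {   // record completed on this line
                recordStart = lineNo + 1;
                quotes = 0;
            }
        }
        // Odd quote count at EOF: the record starting at recordStart never closed.
        return (quotes % 2 != 0) ? recordStart : -1;
    }

    public static void main(String[] args) throws IOException {
        String ok = "a,\"x\",b\nc,d,e\n";
        String bad = "a,\"x\",b\nc,\"broken\nd,e,f\n";
        System.out.println(firstBrokenRecordLine(new BufferedReader(new StringReader(ok))));  // -1
        System.out.println(firstBrokenRecordLine(new BufferedReader(new StringReader(bad)))); // 2
    }
}
```

Pointing this at the unzipped ALL.zip file should narrow down which record around 477456 has the stray quote.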

ghost commented 10 years ago

Ok, thanks, I will have a look.