brianmhess / cassandra-loader

Delimited file loader for Cassandra
Apache License 2.0

Using csvp parser.parseNext() instead of reader.readLine() #69

Closed mj-jadhav closed 6 years ago

mj-jadhav commented 7 years ago

Why not let the Univocity parser split the records itself instead of calling readLine()? https://github.com/al3xandru/cassandra-loader/blob/parser/src/main/java/com/datastax/loader/CqlDelimLoadTask.java#L191

Many of the parser settings don't work because of this. For example, the following is a single row in my CSV:

a,b,c,"d
e",f

Instead of treating it as a single record, your tool turns it into two rows. Splitting lines outside of the parser itself not only breaks any quoted value that contains a line break, it's also WAY slower (something like 3-4 times slower) and generates twice the garbage.
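To make the failure concrete, here is a minimal sketch using only the Java standard library. It is not the cassandra-loader code and not Univocity's implementation; the class and method names (`QuotedRecordDemo`, `splitByLine`, `splitByRecord`) are hypothetical. It assumes `"` is the quote character, that quotes inside a value are escaped by doubling them, and that records end with `\n`.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical demo class, for illustration only.
final class QuotedRecordDemo {

    // Line-based splitting: every physical line becomes one record.
    // BufferedReader.lines() yields the same strings as repeated readLine() calls.
    static List<String> splitByLine(String input) {
        BufferedReader reader = new BufferedReader(new StringReader(input));
        return reader.lines().collect(Collectors.toList());
    }

    // Quote-aware splitting: a newline ends a record only outside quotes.
    static List<String> splitByRecord(String input) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '"') {
                // A doubled quote ("") toggles twice, so the state is unchanged.
                inQuotes = !inQuotes;
            }
            if (c == '\n' && !inQuotes) {
                records.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            records.add(current.toString()); // last record without a trailing newline
        }
        return records;
    }

    public static void main(String[] args) {
        String csv = "a,b,c,\"d\ne\",f\n"; // the row quoted above
        System.out.println(splitByLine(csv).size());   // prints 2
        System.out.println(splitByRecord(csv).size()); // prints 1
    }
}
```

The line-based version reports two records for the row above, while the quote-aware version keeps it as one. A real CSV parser tracks this state as part of parsing, which is why splitting before the parser sees the data defeats its quote handling.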

Please fix this. For now I made a temporary hack for my own use case, because the codebase has a lot of abstractions.

brianmhess commented 7 years ago

I have wanted to do this. However, one of the primary features of cassandra-loader is the ability to log all errors to an error file that can be examined later. To do that, I need to be able to get the original line or lines from the file so that I can log the error properly. That is not something that Univocity allows for. I did look into modifying the Univocity parser to support this, but stopped short. If you have an example of how you accomplished this, I'd be very interested to take a look.
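One way the two requirements could be reconciled, sketched here with the standard library only: keep reading physical lines, but join them into one raw record while a quote is still open. The raw text and its starting line number then remain available for the error file, and each complete record can be handed to the parser on its own. This is an assumption-laden sketch, not how cassandra-loader or Univocity works; `RawRecordReader`, `nextRawRecord`, and `recordStartLine` are hypothetical names. It assumes `"` as the quote character with doubled-quote escaping, and it rejoins lines with `\n`, so an original `\r\n` terminator inside a quoted value is not preserved byte for byte.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.Iterator;

// Hypothetical helper, for illustration only.
final class RawRecordReader {
    private final Iterator<String> lines; // physical lines, terminators already stripped
    private long lineNumber = 0;          // physical lines consumed so far
    private long recordStartLine = 0;     // first physical line of the last record returned

    RawRecordReader(Iterator<String> lines) {
        this.lines = lines;
    }

    // Returns the raw text of the next record, or null at end of input.
    String nextRawRecord() {
        if (!lines.hasNext()) {
            return null;
        }
        String line = lines.next();
        lineNumber++;
        recordStartLine = lineNumber;
        StringBuilder raw = new StringBuilder(line);
        boolean inQuotes = hasOddQuoteCount(line);
        // Keep appending physical lines while a quoted value is still open.
        // If input ends inside a quote, the partial record is returned as-is.
        while (inQuotes && lines.hasNext()) {
            line = lines.next();
            lineNumber++;
            raw.append('\n').append(line);
            inQuotes ^= hasOddQuoteCount(line);
        }
        return raw.toString();
    }

    long recordStartLine() {
        return recordStartLine;
    }

    // An odd number of quote characters flips the open/closed quote state.
    private static boolean hasOddQuoteCount(String s) {
        boolean odd = false;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '"') {
                odd = !odd;
            }
        }
        return odd;
    }

    public static void main(String[] args) {
        String csv = "a,b,c,\"d\ne\",f\n1,2,3,4,5\n";
        BufferedReader reader = new BufferedReader(new StringReader(csv));
        RawRecordReader records = new RawRecordReader(reader.lines().iterator());
        String raw;
        while ((raw = records.nextRawRecord()) != null) {
            // On a parse or insert failure, this raw text could be written to the error file.
            System.out.println("record starting at line " + records.recordStartLine() + ": " + raw);
        }
    }
}
```

This keeps the error-logging behavior at the cost of scanning each line for quotes before parsing, so it would not recover the speed advantage of letting the parser read the stream directly.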