brianmhess / cassandra-loader

Delimited file loader for Cassandra
Apache License 2.0

Allow for unescaped non-ascii characters (preferably utf8 encoded) #96

Open sjoerd-vogel opened 6 years ago

sjoerd-vogel commented 6 years ago

Currently all output appears to be escaped by org.apache.commons.lang.StringEscapeUtils::escapeJava, which appears to be designed to escape strings for use in Java code (i.e. strings escaped this way could be copy-pasted directly into a .java file). Apparently this includes encoding non-ASCII characters in a \u[codepoint] format. The CSV reader of our choice did not expect this. I propose adding an option to not escape the output in this way. As long as no double quotes or line breaks appear in the original string, unescaped output is perfectly fine when dealing with CSV files.

Additionally, all instances of PrintStream are created with the single-argument constructor. A PrintStream constructed this way encodes with the platform default charset, which in our case apparently reduces all non-ASCII characters to question marks (?). To allow for UTF-8 output, these calls could simply be replaced by the three-parameter constructor, using the following substitution:

new PrintStream(param) -> new PrintStream(param, false, StandardCharsets.UTF_8.name());

Here false is the autoflush setting, which is also what the single-parameter constructor uses.
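To illustrate the difference, here is a minimal sketch that writes the same string through a PrintStream with two explicit charsets. The class name Utf8PrintStreamSketch and the helper render are made up for this example and are not part of cassandra-loader; US-ASCII stands in for a platform default that cannot represent the character.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of cassandra-loader.
final class Utf8PrintStreamSketch {

    // Prints text through a PrintStream that encodes with the given charset
    // and returns the raw bytes that reached the underlying stream.
    static byte[] render(String text, String charsetName) throws UnsupportedEncodingException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // false = no autoflush, same as the single-argument constructor
        PrintStream out = new PrintStream(buffer, false, charsetName);
        out.print(text);
        out.flush();
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String text = "caf\u00e9"; // "cafe" with an acute accent on the e

        // A charset that cannot encode the character substitutes '?'.
        byte[] ascii = render(text, StandardCharsets.US_ASCII.name());
        System.out.println(new String(ascii, StandardCharsets.US_ASCII));

        // UTF-8 keeps the character, encoded as two bytes.
        byte[] utf8 = render(text, StandardCharsets.UTF_8.name());
        System.out.println(utf8.length + " bytes, round-trips: "
                + new String(utf8, StandardCharsets.UTF_8).equals(text));
    }
}
```

Passing the charset explicitly makes the output independent of the JVM's default encoding, which is the point of the proposed substitution.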

It would be even better to allow type-specific escapes (in the case of CSV: escape double quotes by doubling them), but this could be a separate effort.
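As a rough idea of what a CSV-specific escape could look like, here is a minimal sketch along the lines of RFC 4180: a field is wrapped in double quotes only when it contains a quote, the delimiter, or a line break, and embedded quotes are doubled. The class CsvEscapeSketch and the method escapeCsvField are hypothetical names for illustration, not existing loader code.

```java
// Hypothetical helper; not part of cassandra-loader.
final class CsvEscapeSketch {

    // Returns the field unchanged when it is safe to emit as-is; otherwise
    // wraps it in double quotes and doubles any embedded double quotes.
    static String escapeCsvField(String field, char delimiter) {
        boolean needsQuoting = field.indexOf('"') >= 0
                || field.indexOf(delimiter) >= 0
                || field.indexOf('\n') >= 0
                || field.indexOf('\r') >= 0;
        if (!needsQuoting) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(escapeCsvField("plain", ','));
        System.out.println(escapeCsvField("a,b", ','));
        System.out.println(escapeCsvField("say \"hi\"", ','));
    }
}
```

Non-ASCII characters pass through untouched here, so this would combine with UTF-8 output rather than replace it.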

I would be happy to create a pull request.