embulk / embulk-standards

Apache License 2.0
0 stars 0 forks source link

Cannot load text correctly by CsvParserPlugin if it includes `\0`. #38

Open civitaspo opened 7 years ago

civitaspo commented 7 years ago

Hi maintainers,

We sometimes encounter a problem CSV Parser cannot parse string containing \0 . This seems to be caused by CSV Parser judging that \0 is the end of a sentence.

https://github.com/embulk/embulk/blob/45aa8fc1520cb43f9285b161cb1d033c90cc33fd/embulk-standards/src/main/java/org/embulk/standards/CsvTokenizer.java#L24

How can we avoid this problem? or Could you solve the problem?

See the following code to reproduce the problem.

$ ruby -e '5.times {|i| puts "#{i}\taaaaa\0aaa#{i}a\tname#{i}"}' > data.tsv
$ cat data.tsv  # the below text includes `\0`
0   aaaaaaaa0a  name0
1   aaaaaaaa1a  name1
2   aaaaaaaa2a  name2
3   aaaaaaaa3a  name3
4   aaaaaaaa4a  name4

$ cat config.yml
in:
  type: file
  path_prefix: ./data.tsv
  parser:
    type: csv
    delimiter: "\t"
    skip_header_lines: 0
    columns:
      - { name: id, type: long }
      - { name: description, type: string }
      - { name: name, type: string }
    stop_on_invalid_record: true
out:
  type: stdout

$ embulk run config.yml
2017-05-16 13:24:04.405 +0900: Embulk v0.8.21
2017-05-16 13:24:06.004 +0900 [INFO] (0001:transaction): Listing local files at directory '.' filtering filename by prefix 'data.tsv'
2017-05-16 13:24:06.006 +0900 [INFO] (0001:transaction): "follow_symlinks" is set false. Note that symbolic links to directories are skipped.
2017-05-16 13:24:06.010 +0900 [INFO] (0001:transaction): Loading files [data.tsv]
2017-05-16 13:24:06.055 +0900 [INFO] (0001:transaction): Using local thread executor with max_threads=16 / output tasks 8 = input tasks 1 * 8
2017-05-16 13:24:06.062 +0900 [INFO] (0001:transaction): {done:  0 / 1, running: 0}
2017-05-16 13:24:06.121 +0900 [INFO] (0001:transaction): {done:  1 / 1, running: 0}
org.embulk.exec.PartialExecutionException: org.embulk.spi.DataException: Invalid record at line 1: 0    aaaaaaaa0a  name0
    at org.embulk.exec.BulkLoader$LoaderState.buildPartialExecuteException(org/embulk/exec/BulkLoader.java:373)
    at org.embulk.exec.BulkLoader.doRun(org/embulk/exec/BulkLoader.java:591)
    at org.embulk.exec.BulkLoader.access$000(org/embulk/exec/BulkLoader.java:33)
    at org.embulk.exec.BulkLoader$1.run(org/embulk/exec/BulkLoader.java:389)
    at org.embulk.exec.BulkLoader$1.run(org/embulk/exec/BulkLoader.java:385)
    at org.embulk.spi.Exec.doWith(org/embulk/spi/Exec.java:25)
    at org.embulk.exec.BulkLoader.run(org/embulk/exec/BulkLoader.java:385)
    at org.embulk.EmbulkEmbed.run(org/embulk/EmbulkEmbed.java:180)
    at java.lang.reflect.Method.invoke(java/lang/reflect/Method.java:498)
    at org.jruby.javasupport.JavaMethod.invokeDirectWithExceptionHandling(org/jruby/javasupport/JavaMethod.java:453)
    at org.jruby.javasupport.JavaMethod.invokeDirect(org/jruby/javasupport/JavaMethod.java:314)
    at RUBY.run(/Users/takahiro.nakayama/bin/embulk!/embulk/runner.rb:84)
    at RUBY.run(/Users/takahiro.nakayama/bin/embulk!/embulk/command/embulk_run.rb:269)
    at RUBY.<main>(/Users/takahiro.nakayama/bin/embulk!/embulk/command/embulk_main.rb:2)
    at org.jruby.Ruby.runInterpreter(org/jruby/Ruby.java:850)
    at org.jruby.Ruby.loadFile(org/jruby/Ruby.java:2976)
    at org.jruby.RubyKernel.requireCommon(org/jruby/RubyKernel.java:963)
    at org.jruby.RubyKernel.require(org/jruby/RubyKernel.java:956)
    at org.jruby.RubyKernel$INVOKER$s$1$0$require19.call(org/jruby/RubyKernel$INVOKER$s$1$0$require19.gen)
    at RUBY.(root)(uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rubygems/core_ext/kernel_require.rb:1)
    at Users.takahiro_dot_nakayama.bin.embulk.embulk.command.embulk_bundle.invokeOther66:require(Users/takahiro_dot_nakayama/bin/embulk/embulk/command/file:/Users/takahiro.nakayama/bin/embulk!/embulk/command/embulk_bundle.rb:51)
    at Users.takahiro_dot_nakayama.bin.embulk.embulk.command.embulk_bundle.<main>(file:/Users/takahiro.nakayama/bin/embulk!/embulk/command/embulk_bundle.rb:51)
    at java.lang.invoke.MethodHandle.invokeWithArguments(java/lang/invoke/MethodHandle.java:627)
    at org.jruby.Ruby.runScript(org/jruby/Ruby.java:834)
    at org.jruby.Ruby.runNormally(org/jruby/Ruby.java:749)
    at org.jruby.Ruby.runNormally(org/jruby/Ruby.java:767)
    at org.jruby.Ruby.runFromMain(org/jruby/Ruby.java:580)
    at org.jruby.Main.doRunFromMain(org/jruby/Main.java:425)
    at org.jruby.Main.internalRun(org/jruby/Main.java:313)
    at org.jruby.Main.run(org/jruby/Main.java:242)
    at org.jruby.Main.main(org/jruby/Main.java:204)
    at org.embulk.cli.Main.main(org/embulk/cli/Main.java:23)
Caused by: org.embulk.spi.DataException: Invalid record at line 1: 0    aaaaaaaa0a  name0
    at org.embulk.standards.CsvParserPlugin.run(CsvParserPlugin.java:368)
    at org.embulk.spi.FileInputRunner.run(FileInputRunner.java:156)
    at org.embulk.exec.LocalExecutorPlugin$ScatterExecutor.runInputTask(LocalExecutorPlugin.java:294)
    at org.embulk.exec.LocalExecutorPlugin$ScatterExecutor.access$000(LocalExecutorPlugin.java:212)
    at org.embulk.exec.LocalExecutorPlugin$ScatterExecutor$1.call(LocalExecutorPlugin.java:257)
    at org.embulk.exec.LocalExecutorPlugin$ScatterExecutor$1.call(LocalExecutorPlugin.java:253)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.embulk.standards.CsvTokenizer$TooFewColumnsException: Too few columns
    at org.embulk.standards.CsvTokenizer.nextColumn(CsvTokenizer.java:168)
    at org.embulk.standards.CsvTokenizer.nextColumnOrNull(CsvTokenizer.java:379)
    at org.embulk.standards.CsvParserPlugin$1.nextColumn(CsvParserPlugin.java:346)
    at org.embulk.standards.CsvParserPlugin$1.stringColumn(CsvParserPlugin.java:302)
    at org.embulk.spi.Column.visit(Column.java:58)
    at org.embulk.spi.Schema.visitColumns(Schema.java:81)
    at org.embulk.standards.CsvParserPlugin.run(CsvParserPlugin.java:259)
    ... 9 more

Error: org.embulk.spi.DataException: Invalid record at line 1: 0    aaaaaaaa0a  name0
dmikurube commented 7 years ago

@civitaspo Thanks for reporting!

Hmm, I'm not sure why they have been '\0', but I thought they can be just -1, -2 or like that since Java's char is signed 2-bytes. What do you think? @muga

muga commented 7 years ago

@dmikurube @civitaspo We had used '\0' as CsvTokenizer's control character for end of line without thinking deeply. NO_QUOTE and NO_ESCAPE as well. I think, we can change those values with negative numbers as @dmikurube mentioned.

diff --git a/embulk-standards/src/main/java/org/embulk/standards/CsvTokenizer.java b/embulk-standards/src/main/java/org/embulk/standards/CsvTokenizer.java
index d5d2583..0683f87 100644
--- a/embulk-standards/src/main/java/org/embulk/standards/CsvTokenizer.java
+++ b/embulk-standards/src/main/java/org/embulk/standards/CsvTokenizer.java
@@ -21,9 +21,9 @@ public class CsvTokenizer
         BEGIN, VALUE, QUOTED_VALUE, AFTER_QUOTED_VALUE, FIRST_TRIM, LAST_TRIM_OR_VALUE,
     }

-    private static final char END_OF_LINE = '\0';
-    static final char NO_QUOTE = '\0';
-    static final char NO_ESCAPE = '\0';
+    private static final char END_OF_LINE = (char)-1;
+    static final char NO_QUOTE = (char)-2;
+    static final char NO_ESCAPE = (char)-3;

     private final char delimiterChar;
     private final String delimiterFollowingString;

How about it?

dmikurube commented 7 years ago

@muga Thanks. Actually That is not a perfect solution (strings may contain 0xffff as well), but the situation may get a bit better by that.

To fix it better, we need to take a deeper look inside the entire CsvTokenizer... I hope adding a flag solves the problem, but I'm not perfectly sure.

dmikurube commented 7 years ago

I guess it is still a reasonable assumption that "CSV" should not contain '\0'' (seeing RFC4180), but it's worth solving since CSVs are so varied.

muga commented 7 years ago

Yea, or we may be able to extract the value as config option parameter and users configure any value. It is '\0' by default.