Open civitaspo opened 7 years ago
@civitaspo Thanks for reporting!
Hmm, I'm not sure why they have been '\0', but I thought they could just be -1, -2, or the like, since Java's char is 2 bytes (though unsigned, so they would need a cast). What do you think? @muga
@dmikurube @civitaspo We had used '\0' as CsvTokenizer's control character for end of line without thinking deeply. The same goes for NO_QUOTE and NO_ESCAPE. I think we can change those values to negative numbers, as @dmikurube mentioned.
diff --git a/embulk-standards/src/main/java/org/embulk/standards/CsvTokenizer.java b/embulk-standards/src/main/java/org/embulk/standards/CsvTokenizer.java
index d5d2583..0683f87 100644
--- a/embulk-standards/src/main/java/org/embulk/standards/CsvTokenizer.java
+++ b/embulk-standards/src/main/java/org/embulk/standards/CsvTokenizer.java
@@ -21,9 +21,9 @@ public class CsvTokenizer
         BEGIN, VALUE, QUOTED_VALUE, AFTER_QUOTED_VALUE, FIRST_TRIM, LAST_TRIM_OR_VALUE,
     }
 
-    private static final char END_OF_LINE = '\0';
-    static final char NO_QUOTE = '\0';
-    static final char NO_ESCAPE = '\0';
+    private static final char END_OF_LINE = (char)-1;
+    static final char NO_QUOTE = (char)-2;
+    static final char NO_ESCAPE = (char)-3;
 
     private final char delimiterChar;
     private final String delimiterFollowingString;
How about it?
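For reference, here is a small standalone check of what those casts evaluate to. Java's char is an unsigned 16-bit type, so the negative values wrap around: (char)-1 is 0xFFFF, (char)-2 is 0xFFFE, and (char)-3 is 0xFFFD, which is the Unicode replacement character and can show up in decoded text. The class name `CharSentinelCheck` and the helper `containsSentinel` are made up for this illustration and are not part of Embulk.

```java
// Illustration only: CharSentinelCheck is a hypothetical class, not Embulk code.
class CharSentinelCheck {
    // Same cast expressions as in the proposed diff.
    static final char END_OF_LINE = (char) -1;
    static final char NO_QUOTE = (char) -2;
    static final char NO_ESCAPE = (char) -3;

    // Returns true if the text contains the END_OF_LINE sentinel value.
    static boolean containsSentinel(String text) {
        return text.indexOf(END_OF_LINE) >= 0;
    }

    public static void main(String[] args) {
        // char is unsigned, so the casts wrap into the top of the 16-bit range.
        System.out.printf("END_OF_LINE = 0x%04X%n", (int) END_OF_LINE); // 0xFFFF
        System.out.printf("NO_QUOTE    = 0x%04X%n", (int) NO_QUOTE);    // 0xFFFE
        System.out.printf("NO_ESCAPE   = 0x%04X%n", (int) NO_ESCAPE);   // 0xFFFD

        // A String can still hold the sentinel value, so a collision remains possible.
        System.out.println(containsSentinel("a\uFFFFb")); // true
        // A NUL character no longer collides with END_OF_LINE.
        System.out.println(containsSentinel("a\0b"));     // false
    }
}
```

So the change would stop '\0' from colliding with the sentinels, but it moves the collision to other code points rather than removing it.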
@muga Thanks. Actually, that is not a perfect solution (strings may contain 0xffff as well), but it may improve the situation a bit.
To fix it properly, we need to take a deeper look at the entire CsvTokenizer... I hope adding a flag solves the problem, but I'm not entirely sure.
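One way to keep the end-of-line signal out of the character data entirely is to return an int instead of a char, the same convention java.io.Reader.read() uses when it returns -1 at end of stream. The sketch below is only a rough illustration of that idea, assuming a line is scanned character by character; `LineCursor`, `next`, and `countChars` are hypothetical names and do not reflect how CsvTokenizer is actually structured.

```java
// Illustration only: LineCursor is a hypothetical class, not Embulk code.
class LineCursor {
    // -1 lies outside the char range (0..0xFFFF), so no input character can equal it.
    static final int END_OF_LINE = -1;

    private final String line;
    private int pos = 0;

    LineCursor(String line) {
        this.line = line;
    }

    // Returns the next character widened to int, or END_OF_LINE once the line is exhausted.
    int next() {
        if (pos < line.length()) {
            return line.charAt(pos++);
        }
        return END_OF_LINE;
    }

    // Counts the characters seen before END_OF_LINE is returned.
    static int countChars(String line) {
        LineCursor cursor = new LineCursor(line);
        int count = 0;
        while (cursor.next() != END_OF_LINE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Both NUL and 0xFFFF are treated as ordinary data here.
        System.out.println(countChars("a\0b\uFFFF")); // 4
    }
}
```

The cost is that every caller has to compare against the int sentinel before narrowing back to char, which is why this would need a wider look at the tokenizer.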
I guess it is still a reasonable assumption that "CSV" should not contain '\0' (seeing RFC4180), but it's worth solving since CSVs are so varied.
Yeah, or we could expose the value as a config option so that users can set any value, with '\0' as the default.
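A toy version of that idea, assuming the sentinel is simply passed in at construction time. `ConfigurableSentinelScanner` is a hypothetical class, and its `scan` method only mimics the reported symptom (scanning stops at the sentinel); it is not Embulk's config API or tokenizer logic.

```java
// Illustration only: a toy scanner with a configurable in-band sentinel, not Embulk code.
class ConfigurableSentinelScanner {
    static final char DEFAULT_END_OF_LINE = '\0';

    private final char endOfLine;

    // Uses '\0' unless the caller configures another value.
    ConfigurableSentinelScanner() {
        this(DEFAULT_END_OF_LINE);
    }

    ConfigurableSentinelScanner(char endOfLine) {
        this.endOfLine = endOfLine;
    }

    // Returns the part of the line before the first sentinel character,
    // or the whole line if the sentinel does not occur.
    String scan(String line) {
        int end = line.indexOf(endOfLine);
        return end < 0 ? line : line.substring(0, end);
    }

    public static void main(String[] args) {
        String line = "ab\0cd";
        // Default sentinel: the embedded NUL cuts the line short.
        System.out.println(new ConfigurableSentinelScanner().scan(line).length());         // 2
        // Configured sentinel: the NUL is kept as data.
        System.out.println(new ConfigurableSentinelScanner('\uFFFF').scan(line).length()); // 5
    }
}
```

Users whose data contains '\0' could then pick a character that never occurs in their files, although whatever value is configured still cannot appear in the data.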
Hi maintainers,
We sometimes encounter a problem where the CSV parser cannot parse a string containing \0. This seems to be caused by the CSV parser treating \0 as the end of a line: https://github.com/embulk/embulk/blob/45aa8fc1520cb43f9285b161cb1d033c90cc33fd/embulk-standards/src/main/java/org/embulk/standards/CsvTokenizer.java#L24
How can we avoid this problem, or could you fix it?
See the following code to reproduce the problem.