slandelle closed this issue 7 years ago.
I'll see if I can have a look, but that is kind of not really CSV; it's more JSON-like.
@arnaudroger Thanks for your answer.
Indeed, RFC4180 doesn't define an escape char, except:

> Section 2, rule 7: If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote. For example: "aaa","b""bb","ccc"

Even so, RFC4180 itself recognizes (it only has Informational status) that there are tons of different CSV interpretations in the wild, and an escape char is a common feature (supported in both Jackson and Apache commons-csv).
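To make the doubled-quote rule from RFC4180 concrete, here is a minimal sketch of an unquoting helper (hypothetical class and method names, not code from either library):

```java
// Minimal sketch of RFC4180 section 2, rule 7: inside a quoted field, a
// literal double-quote is encoded as two double-quotes.
public class Rfc4180Unquote {

    // Turns a quoted field like "b""bb" into its value b"bb.
    static String unquote(String field) {
        if (field.length() < 2 || field.charAt(0) != '"' || field.charAt(field.length() - 1) != '"') {
            return field; // not a quoted field, return as-is
        }
        String inner = field.substring(1, field.length() - 1);
        return inner.replace("\"\"", "\""); // collapse doubled quotes
    }

    public static void main(String[] args) {
        // From the RFC example row: "aaa","b""bb","ccc"
        System.out.println(unquote("\"b\"\"bb\"")); // prints b"bb
        System.out.println(unquote("\"aaa\""));     // prints aaa
    }
}
```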
@slandelle would you expect the unescape to transform \n, \r, \t into their equivalent characters? I suppose so. I'm going to use https://en.wikipedia.org/wiki/Escape_character#JavaScript, apart from \0, because that one depends on what follows.
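The behaviour discussed here could look roughly like the following sketch, which translates backslash sequences into their control-character equivalents (JavaScript-style, \0 excluded; the class and method are hypothetical, not sfm-csv API):

```java
// Sketch of a backslash unescape: \n, \r, \t, \b, \f become their control
// characters; any other escaped char (\\, \", \') is kept literally.
public class BackslashUnescape {

    static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                char next = s.charAt(++i);
                switch (next) {
                    case 'n': out.append('\n'); break;
                    case 'r': out.append('\r'); break;
                    case 't': out.append('\t'); break;
                    case 'b': out.append('\b'); break;
                    case 'f': out.append('\f'); break;
                    default:  out.append(next); // escaped char taken literally
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("a\\tb").equals("a\tb"));   // prints true
        System.out.println(unescape("a\\\\b").equals("a\\b"));  // prints true
    }
}
```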
Hi @arnaudroger
Thanks, will give it a try!
I also have a few perf optimisations to go in tonight.
Cool, looking forward to testing them :)
@slandelle just pushed 3.14.1 with some optimisations. It seems to deliver close to on-par perf with jackson-csv for the cookie file; for a more classic CSV it still outperforms.
Thanks! I'll give it a try tomorrow; it hasn't been synced to Central yet.
> for a more classic csv it still outperform
Could you share some samples, please?
I have a benchmark here: https://github.com/arnaudroger/mapping-benchmark/tree/master/sfm-csv, based on http://www.maxmind.com/download/worldcities/worldcitiespop.txt.gz. The cookie file is tricky because you might be better off just doing a readLine with a BufferedReader; that would actually be a good comparison point.
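The readLine comparison point suggested here could be as simple as the following sketch: a plain BufferedReader loop as the lower bound to compare parsers against (in the real benchmark the reader would come from the cookie file; here a StringReader stands in so the snippet is self-contained):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Baseline sketch: consume a CSV-like input line by line with readLine,
// doing no parsing at all. This is the comparison point mentioned above.
public class ReadLineBaseline {

    static long countLines(Reader in) throws IOException {
        try (BufferedReader reader = new BufferedReader(in)) {
            long lines = 0;
            while (reader.readLine() != null) {
                lines++;
            }
            return lines;
        }
    }

    public static void main(String[] args) throws IOException {
        // In the real benchmark this would be new FileReader("cookies.csv").
        System.out.println(countLines(new StringReader("a,b\n1,2\n3,4\n"))); // prints 3
    }
}
```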
I've built my own benchmark class based on JMH (I'll clean it up and share):
```
Benchmark                           Mode  Cnt     Score    Error  Units
CsvTest.readJackson_cookies         avgt    6   812,681 ± 21,115  ms/op
CsvTest.readJackson_worldcitiespop  avgt    6  1697,252 ± 86,204  ms/op
CsvTest.readSfm_cookies             avgt    6   973,035 ± 66,440  ms/op
CsvTest.readSfm_worldcitiespop      avgt    6  1414,551 ± 63,846  ms/op
```
> The cookie file is tricky because you might be better just doing a readline with a BufferedReader
That's possible. But unless BufferedReader wins by a large margin, I don't think I'll add a special path in Gatling's code to try to detect such a corner case. I'd rather keep things generic, and things must work out of the box for our users (who might not be proficient developers).
Interesting. On my P500 I got 669ms for jackson and 679ms for the sfm iterator; that's on Ubuntu 17.10 on an Intel Xeon.
Here's my benchmark: https://github.com/slandelle/csv-parser-benchmark
Thanks. The mapping to a Map might explain the difference. What do you expect in the map: columnName -> value?
> the map to a map might explain the difference
Indeed. Jackson returns LinkedHashMaps by default, so I hadn't realized the additional cost for sfm. Without the mapTo, sfm wins on worldcitiespop by a wide margin, nice! However, it still loses on cookies.
```
Benchmark         (fileName)          Mode  Cnt     Score     Error  Units
JacksonTest.read  cookies.csv         avgt    5   811,971 ±  32,874  ms/op
JacksonTest.read  worldcitiespop.txt  avgt    5  1881,067 ± 111,855  ms/op
SfmTest.read      cookies.csv         avgt    5   933,828 ±  33,784  ms/op
SfmTest.read      worldcitiespop.txt  avgt    5   983,258 ±  79,383  ms/op
```
> what do you expect in the map columnName -> value ?
Actually, I don't really care, as my API requires me to map to Scala Maps anyway. But for the sake of my use case, I now need to include this additional mapping in my benchmarks.
I translated it to the following Java:
```java
CsvParser.MapToDSL<Map> DSL;
ObjectMapper csvMapperToStringArray;
ObjectReader objectReader;

@Setup
public void setUp() {
    csvMapperToStringArray = new CsvMapper().disable(MapperFeature.SORT_PROPERTIES_ALPHABETICALLY);
    CsvSchema schema = CsvSchema.emptySchema().withHeader().withColumnSeparator(',').withQuoteChar('"');
    objectReader = csvMapperToStringArray.readerFor(Map.class).with(schema);
    DSL = CsvParser.dsl().separator(',').quote('"').mapTo(Map.class);
}

public Reader getReader() throws FileNotFoundException {
    return new FileReader("cookies.csv");
}

@Benchmark
public void sfmRead(Blackhole blackhole) throws IOException {
    try (Reader reader = getReader()) {
        Iterator<Map> iterator = DSL.iterator(reader);
        while (iterator.hasNext()) {
            blackhole.consume(iterator.next());
        }
    }
}

@Benchmark
public void jacksonRead(Blackhole blackhole) throws IOException {
    try (Reader reader = getReader()) {
        MappingIterator<Map> iterator = objectReader.readValues(reader);
        while (iterator.hasNext()) {
            blackhole.consume(iterator.next());
        }
    }
}
```
But I don't get such differences; maybe it's the Scala impacting the inlining... or a proc difference. I ran my latest on an i7 MacBook Pro.
PS: I removed the disableUnescaping, as it's probably not a good idea: it won't do any post-processing to remove quotes and escape chars when needed.
Most likely a proc difference. I ran on a Linux server and results were way better.
Makes sense. I have seen it not performing as well on my old i5... branch prediction has improved quite a bit. I was also thinking: it's not the first time I've seen people mapping rows to maps, so it would probably be worth having specialised code for that.
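Such a specialised row-to-Map path could look something like this sketch: keep the header array around, presize each map, and avoid per-row reflection (purely illustrative, not actual sfm-csv code):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a specialised row-to-Map mapper: the header is parsed once,
// and each row is zipped against it into a presized LinkedHashMap
// (which keeps column order, like Jackson's default mentioned above).
public class RowToMap {

    private final String[] header;

    RowToMap(String[] header) {
        this.header = header;
    }

    Map<String, String> map(String[] row) {
        // Presize to avoid rehashing mid-row.
        Map<String, String> out = new LinkedHashMap<>(header.length * 2);
        for (int i = 0; i < header.length && i < row.length; i++) {
            out.put(header[i], row[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        RowToMap mapper = new RowToMap(new String[] {"name", "value"});
        System.out.println(mapper.map(new String[] {"foo", "42"})); // prints {name=foo, value=42}
    }
}
```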
Hi,
After a very long delay (sorry), I'm finally investigating switching from Jackson's jackson-dataformat-csv to sfm-csv.
Sadly, I hit a major blocker: IMHO, sfm-csv escaping support is broken.
sfm-csv confuses `quoteChar` and `escapeChar`; those are 2 different things (at least, they are in Jackson, and it makes sense).

- `quoteChar` (typically `"`) is to wrap values, so chars inside the wrapped String are not handled as `separator`, eg `"foo;bar";"baz"`
- `escapeChar` (typically `\`) is to escape the following char so it is not handled as a special one (`separator`, `quoteChar` and `escapeChar` itself), eg `"foo\"bar\"baz"`
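To make the distinction concrete, here is a toy parser for a single quoted field that treats the two chars separately, per the semantics described above (a sketch only, not sfm-csv or Jackson code):

```java
// Toy parser for one quoted field, with quoteChar and escapeChar as two
// distinct parameters: the quote delimits the field, the escape makes the
// following char literal.
public class QuoteVsEscape {

    static String parseQuotedField(String field, char quote, char escape) {
        StringBuilder out = new StringBuilder();
        // Assumes the field starts and ends with the quote char.
        for (int i = 1; i < field.length() - 1; i++) {
            char c = field.charAt(i);
            if (c == escape && i + 1 < field.length() - 1) {
                out.append(field.charAt(++i)); // escaped char taken literally
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "foo\"bar\"baz" -> foo"bar"baz when escape is handled correctly
        System.out.println(parseQuotedField("\"foo\\\"bar\\\"baz\"", '"', '\\')); // prints foo"bar"baz
    }
}
```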
Sadly, this doesn't work as expected, as they are considered the same thing (https://github.com/arnaudroger/SimpleFlatMapper/blob/master/sfm-csv/src/main/java/org/simpleflatmapper/csv/CsvParser.java#L547 and https://github.com/arnaudroger/SimpleFlatMapper/blob/master/sfm-csv/src/main/java/org/simpleflatmapper/csv/parser/TextFormat.java#L9), eg `"foo\"bar\"baz"` should yield `foo"bar"baz` but yields `foo\bar\baz`.
Regards