arnaudroger / SimpleFlatMapper

Fast and Easy mapping from database and csv to POJO. A java micro ORM, lightweight alternative to iBatis and Hibernate. Fast Csv Parser and Csv Mapper
http://simpleflatmapper.org
MIT License
440 stars 76 forks source link

sfm-csv confuses quoteChar and escapeChar #459

Closed slandelle closed 7 years ago

slandelle commented 7 years ago

Hi,

After a very long delay (sorry), I'm finally investigating switching from Jackson's jackson-dataformat-csv to sfm-csv.

Sadly, I hit a major blocker: IMHO, sfm-csv escaping support is broken.

sfm-csv confuses quoteChar and escapeChar, those are 2 different things (at least, they are in Jackson, and it makes sense).

Sadly, this doesn't work as expected as they are considered the same thing (https://github.com/arnaudroger/SimpleFlatMapper/blob/master/sfm-csv/src/main/java/org/simpleflatmapper/csv/CsvParser.java#L547 and https://github.com/arnaudroger/SimpleFlatMapper/blob/master/sfm-csv/src/main/java/org/simpleflatmapper/csv/parser/TextFormat.java#L9), eg: "foo\"bar\"baz":

Regards

arnaudroger commented 7 years ago

I'll see if I can have a look, but that is kind of not really csv and more json like.

slandelle commented 7 years ago

@arnaudroger Thanks for your answer.

Indeed, RFC4180 doesn't have escaping char, except:

chapt 2.7 If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote. For example: "aaa","b""bb","ccc"

Then, even RFC4180 recognizes (it has Informational status) there are tons of different CSV interpretations in the wild, as escaping char is a common feature (supported in Jackson and Apache commons-csv).

arnaudroger commented 7 years ago

@slandelle would you expect the unescape to handle \n\r\t to be transformed to their equivalent character? I suppose so. I'm gonna use the https://en.wikipedia.org/wiki/Escape_character#JavaScript apart from \0 because it depends on what follows.

slandelle commented 7 years ago

Hi @arnaudroger

Thanks, will give it a try!

arnaudroger commented 7 years ago

I also got a few perf opt to go in tonight

slandelle commented 7 years ago

Cool, looking forward to testing them :)

arnaudroger commented 6 years ago

@slandelle just pushed 3.14.1 with some optimisation, seems to deliver close to on par perf with jackson csv for the cookie file. for a more classic csv it still outperform.

slandelle commented 6 years ago

Thanks! I'll give it a try tomorrow, it hasn't been sync'ed on central yet.

for a more classic csv it still outperform

Could you share some samples, please?

arnaudroger commented 6 years ago

I have a benchmark here https://github.com/arnaudroger/mapping-benchmark/tree/master/sfm-csv based on the http://www.maxmind.com/download/worldcities/worldcitiespop.txt.gz The cookie file is tricky because you might be better just doing a readline with a BufferedReader that would actually be a good comparison point.

slandelle commented 6 years ago

I've built my own benchmark class based on JMH (I'll clean it up and share):

[info] Benchmark                           Mode  Cnt     Score    Error  Units
[info] CsvTest.readJackson_cookies         avgt    6   812,681 ± 21,115  ms/op
[info] CsvTest.readJackson_worldcitiespop  avgt    6  1697,252 ± 86,204  ms/op
[info] CsvTest.readSfm_cookies             avgt    6   973,035 ± 66,440  ms/op
[info] CsvTest.readSfm_worldcitiespop      avgt    6  1414,551 ± 63,846  ms/op

The cookie file is tricky because you might be better just doing a readline with a BufferedReader

That possible. Then, except if I realize BufferedReader wins by a large margin, I don't think I'll go with adding a special path in Gatling's code to try to detect such corner case. I'd rather keep things generic, and things must work out of the box for our users (who might not be proficient developers).

arnaudroger commented 6 years ago

Interesting on my P500 I got 669ms for jackson and 679 for sfm iterator. that's on ubuntu 17.10 on a Intel Xeon.

slandelle commented 6 years ago

Here's my benchmark: https://github.com/slandelle/csv-parser-benchmark

arnaudroger commented 6 years ago

thanks, the map to a map might explain the difference what do you expect in the map columnName -> value ?

slandelle commented 6 years ago

the map to a map might explain the difference

Indeed. Jackson returns LinkedHashMaps by default, so I didn't realize the additional cost for sfm. Without the mapTo, sfm wins on worldcitiespop by a wide margin, nice! Then, it still loses on cookies.

[info] Benchmark                 (fileName)  Mode  Cnt     Score     Error  Units
[info] JacksonTest.read         cookies.csv  avgt    5   811,971 ±  32,874  ms/op
[info] JacksonTest.read  worldcitiespop.txt  avgt    5  1881,067 ± 111,855  ms/op
[info] SfmTest.read             cookies.csv  avgt    5   933,828 ±  33,784  ms/op
[info] SfmTest.read      worldcitiespop.txt  avgt    5   983,258 ±  79,383  ms/op

what do you expect in the map columnName -> value ?

Actually, I don't really care, as my API requires me to map to Scala Maps anyway. But then, for the sake of my use case, I now need to include this additional map into my benchmarks.

arnaudroger commented 6 years ago

I translated to the following java

    CsvParser.MapToDSL<Map> DSL;
    ObjectMapper csvMapperToStringArray;
    ObjectReader objectReader;
    @Setup
    public void setUp() {
        csvMapperToStringArray =new CsvMapper().disable(MapperFeature.SORT_PROPERTIES_ALPHABETICALLY);
        CsvSchema schema = CsvSchema.emptySchema().withHeader().withColumnSeparator(',').withQuoteChar('"');
        objectReader = csvMapperToStringArray.readerFor(Map.class).with(schema);
        DSL = CsvParser.dsl().separator(',').quote('"').mapTo(Map.class);
    }

    public Reader getReader() throws FileNotFoundException {
        return new FileReader("cookies.csv");
    }
    @Benchmark
    public void sfmRead(Blackhole blackhole) throws IOException {
        try(Reader reader = getReader()) {
            Iterator<Map> iterator = DSL.iterator(reader);
            while (iterator.hasNext()) {
                blackhole.consume(iterator.next());
            }
        }
    }
    @Benchmark
    public void jacksonRead(Blackhole blackhole) throws IOException {
        try(Reader reader = getReader()) {
            MappingIterator<Map> iterator = objectReader.readValues(reader);
            while (iterator.hasNext()) {
                blackhole.consume(iterator.next());
            }
        }
    }

but I don't get such a differences, maybe it's the scala impacting the inlining... or a proc difference, I run my latest on a i7 mac book pro

PS: I remove the disableUnescaping, as prob not a good idea, as it won't do any post processing if needed to remove quotes and escape char.

slandelle commented 6 years ago

Most likely a proc difference. I ran on a Linux server and results were way better.

arnaudroger commented 6 years ago

makes sense I have seen it not performing as well on my old i5... branch prediction has improved quite a bit. I was thinking, it's not the first time I've seen people mapping rows to map, It probably be worth having a specialised code for that.