ben-strasser / fast-cpp-csv-parser

fast-cpp-csv-parser
BSD 3-Clause "New" or "Revised" License
2.11k stars 440 forks source link

Add support for multi-line string fields #87

Closed chris0x44 closed 4 years ago

chris0x44 commented 4 years ago

Sorry about the additional edits (trailing whitespaces) that make this a bit hard to read.

ben-strasser commented 4 years ago

Thanks for the patch. However, I wont merge. There are number of reasons. I only list the two most important ones.

I did a quick benchmark on the LineReader. The main code is:

int main(){
        {
                std::ofstream out("testfile");
                for(int i=0; i<100000000; ++i){
                        out << i;
                        for(int j=0; j<10; ++j)
                                out << ',' << j;
                        out << '\n';
                }
        }
        {
                long long timer = -get_micro_time();
                int n=0;
                old_::LineReader reader("testfile");
                while(char*c = reader.next_line()){
                        ++n;
                }
                timer += get_micro_time();
                std::cout << n << " " <<  timer << std::endl;
        }
        {
                long long timer = -get_micro_time();
                int n=0;
                new_::LineReader reader("testfile");
                while(char*c = reader.next_line()){
                        ++n;
                }
                timer += get_micro_time();
                std::cout << n << " " <<  timer << std::endl;
        }
}

I test a large CSV file with only numbers and no string at all. This is the main type of CSV files that this parser was written for. Speed is one of the most important metrics here. The result is:

100000000 54035496 100000000 139589415

This means that applications that gain nothing from changes with quoted strings, as they do not use them, pay a with a 2x running time slow down. This is not acceptable.

The second problem is that your patch does not support the customization points supported by the current code. For example, in the current code, you can vastly customize the way quotes are used. single-quote-string with semi-colon separators? No problem:

CSVReader<4, trim_chars<' '>, double_quote_escape<';','\''> >

This no longer works with your code.

chris0x44 commented 4 years ago

I see the problem. Would you be interested in a version that takes this into account?

Is my understanding correct that you "replace" the double quote escape sequence with the character provided in double_quote_escape so that multiline rules would then be used with this character?

ben-strasser commented 4 years ago

Most users do not have strings at all. They have CSV files with numbers only. Some users have strings which are identifiers and do not need any quoting rules at all. They use something along the lines:

CSVReader<3, trim_chars<' '>, no_quote_escape<'\t'>>

Few have string which can contain the CSV-separator char. These users have to use something along the lines of:

CSVReader<4, trim_chars<' '>, double_quote_escape<',','\"'> >

Very few users have the additional requirement that string can contain non-escaped newlines.

Using double_quote_escape instead of no_quote_escape has a significant impact on running time performance. Do not use double_quote_escape unless you have to! Allowing for non-escaped newlines has an additional huge impact as my previous post shows.

It is unacceptable to expect any breaking interface change or running time penalty for the many to make the very few happy.

The library is not limited to no_quote_escape and double_quote_escape. These are policy classes that allow significantly more flexibility. They have the followings structure:

struct my_own_quoting_rules{
                static const char*find_next_column_end(const char*col_begin){

                }

                static void unescape(char*&, char*&){

                }
};

You can use this to get \"-style or &quot;-style escaping or something very funky. Originally, I planned on providing prewritten policies for these cases, but I never got round to do it. I do not expect many people to use this feature. However, it is there and removing it would be a breaking change.

In this design, there is no "double quote escape sequence" concept. Some policies, such as double_quote_escape, have such a concept but not all do. For example, no_quote_escape does not. Nor does CSVReader or LineReader.

The design is very flexible with one major restriction: The newline character must not appear in a line. This design choice was done for three reasons:

I do not see how all these requirements can be made to work together. I especially do not see how it can be made to work with the current library design. That does not mean that I think that it is impossible.

What I also do not understand is people that write CSV-encoders that can successfully replace " by "" but fail at replacing \n with some other escape sequence. If they can be bothered to do one why can't they do both?

It is my firm opinion, that having CSV files with newline characters is begging for trouble. The only reasonable option is to fix the CSV file.

chris0x44 commented 4 years ago

Yes, the performance hit was an oversight on my side. Sorry for that. By the way if you're looking for a speed up of the hot loop in LineReader, I'll send a pull request that should improve loop time by ~25%.

Regarding multiline: Not trying to convince you, just some context. ;) I originally assumed that the LineReader was supposed to read lines in way compliant to RFC 4180 and the policies were to decide how to split things up. Granted this introduces policy detail into the LineReader, it's understandable to reject this from a design perspective. Especially since it's not needed for most policies. Perhaps an entry in the FAQ, that multiline is not supported, would be helpful. Trying to read files containing line breaks within fields will usually give a too_many_columns error.

Sidenote: To your point of escaping double quotes but not newlines, if people were writing the CSV themselves I would agree. But for example in my case, I have to live with CSV exported from Excel, which encodes newlines in cells by enclosing the field in double quotes.

ben-strasser commented 4 years ago

Hi, thanks for the patch anyway. :)

I just noticed, that I forgot to use -O3 in my last benchmark, so the numbers are meaningless. However, the other comments still stand.

I'm currently benchmarking your other PR. I'll post my numbers there.

I know that RFC 4180 says that plain newlines in quoted strings are ok. However, there are so many CSVs out there that ignore it that I honestly do not care too much about it. The no-newline-in-quoted-string issue comes up every half year or so. There are so many variants that people want that are outside of RFC 4180. For example, one person wanted newlines to work in non-quoted strings.

Sidenote with respect to Excel: Apparently it (or at least some versions) only handles newlines in quoted strings "correctly" when not combined with UTF-8: https://stackoverflow.com/questions/1241220/generating-csv-file-for-excel-how-to-have-a-newline-inside-a-value