Add support for multi-line string fields

chris0x44 commented 4 years ago

Sorry about the additional edits (trailing whitespaces) that make this a bit hard to read.

ben-strasser commented 4 years ago

Thanks for the patch. However, I wont merge. There are number of reasons. I only list the two most important ones.

I did a quick benchmark on the LineReader. The main code is:

int main(){
        {
                std::ofstream out("testfile");
                for(int i=0; i<100000000; ++i){
                        out << i;
                        for(int j=0; j<10; ++j)
                                out << ',' << j;
                        out << '\n';
                }
        }
        {
                long long timer = -get_micro_time();
                int n=0;
                old_::LineReader reader("testfile");
                while(char*c = reader.next_line()){
                        ++n;
                }
                timer += get_micro_time();
                std::cout << n << " " <<  timer << std::endl;
        }
        {
                long long timer = -get_micro_time();
                int n=0;
                new_::LineReader reader("testfile");
                while(char*c = reader.next_line()){
                        ++n;
                }
                timer += get_micro_time();
                std::cout << n << " " <<  timer << std::endl;
        }
}

I test a large CSV file with only numbers and no string at all. This is the main type of CSV files that this parser was written for. Speed is one of the most important metrics here. The result is:

100000000 54035496 100000000 139589415

This means that applications that gain nothing from changes with quoted strings, as they do not use them, pay a with a 2x running time slow down. This is not acceptable.

The second problem is that your patch does not support the customization points supported by the current code. For example, in the current code, you can vastly customize the way quotes are used. single-quote-string with semi-colon separators? No problem:

CSVReader<4, trim_chars<' '>, double_quote_escape<';','\''> >

This no longer works with your code.

chris0x44 commented 4 years ago

I see the problem. Would you be interested in a version that takes this into account?

Is my understanding correct that you "replace" the double quote escape sequence with the character provided in double_quote_escape so that multiline rules would then be used with this character?

ben-strasser commented 4 years ago

Most users do not have strings at all. They have CSV files with numbers only. Some users have strings which are identifiers and do not need any quoting rules at all. They use something along the lines:

CSVReader<3, trim_chars<' '>, no_quote_escape<'\t'>>

Few have string which can contain the CSV-separator char. These users have to use something along the lines of:

CSVReader<4, trim_chars<' '>, double_quote_escape<',','\"'> >

Very few users have the additional requirement that string can contain non-escaped newlines.

Using double_quote_escape instead of no_quote_escape has a significant impact on running time performance. Do not use double_quote_escape unless you have to! Allowing for non-escaped newlines has an additional huge impact as my previous post shows.

It is unacceptable to expect any breaking interface change or running time penalty for the many to make the very few happy.

The library is not limited to no_quote_escape and double_quote_escape. These are policy classes that allow significantly more flexibility. They have the followings structure:

struct my_own_quoting_rules{
                static const char*find_next_column_end(const char*col_begin){

                }

                static void unescape(char*&, char*&){

                }
};

You can use this to get \"-style or "-style escaping or something very funky. Originally, I planned on providing prewritten policies for these cases, but I never got round to do it. I do not expect many people to use this feature. However, it is there and removing it would be a breaking change.

In this design, there is no "double quote escape sequence" concept. Some policies, such as double_quote_escape, have such a concept but not all do. For example, no_quote_escape does not. Nor does CSVReader or LineReader.

The design is very flexible with one major restriction: The newline character must not appear in a line. This design choice was done for three reasons:

The loop in LineReader is really hot. I want to avoid providing customization points here to make it harder for people to shoot themselves into the foot with respect to running time performance.
Handling \n vs \r\n transparently is a huge pain. Getting this right is really difficult. I am certain that there are people that want a transparent translation of \r\n to \n and other people explicitly do not want that. This means that the "very few user" camp is splintered into various subcamps that all expect their particular usecase to just work. Handling this in a sightly different way than expected by the user is an open invitation for subtle bugs. I prefer not supporting it at all.
Having simple line end rules makes producing understandable error messages easier. A missing quote cannot cause a runaway error that spans multiple lines. Having an even number of runaway-missing-quote errors might even lead to a CSV file that seems to be correctly parsed but isn't. This violates the fail-early principle.

I do not see how all these requirements can be made to work together. I especially do not see how it can be made to work with the current library design. That does not mean that I think that it is impossible.

What I also do not understand is people that write CSV-encoders that can successfully replace " by "" but fail at replacing \n with some other escape sequence. If they can be bothered to do one why can't they do both?

It is my firm opinion, that having CSV files with newline characters is begging for trouble. The only reasonable option is to fix the CSV file.

chris0x44 commented 4 years ago

Yes, the performance hit was an oversight on my side. Sorry for that. By the way if you're looking for a speed up of the hot loop in LineReader, I'll send a pull request that should improve loop time by ~25%.

Regarding multiline: Not trying to convince you, just some context. ;) I originally assumed that the LineReader was supposed to read lines in way compliant to RFC 4180 and the policies were to decide how to split things up. Granted this introduces policy detail into the LineReader, it's understandable to reject this from a design perspective. Especially since it's not needed for most policies. Perhaps an entry in the FAQ, that multiline is not supported, would be helpful. Trying to read files containing line breaks within fields will usually give a too_many_columns error.

Sidenote: To your point of escaping double quotes but not newlines, if people were writing the CSV themselves I would agree. But for example in my case, I have to live with CSV exported from Excel, which encodes newlines in cells by enclosing the field in double quotes.

ben-strasser commented 4 years ago

Hi, thanks for the patch anyway. :)

I just noticed, that I forgot to use -O3 in my last benchmark, so the numbers are meaningless. However, the other comments still stand.

I'm currently benchmarking your other PR. I'll post my numbers there.

I know that RFC 4180 says that plain newlines in quoted strings are ok. However, there are so many CSVs out there that ignore it that I honestly do not care too much about it. The no-newline-in-quoted-string issue comes up every half year or so. There are so many variants that people want that are outside of RFC 4180. For example, one person wanted newlines to work in non-quoted strings.

Sidenote with respect to Excel: Apparently it (or at least some versions) only handles newlines in quoted strings "correctly" when not combined with UTF-8: https://stackoverflow.com/questions/1241220/generating-csv-file-for-excel-how-to-have-a-newline-inside-a-value

ben-strasser / fast-cpp-csv-parser

Add support for multi-line string fields #87