JoshClose / CsvHelper

Library to help reading and writing CSV files
http://joshclose.github.io/CsvHelper/

Should a quote character terminate a field (that trails whitespace) #656

Closed kjgorman closed 7 years ago

kjgorman commented 7 years ago

Hi, I've got a question about the interpretation of trailing whitespace in quoted fields.

Basically, what we are observing is that, for a single-column "CSV" file containing quoted data, the parsed string includes any trailing whitespace that sits outside the double quotes, up to either EOL or EOF.

My reading of https://tools.ietf.org/html/rfc4180#section-2, and specifically point 2.4, is that "Spaces are considered part of a field and should not be ignored." However, the ABNF grammar has escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE; that is, all the text data in a quoted field should appear within the quote characters.

Here is some code to reproduce (CsvHelper 2.13.2.0):

using System;
using System.IO;
using System.Linq;
using CsvHelper;

namespace CsvTrailingWhitespace
{
    class Program
    {
        static void Main(string[] args)
        {
            var singleColumn = new StringReader(
@"""Key""
""value"" ");
       //^--- one extra space
            var multiColumn = new StringReader(
@"""Left"",""Right""
""value"" ,""value""");
       //^--- one extra space

            using (var single = new CsvReader(singleColumn))
            {
                var records = single.GetRecords<SingleColumn>();

                Console.WriteLine("Read the following string:");
                Console.WriteLine($"<{records.Single().Key}>");
            }

            using (var multi = new CsvReader(multiColumn))
            {
                var record = multi.GetRecords<MultiColumn>().Single();

                Console.WriteLine("Read the following string:");
                Console.WriteLine($"<{record.Left}><{record.Right}>");
            }

            Console.ReadLine();
        }

        public class SingleColumn
        {
            public string Key { get; set; }
        }

        public class MultiColumn
        {
            public string Left { get; set; }
            public string Right { get; set; }
        }
    }
}

This outputs

Read the following string:
<value >
Read the following string:
<value ><value>

Going by the ABNF, I would sort of expect the closing double quote to close the field and hence write <value> in all cases.

Reading the code, it seems that the parser "Reads until the field is not quoted and a delimeter is found", but perhaps it should just be "Reads until the field is not quoted"?
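
To make the strict-grammar reading concrete, here is a minimal sketch of a reader for a single "escaped" field in which the closing quote ends the field. This is not CsvHelper's parser; ReadEscaped and QuotedFieldSketch are hypothetical names made up for illustration, and the sketch assumes the input starts at the opening quote.

```csharp
using System;
using System.Text;

static class QuotedFieldSketch
{
    // Hypothetical helper (not part of CsvHelper): reads one quoted field
    // following escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE.
    // 'rest' receives whatever follows the closing quote, unparsed.
    public static string ReadEscaped(string input, out string rest)
    {
        if (input.Length == 0 || input[0] != '"')
            throw new FormatException("Field does not start with a quote.");

        var field = new StringBuilder();
        var i = 1;
        while (i < input.Length)
        {
            if (input[i] == '"')
            {
                // 2DQUOTE: a doubled quote is a literal quote inside the field.
                if (i + 1 < input.Length && input[i + 1] == '"')
                {
                    field.Append('"');
                    i += 2;
                    continue;
                }

                // A lone quote closes the field; nothing after it is field content.
                rest = input.Substring(i + 1);
                return field.ToString();
            }

            field.Append(input[i]);
            i++;
        }

        throw new FormatException("Unterminated quoted field.");
    }

    static void Main()
    {
        string rest;
        var value = ReadEscaped("\"value\" ", out rest);
        Console.WriteLine($"<{value}> rest=<{rest}>");
    }
}
```

With the input from the single-column example this prints <value> rest=< >, so the trailing space is left over after the field rather than being part of it. What a parser should then do with that leftover space (ignore it, or reject the line) is exactly the open question.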

kjgorman commented 7 years ago

I see this might be a duplicate of https://github.com/JoshClose/CsvHelper/issues/422, although using slightly different examples. I guess the answer is "technically the CSV file is invalid". Maybe I'll just use the trim fields option.

JoshClose commented 7 years ago

It's hard when getting files from other systems. Lots of them write custom stuff and don't realize there is an actual standard.

jamesbascle commented 7 years ago

To be just slightly pedantic - an RFC is just that. 4180 has never been ratified or been made a standard.

That said, I too would advocate for simply using the trim fields option.
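
For reference, a minimal sketch of what that could look like against the single-column example above. It assumes CsvHelper 2.x, where the setting is (as far as I recall) exposed as Configuration.TrimFields; later major versions configure trimming differently, so check the version you are on.

```csharp
using System;
using System.IO;
using System.Linq;
using CsvHelper;

class SingleColumn
{
    public string Key { get; set; }
}

class Program
{
    static void Main()
    {
        // Same data as the single-column repro: one space after the closing quote.
        var input = new StringReader("\"Key\"\n\"value\" ");

        using (var csv = new CsvReader(input))
        {
            // Assumption: CsvHelper 2.x property name; trims whitespace from each field read.
            csv.Configuration.TrimFields = true;

            var record = csv.GetRecords<SingleColumn>().Single();
            Console.WriteLine($"<{record.Key}>");
        }
    }
}
```

Note that trimming applies to every field, so it would also strip leading or trailing whitespace that was deliberately placed inside the quotes.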