JoshClose / CsvHelper

Library to help reading and writing CSV files
http://joshclose.github.io/CsvHelper/

Unexpected behaviour when using multiple character newline #2216

Open OMAlexB opened 6 months ago

OMAlexB commented 6 months ago

We have some files that use |*| as the delimiter and |##|\r\n as the newline. They come from a third-party application and we unfortunately cannot change the format. We have found three issues while trying to parse these files using CsvHelper.

1) When the first character of a row is the same as the first character of the newline sequence, the row is treated as a blank line. For example, with |##|\r\n as the newline, |a,b,c|##|\r\n is treated as an empty row. I expect this to be treated as a full valid row |a,b,c|##|\r\n, without needing to be wrapped in quotes, because the field does not contain the full newline sequence. RFC 4180 obviously doesn't provide for custom newline sequences, but if we are just replacing CRLF with |##|\r\n, I wouldn't expect to have to quote a field that does not contain the entire newline. In our particular case we don't control the file, so we can't wrap these fields in quotes anyway.

2) Similarly, when the first character of the newline sequence appears inside a record, the row is cut off at that point, even though the record does not contain the entire newline. E.g. a|,b,c|##|\r\n is treated as two separate rows, with the raw record for row 1 being a| and the raw record for row 2 being ,b,c|##|\r\n. The same reasoning as above applies: this should parse normally without the need for quotes.

3) When the delimiter and the newline sequence begin with the same character, the parser treats every occurrence of the newline as a delimiter, because it only checks the first character (and checks the delimiter first). E.g.

a|*|b|*|c|##|\r\n
d|*|e|*|f|##|\r\n

is treated as a single row with 5 fields in it. I expect this to be parsed as 2 separate rows, even though the delimiter and newline sequences are unusual.

We are looking at implementing fixes for these ourselves by peeking at the next few characters to confirm that the entire delimiter or newline is present in these circumstances, and we hope to open a PR to merge them back here.
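To make the expected behaviour concrete, here is a minimal Python sketch of the idea: compare against the whole delimiter or newline sequence, not just its first character, and test the newline before the delimiter. This is not CsvHelper code. The function name `split_rows` is hypothetical, and the sketch ignores quoting, escaping and buffering entirely.

```python
def split_rows(text, delimiter, newline):
    """Split text into rows of fields, matching whole separator sequences.

    Illustrative only: no quote or escape handling, and the whole input is
    held in memory, so "peeking" is just a look ahead in the string.
    """
    rows, fields, field = [], [], []
    i = 0
    while i < len(text):
        # Compare the full newline sequence first, so a lone "|" or a
        # delimiter sharing the same first character is not mistaken for it.
        if text.startswith(newline, i):
            fields.append("".join(field))
            rows.append(fields)
            fields, field = [], []
            i += len(newline)
        elif text.startswith(delimiter, i):
            fields.append("".join(field))
            field = []
            i += len(delimiter)
        else:
            # Ordinary character, including a partial separator match.
            field.append(text[i])
            i += 1
    # Flush a final row that is not terminated by the newline sequence.
    if field or fields:
        fields.append("".join(field))
        rows.append(fields)
    return rows
```

Under these assumptions, the example from issue 3 splits into two rows of three fields, and the rows from issues 1 and 2 keep their stray | characters as field content.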

JoshClose commented 6 months ago

Peeking could be a problem due to the buffer. There is currently no peeking in the parser.

I think there is an issue when the first char of the delimiter and newline are the same. I also see an issue with blank lines and custom newlines.

For now, could you possibly run a replace on the file first, replacing |##|\r\n with \r\n? That's not ideal, but it may work for the time being.
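For large files, a replace like this does not have to load the whole file into memory. The sketch below is one possible way to do it as a streaming pre-processing step, shown in Python purely for illustration. The function name `replace_stream` and the chunk size are assumptions. It holds back the last few characters of each chunk so that a newline sequence split across two reads is still found.

```python
def replace_stream(src, dst, old, new, chunk_size=64 * 1024):
    """Copy text from src to dst, replacing each occurrence of old with new.

    src and dst are text-mode file objects. Only unreplaced input is held
    back between reads, so already-written output is never re-scanned.
    """
    if not old:
        raise ValueError("old must not be empty")
    keep = len(old) - 1  # longest partial match that can straddle two chunks
    pending = ""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        buf = pending + chunk
        pos = 0
        while True:
            idx = buf.find(old, pos)
            if idx == -1:
                break
            dst.write(buf[pos:idx])
            dst.write(new)
            pos = idx + len(old)
        rest = buf[pos:]
        # rest contains no full match; its last `keep` characters may be the
        # start of one that completes in the next chunk, so hold them back.
        safe = len(rest) - keep
        if safe > 0:
            dst.write(rest[:safe])
            pending = rest[safe:]
        else:
            pending = rest
    dst.write(pending)
```

When using this with real files, both should be opened with `newline=""` so that Python does not translate the \r\n characters itself.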

OMAlexB commented 4 months ago

Oops, I didn't see your comment. I have opened a PR that adds support for peeking with the buffer, along with these fixes. It wasn't really feasible to run the full replace, as the files are pretty big and are provided to us by a third party.