JoshClose / CsvHelper

Library to help reading and writing CSV files
http://joshclose.github.io/CsvHelper/
Other
4.75k stars 1.06k forks source link

Two double quotes at beginning of field possibly ignored #305

Closed Dpetters closed 10 years ago

Dpetters commented 10 years ago

Hi Josh,

I'm having a little trouble understanding why the reader returns identical values for the following two rows -

"en-us",""mar¡a marsa"","http://lclcuic.hi5.com/friend/p231722756--Maria_Marsa--html","104","Fair","2012-01-10 08:27:00" "en-us",mar¡a marsa"","http://lclcuic.hi5.com/friend/p231722756--Maria_Marsa--html","104","Fair","2012-01-10 08:27:00"

In both cases the value returned for the second column is mar¡a marsa\"\" I would expect the reader to throw the CsvBadDataException for the first row, and for me to have to fix the row to be as follows (have three quotes on both side of the value) to get the desired value of "mar¡a marsa" -

"en-us","""mar¡a marsa""","http://lclcuic.hi5.com/friend/p231722756--Maria_Marsa--html","104","Fair","2012-01-10 08:27:00"

Here are my configs -

config.IgnoreQuotes = false; config.Quote = '"', config.DetectColumnCountChanges = true;

Any tips are greatly appreciated! Thank you in advance!

JoshClose commented 10 years ago

CsvBadDataException is only thrown when there are inconsistent field counts. Both of those lines have the same amount of fields.

Since there are so many bad CSV files out there, I found that CsvHelper can't parse using strict RFC 4180 rules. About 50% of the files out there won't be able to be read. I took the stance that CsvHelper should just read everything without troubles. It will read the file just like Excel does. If you open the file in Excel, you will see the exact same results.

There currently is no bad field detection and the parser just reads the file as it sees it. It could keep track of that info and do a callback or throw on a bad field or something. I would have to think of every scenario that is a bad field though...

  1. If any char comes before the first quote. a"abc" "abc"
  2. If any char comes after the last quote. "abc" d

I think those are the only 2 cases. If you can think of more, let me know. Anything weird needs to be in quotes. If there is anything outside the quotes, it's bad. This handles the case of there being quotes in a field that isn't quoted.

As I'm looking at the parser, This probably won't be too difficult to implement.

How do you think it should be done?

  1. Throw an exception.
  2. Call a callback method or event so that you can log them, instead of it blowing up.

I could do both and have configuration options for them. It would more than likely always call the callback. There is no reason to turn that off. You would need to turn on the exception throwing though.

Dpetters commented 10 years ago

Josh, thank you for your prompt response. I agree with everything you said. The two scenarios you listed seem to cover the cases I had in mind - I'll follow up if I think of any new ones.

As far as I understand the parser already throws exceptions in certain cases which cannot be ignored and you would not want to make it stricter by default. Would you add a new field just to turn on parser exception throwing for these two scenarios (e.g. DetectBadFields, default false)? And would the new callback only be called when either of these two scenarios is violated (e.g. BadFieldCallback, similar to ReadingExceptionCallback)?

Let me know if that's along the lines of what you were thinking or not.

JoshClose commented 10 years ago

The parser will only throw CsvBadDataException if you have turned on DetectColumnCountChanges and a row with a different column count is found. That is the only exception the parser throws. The reader has the possibility of throwing a lot of exceptions for a lot of reasons though.

I would add a config options for something like bool Configuration.ThrowOnBadField = false;, and Action<BadFieldData> Configuration.BadFieldCallback = data => { };

If you turn on the throw, it will throw and exception when a bad field is detected, with data on the bad field. It will always call the callback and will be a no-op by default.

Dpetters commented 10 years ago

That sounds great! Will you close this issue with a commit once you're able to get to it?

JoshClose commented 10 years ago

Yes, I will. I actually did most of the work before you posted that, but got real busy and haven't been able to finish it. I will get to it soon though.

Dpetters commented 10 years ago

No worries, take your time. Really appreciate it!

JoshClose commented 10 years ago

I committed a change. dbb36367b6df6e86d7572dcb6c91ff6e9cf1dd3a

This is a part of NuGet package 1.8.0.

Dpetters commented 10 years ago

Hey Josh, is there any chance that change dbb3636 didn't make it into the 2.8 version of the nuget package? I can't seem to spot the new configuration options.

JoshClose commented 10 years ago

https://github.com/JoshClose/CsvHelper/commit/dbb36367b6df6e86d7572dcb6c91ff6e9cf1dd3a#diff-37d6e5d0da06f0f98d5973fc156f9b51R390

https://github.com/JoshClose/CsvHelper/commit/dbb36367b6df6e86d7572dcb6c91ff6e9cf1dd3a#diff-37d6e5d0da06f0f98d5973fc156f9b51R396

JoshClose commented 10 years ago

Yeah, looks like it's missing from the nuget package. I'll fix that tonight.

Dpetters commented 10 years ago

Thanks a bunch!

JoshClose commented 10 years ago

Sorry, got busy. I'll try to get to it ASAP though.

JoshClose commented 10 years ago

Done.

Dpetters commented 10 years ago

You rock, thanks!

Dpetters commented 10 years ago

Hey Josh, sorry to keep borthering you, but is there any chance that this line should read if(fieldIsBad && configuration.ThrowOnBadData )?

JoshClose commented 10 years ago

Geez, that was dumb. Fixed that issue and fixed the unit test that was covering it. The fix is in 2.8.2 on NuGet. Thanks for finding that.