Closed Dpetters closed 10 years ago
CsvBadDataException
is only thrown when there are inconsistent field counts. Both of those lines have the same amount of fields.
Since there are so many bad CSV files out there, I found that CsvHelper can't parse using strict RFC 4180 rules. About 50% of the files out there won't be able to be read. I took the stance that CsvHelper should just read everything without troubles. It will read the file just like Excel does. If you open the file in Excel, you will see the exact same results.
There currently is no bad field detection and the parser just reads the file as it sees it. It could keep track of that info and do a callback or throw on a bad field or something. I would have to think of every scenario that is a bad field though...
a"abc"
"abc"
"abc" d
I think those are the only 2 cases. If you can think of more, let me know. Anything weird needs to be in quotes. If there is anything outside the quotes, it's bad. This handles the case of there being quotes in a field that isn't quoted.
As I'm looking at the parser, This probably won't be too difficult to implement.
How do you think it should be done?
I could do both and have configuration options for them. It would more than likely always call the callback. There is no reason to turn that off. You would need to turn on the exception throwing though.
Josh, thank you for your prompt response. I agree with everything you said. The two scenarios you listed seem to cover the cases I had in mind - I'll follow up if I think of any new ones.
As far as I understand the parser already throws exceptions in certain cases which cannot be ignored and you would not want to make it stricter by default. Would you add a new field just to turn on parser exception throwing for these two scenarios (e.g. DetectBadFields, default false)? And would the new callback only be called when either of these two scenarios is violated (e.g. BadFieldCallback, similar to ReadingExceptionCallback)?
Let me know if that's along the lines of what you were thinking or not.
The parser will only throw CsvBadDataException
if you have turned on DetectColumnCountChanges
and a row with a different column count is found. That is the only exception the parser throws. The reader has the possibility of throwing a lot of exceptions for a lot of reasons though.
I would add a config options for something like bool Configuration.ThrowOnBadField = false;
, and Action<BadFieldData> Configuration.BadFieldCallback = data => { };
If you turn on the throw, it will throw and exception when a bad field is detected, with data on the bad field. It will always call the callback and will be a no-op by default.
That sounds great! Will you close this issue with a commit once you're able to get to it?
Yes, I will. I actually did most of the work before you posted that, but got real busy and haven't been able to finish it. I will get to it soon though.
No worries, take your time. Really appreciate it!
I committed a change. dbb36367b6df6e86d7572dcb6c91ff6e9cf1dd3a
This is a part of NuGet package 1.8.0.
Hey Josh, is there any chance that change dbb3636 didn't make it into the 2.8 version of the nuget package? I can't seem to spot the new configuration options.
Yeah, looks like it's missing from the nuget package. I'll fix that tonight.
Thanks a bunch!
Sorry, got busy. I'll try to get to it ASAP though.
Done.
You rock, thanks!
Hey Josh, sorry to keep borthering you, but is there any chance that this line should read if(fieldIsBad && configuration.ThrowOnBadData )
?
Geez, that was dumb. Fixed that issue and fixed the unit test that was covering it. The fix is in 2.8.2 on NuGet. Thanks for finding that.
Hi Josh,
I'm having a little trouble understanding why the reader returns identical values for the following two rows -
"en-us",""mar¡a marsa"","http://lclcuic.hi5.com/friend/p231722756--Maria_Marsa--html","104","Fair","2012-01-10 08:27:00" "en-us",mar¡a marsa"","http://lclcuic.hi5.com/friend/p231722756--Maria_Marsa--html","104","Fair","2012-01-10 08:27:00"
In both cases the value returned for the second column is mar¡a marsa\"\" I would expect the reader to throw the CsvBadDataException for the first row, and for me to have to fix the row to be as follows (have three quotes on both side of the value) to get the desired value of "mar¡a marsa" -
"en-us","""mar¡a marsa""","http://lclcuic.hi5.com/friend/p231722756--Maria_Marsa--html","104","Fair","2012-01-10 08:27:00"
Here are my configs -
config.IgnoreQuotes = false; config.Quote = '"', config.DetectColumnCountChanges = true;
Any tips are greatly appreciated! Thank you in advance!