evanplaice / jquery-csv

A jQuery CSV parser plugin. Battle Tested | Optimized | 100% IETF RFC 4180 Complete
MIT License
763 stars 441 forks source link

Freaks over over extra blank lines #14

Closed evanplaice closed 9 years ago

evanplaice commented 9 years ago

From Chad.R.B...@gmail.com on October 10, 2012 13:42:48

What steps will reproduce the problem? 1. Pass a CSV file with one or more blank lines into $.csv.toObjects(file) What is the expected output? What do you see instead? I expect it to just ignore the blank lines, or do something with them. At least a blank line at the end. I would expect that an extra blank line at the end of CSV files is a common occurrence.

Specifically, it errors here, at line 365.

for(var j in headers) { object[headers[j]] = entry[j]; }

Uncaught TypeError: Cannot read property '0' of undefined

This is because in the case a blank line, entry is undefined. What version of the product are you using? On what operating system? 0.64 on Windows. Please provide any additional information below. It's up to you to decide what your program should do in this case. I figured it out and I can remove the blank lines from my input, but if one of my users puts a blank line in their input, it will just silently fail. I would prefer that to not happen. I guess alternatively, I could check the input before putting it through your function and removing the blank line myself.

Original issue: http://code.google.com/p/jquery-csv/issues/detail?id=13

evanplaice commented 9 years ago

From evanpla...@gmail.com on October 10, 2012 15:06:51

Could you provide a data sample stub that causes the failure.

From a technical standpoint, even empty lines should contain delimiters to represent the number of columns in the data.

To quote RFC 4180:

  1. Within the header and each record, there may be one or more fields, separated by commas. Each line should contain the same number of fields throughout the file. Spaces are considered part of a field and should not be ignored. The last field in the record must not be followed by a comma. For example:

    aaa,bbb,ccc

Only the last line is allowed to be completely empty in a valid CSV file and the parser should handle it properly if one exists.

So an empty line with 3 columns would still show up as:

,,

Basically, a CSV file must have the same number of columns in every row to be considered valid. Even if the trailing values on each row are consistently empty.

There are two potential solutions to this problem...

First, you could pre-process the line data. A hook called onParseValue() exists to allow users to insert an arbitrary function to do additional processing on CSV values. To give users better fine-tuned control over the parser I was planning on including another called onParseEntry() that does the same but on a full line of data.

With that in place you will be able to insert a function that adds in the correct number of delimiters (ex commas).

The second solution works the same as you have already done. Do a full pass on the data first to remove empty lines. I'll be adding two hooks, onPreParse() onPostParse() to do the same.

There are two schools of thought on this. Those who want to liberate the data by making the parser more versatile and those who think the users should be held responsible for providing 'clean' data.

To see a bad example of the too-liberal approach, just look at pre-HTML 4.01 browsers and 'quirks mode'. Too much variability in the data requires a lot of 'magic' (read 'hacks') to exist below the surface.

CSV already a very liberal format that allows a lot of funky edge cases that need to be covered in the parser. My main overarching goal is to provide a parser that specializes in CSV data and not necessarily anything beyond CSV.

There are a LOT of use cases where users want to have the ability to extend the parser to do more interesting things. I don't want to disregard that request altogether so my approach is to provide hooks. By doing so users have the ability insert custom code inline during the different phases of the parsing process and add their own domain-specific magic where it's needed.

Let me know what you think...

evanplaice commented 9 years ago

From Chad.R.B...@gmail.com on October 10, 2012 16:32:33

You make some good points. I agree that you should stick to the standard, and using anything that isn't standard should be the responsibility of the person using your library. Providing those hooks adds additional convenience for them. I didn't know what the definition of the standard was.

However, here is the issue, Microsoft. The way I make the CSV files in the first place is to use MS Excel, and use save as CSV (Windows/DOS format). I am expecting that the majority of my users would do it this way too. At least with MS Excel 2013 (beta), it automatically puts an empty line, no commas, at the end of the file. Actually, I just checked, it does that in Excel 2010 too. So Microsoft is doing something wrong, typical. If you open this file with a decent text editor, like Notepad++, you will see the blank line.

I didn't realize that what Microsoft is doing was wrong until just now. I can't expect my users to manually trim that line themselves, so I will have to put code to do it myself. I didn't have this problem with version 0.62, so you did something that made this more strict.

If you are feeling generous, what code would you use to take the last line off if it was empty? I am primarily a MATLAB coder, and still a novice at JavaScript.

Attachment: test.csv

evanplaice commented 9 years ago

From evanpla...@gmail.com on October 10, 2012 21:08:26

Owner: evanpla...@gmail.com

evanplaice commented 9 years ago

From evanpla...@gmail.com on October 10, 2012 21:08:41

Status: Accepted

evanplaice commented 9 years ago

From evanpla...@gmail.com on October 10, 2012 21:25:15

An empty line at the end of the file is legal, that's the one exception to the rule.

  1. The last record in the file may or may not have an ending line break. For example:

    aaa,bbb,ccc CRLF zzz,yyy,xxx

Sorry about the confusion...

The RFC can be found here: http://tools.ietf.org/html/rfc4180 If it's still failing, it may be a legitimate bug. I haven't created any tests to validate the functionality of toObjects() so I probably missed a bug when I wrote it.

Give me a little time to update the tests with full toObjects() coverage. As soon as I push it, I'll let you know where to go to run the test runner.

If the last line is the only empty line, I guarantee that you won't need to do any additional pre-processing.

If you want to get rid of empty lines in the middle of the file take a look at: https://code.google.com/p/jquery-csv/wiki/Hooks?ts=1349929348&updated=Hooks The hooks aren't done being implemented yet but once they are you should be able to strip empty lines using the example found in the onPreParse() usage demo.

evanplaice commented 9 years ago

From danielti...@gmail.com on October 12, 2012 06:57:46

I just tested this and the final blank line gets computed to undefined:

eData[0] is =>Del,Trotter,Central Sales Supervisor,Xways,Central Sales,Crossways,,EX,,,,,01234 123 456,,2040,6040,,01234 123 321,,Del Trotter,EX,/o=TITUK/ou=TIT/cn=RECIPIENTS/cn=DEL TROTTER,Del.Trotter,,, eData[1] is =>Del,Trotter,Central Sales Supervisor,Xways,Central Sales,Crossways,,EX,,,,,01234 123 456,,2040,6040,,01234 123 321,,Del Trotter,EX,/o=TITUK/ou=TIT/cn=RECIPIENTS/cn=DEL TROTTER,Del.Trotter,,, eData[2] is =>Del,Trotter,Central Sales Supervisor,Xways,Central Sales,Crossways,,EX,,,,,01234 123 456,,2040,6040,,01234 123 321,,Del Trotter,EX,/o=TITUK/ou=TIT/cn=RECIPIENTS/cn=DEL TROTTER,Del.Trotter,,, eData[3] is =>Del,Trotter,Central Sales Supervisor,Xways,Central Sales,Crossways,,EX,,,,,01234 123 456,,2040,6040,,01234 123 321,,Del Trotter,EX,/o=TITUK/ou=TIT/cn=RECIPIENTS/cn=DEL TROTTER,Del.Trotter,,, eData[4] is =>Del,Trotter,Central Sales Supervisor,Xways,Central Sales,Crossways,,EX,,,,,01234 123 456,,2040,6040,,01234 123 321,,Del Trotter,EX,/o=TITUK/ou=TIT/cn=RECIPIENTS/cn=DEL TROTTER,Del.Trotter,,, eData[5] is =>undefined

evanplaice commented 9 years ago

From danielti...@gmail.com on October 13, 2012 02:40:06

BTW the above was using eData = $.csv.toArrays(data); so its not just toObjects that has the issue

evanplaice commented 9 years ago

From evanpla...@gmail.com on October 15, 2012 15:32:22

This issue was closed by revision 5f9f33616ce2 .

Status: Fixed

evanplaice commented 9 years ago

From evanpla...@gmail.com on October 15, 2012 15:39:00

The fix done and will be incorporated into the 0.65 release

Two new tests were added to the test runner under 'Edge Cases' to check for the problem and the bug was fixed in jquery.csv.js.

If you want to check it, feel free to run the test runner again: http://jquery-csv.googlecode.com/git/test/test.html You can also try it out with your own dataset at the new 'Basic Usage Demo': http://jquery-csv.googlecode.com/git/examples/basic-usage.html Thank you for your help in getting this fixed. Hopefully, that will rid you of the need for any obnoxious hacks/workarounds in the future.

evanplaice commented 9 years ago

From Chad.R.B...@gmail.com on October 15, 2012 20:54:34

Yea I like this, I don't need to worry about the blank line, or use a length - 1 hack. I also like where in your test you say reticulating splines. I guess you are a The Sims fan.

evanplaice commented 9 years ago

From evanpla...@gmail.com on October 15, 2012 23:19:25

Yeah, the hack you described is basically how I handle it, except the check is done inline.

Actually, Sim City 2000 (I'm old school). I'm glad you liked it. I created the demo to help lower the barrier for new users but it doubles for testing datasets too.

It's still very primitive but if take a look at the file-handling demo, you can open csv files directly from the desktop (using the HTML 5 File API) too.

Here's the link if you'd like to try it out: http://jquery-csv.googlecode.com/git/examples/file-handling.html That should be all for this bug. Unless you have more to add, I'm going to close it out.

Status: Verified

evanplaice commented 9 years ago

From danielti...@gmail.com on October 16, 2012 01:49:16

I tested using my CSV in your file-handling.html and it was handled perfectly, I also saved off the javascript file and used it in my above test and that also was handled perfectly, so I agree problem fixed