CSV import is too basic

GoogleCodeExporter commented 9 years ago

Consider this toy CSV file:

name,description,yearOfBirth
Mary II,"Mary II was Queen regnant of England, Scotland, and Ireland from 
1689 until her death.", 1662
Napoleon Bonaparte,"Napoleon I.
He was a military and political leader of France and Emperor of the French 
as Napoleon I, whose actions shaped European politics in the early 19th 
century.",1769

The commas in Mary's description are 'escaped' by enclosing the field in 
double quotes ("). The same is done for the comma and the line break in 
Napoleon's description. This is pretty common for real-life data.

So two data rows should be detected (one including a line break). Instead, 
three rows are created on import. Not too smart, considering that such 
'extended' escaping is very common, e.g. in the exporters of spreadsheet 
software and in the corresponding clipboard formats.
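
For comparison, here is a minimal sketch of the expected result using Python's standard csv module (this is not the Gridworks importer). It assumes that the only real line break inside a field is the one after "Napoleon I." and that the other breaks in the sample above are just wrapping by the issue tracker.

```python
import csv
import io

# The toy file from this report. Assumption: only the break after
# "Napoleon I." is part of the data; other wraps are display artifacts.
TOY_CSV = (
    'name,description,yearOfBirth\n'
    'Mary II,"Mary II was Queen regnant of England, Scotland, and Ireland from '
    '1689 until her death.", 1662\n'
    'Napoleon Bonaparte,"Napoleon I.\n'
    'He was a military and political leader of France and Emperor of the French '
    'as Napoleon I, whose actions shaped European politics in the early 19th '
    'century.",1769\n'
)

# newline="" leaves line endings untouched so the csv module can keep
# line breaks that occur inside quoted fields.
rows = list(csv.reader(io.StringIO(TOY_CSV, newline="")))
header, records = rows[0], rows[1:]

print(len(records))               # number of data rows
print(repr(records[1][1][:20]))   # start of Napoleon's description
```

A quote-aware reader yields two data rows here, and the second description still contains its line break.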

There is no way to import the file correctly (or to choose a parsing mode) 
in version 1.0-r667, running on Windows XP.

Original issue reported on code.google.com by eferonline on 11 May 2010 at 1:09

GoogleCodeExporter commented 9 years ago

Original comment by dfhu...@gmail.com on 11 May 2010 at 7:07

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

Fixed in r717. eferonline, would you be able to check out the code and verify 
the fix?

Original comment by dfhu...@gmail.com on 12 May 2010 at 6:10

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

Thanks for the quick reaction!

I checked out and built the new revision. It works better now, giving me the 
correct number of records. The line break in the description field, however, 
seems to be gone. (I can add it manually in the Gridworks editor via the 
clipboard, so it seems technically possible to have line breaks in fields.) 
This should be fixed, too.

As for the source code change: I have to admit, shamefully, that I don't 
really understand the importer code or which change exactly did the trick for 
this issue. I would have expected a kind of finite state automaton or 
something similar to manage the parser "modes", but could not find an 
equivalent in the sources. Unfortunately I'm a bit short on time to review the 
code in detail.
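
To illustrate the kind of automaton meant here, a minimal sketch with two states (inside quotes, outside quotes) follows. The function name `parse_delimited` and its parameters are hypothetical and chosen for illustration only; this is not the code from r717.

```python
def parse_delimited(text, delimiter=",", quote='"'):
    """Split text into records using a two-state automaton (illustrative only)."""
    records, record, field = [], [], []
    in_quotes = False                      # the current parser "mode"
    i = 0
    while i < len(text):
        ch = text[i]
        if in_quotes:
            if ch == quote:
                if text[i + 1:i + 2] == quote:
                    field.append(quote)    # doubled quote is a literal quote
                    i += 1
                else:
                    in_quotes = False      # closing quote: leave quoted mode
            else:
                field.append(ch)           # delimiters and line breaks are data
        elif ch == quote:
            in_quotes = True               # opening quote: enter quoted mode
        elif ch == delimiter:
            record.append("".join(field))  # end of field
            field = []
        elif ch == "\n":
            record.append("".join(field))  # end of field and of record
            records.append(record)
            record, field = [], []
        elif ch != "\r":
            field.append(ch)               # ordinary character (bare \r dropped)
        i += 1
    if field or record:                    # last record without trailing newline
        record.append("".join(field))
        records.append(record)
    return records


print(parse_delimited('a,"b, with comma\nand break",c\n'))
```

Because the delimiter is a parameter, the same automaton covers tab-separated input by passing `delimiter="\t"`.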

Original comment by eferonline on 12 May 2010 at 8:27


GoogleCodeExporter commented 9 years ago

PS: The fix only solves the problem if the separator characters are commas. 
For tabs, the old behavior still occurs.

Original comment by eferonline on 12 May 2010 at 8:38


GoogleCodeExporter commented 9 years ago

Fixed for TSV as well by r790. Please verify.

Original comment by dfhu...@gmail.com on 17 May 2010 at 5:57


GoogleCodeExporter commented 9 years ago

It's nearly fixed now. But if one line break in a "-escaped area comes 
directly after another (which means there is a blank line before the text 
continues), the record is still split. It should be possible to have an 
unlimited number of line breaks in the field value before the quoted section 
ends and the next field is processed.
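
The case can be reproduced with a small sample. The sketch below uses Python's standard csv module only as a reference for the expected behavior; the sample data and the helper name `read_records` are made up for illustration.

```python
import csv
import io


def read_records(text, delimiter):
    """Parse delimited text with the standard csv reader (hypothetical helper)."""
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))


# A quoted field holding two consecutive line breaks, i.e. one blank line.
comma_sample = 'id,note\n1,"first line\n\nthird line"\n'
tab_sample = comma_sample.replace(",", "\t")

comma_rows = read_records(comma_sample, ",")
tab_rows = read_records(tab_sample, "\t")

print(len(comma_rows), len(tab_rows))
```

Both variants should give a header plus a single data row, with the blank line kept inside the field value.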

Original comment by eferonline on 17 May 2010 at 6:54


GoogleCodeExporter commented 9 years ago

I've added a unit test for this multiple blank line case in r794. The test currently fails.

Original comment by iainsproat on 17 May 2010 at 7:04

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

Should be fixed in r797.  Please verify.

Original comment by iainsproat on 17 May 2010 at 12:01

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

Verified. It works now as expected. Great!

Original comment by eferonline on 17 May 2010 at 12:14


GoogleCodeExporter commented 9 years ago

Is there any way I can work around this problem without downloading and 
building Google Refine from source? Can I convert the input file to another 
format or escape characters differently?

Original comment by andreas....@gmail.com on 20 May 2011 at 8:23


GoogleCodeExporter commented 9 years ago

Why not just use a text editor and do a find/replace of the double quote 
character " with something like a triple caret ^^^? Import without the 
splitting option or quote character option. Then, once it's in Google Refine, 
perform your splits manually with GREL or Add column against the commas and 
^^^. Would that work?

Original comment by thadguidry on 20 May 2011 at 1:19


GoogleCodeExporter commented 9 years ago

That workaround would help for some cases with embedded tabs and commas, but 
not for line breaks, I suspect.

Original comment by tfmorris on 20 May 2011 at 5:00


GoogleCodeExporter commented 9 years ago

Original comment by tfmorris on 18 Sep 2012 at 2:21

Added labels: Milestone-2.0

GoogleCodeExporter commented 9 years ago

Original comment by tfmorris on 18 Sep 2012 at 2:52

Added labels: Milestone-1.0
Removed labels: Milestone-2.0
