Smrtovrisk / google-refine

Automatically exported from code.google.com/p/google-refine

New Importer does not accept special separator characters completely such as Unicode chars #475

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Create a Refine project with the attached UTF-8 encoded text file.
2. In the preview, try to set the separator character to "Start of Heading" 
(SOH), typically written in JavaScript as \u0001 or \x01, or to other escapes 
such as \\, \' or \".
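
For anyone reproducing this without the original attachment, a file of the same shape can be generated with a short script. This is only a minimal Python sketch: the file name, column names, and rows are invented for illustration and are not the contents of the attached file.

```python
import os
import tempfile

SOH = "\u0001"  # "Start of Heading" control character, code point U+0001

# Hypothetical sample rows; the real attachment is not reproduced here.
rows = [
    ["id", "name", "city"],
    ["1", "Ada", "London"],
    ["2", "Émile", "Paris"],
]

def write_soh_file(path, rows):
    """Write rows as UTF-8 text with fields separated by SOH."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for row in rows:
            f.write(SOH.join(row) + "\n")

# Write to a temporary directory so the sketch leaves the working tree alone.
path = os.path.join(tempfile.mkdtemp(), "soh-sample.txt")
write_soh_file(path, rows)

# Read the header line back to confirm the separator survived the round trip.
with open(path, encoding="utf-8") as f:
    first_line = f.readline().rstrip("\n")
```

Importing the resulting file into Refine should exercise the same separator field described in step 2.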

What is the expected output? What do you see instead?

Refine does not split on any backslash-escaped special character, such as those listed here: 
http://www.c-point.com/javascript_tutorial/special_characters.htm

I also tried unchecking "Quotation marks are used..." and "Parse cell text into 
numbers, date...", but the preview still showed no additional columns for any 
of the special separator characters I tried.

Original issue reported on code.google.com by thadguidry on 4 Nov 2011 at 3:11

GoogleCodeExporter commented 9 years ago
Update: I imported the attached test file as-is, then used Edit Column -> 
Split into several columns.  I chose Split by Separator with Regular Expression 
checked, entered the Unicode code point \u0001 (see 
http://www.regular-expressions.info/unicode.html), and clicked OK.  
That worked.
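
The workaround amounts to splitting each cell on a regular expression that names the code point. A rough Python equivalent is sketched below for illustration only: Refine itself is written in Java, and the sample line is made up.

```python
import re

# A made-up line whose fields are separated by SOH (U+0001).
line = "id\u0001name\u0001city"

# Raw string: the regex engine, not the Python parser, interprets \u0001,
# much as a "Regular Expression" option hands the typed text to a regex engine.
separator_pattern = re.compile(r"\u0001")

fields = separator_pattern.split(line)
print(fields)  # ['id', 'name', 'city']
```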

However, I would still like us to support any valid Unicode code point as a 
separator character during the importer preview stage.

Original comment by thadguidry on 4 Nov 2011 at 3:49

GoogleCodeExporter commented 9 years ago
I am offering a $100 bounty, via PayPal, to anyone who fixes this and adds 
support for any valid Unicode code point as a separator character during the 
initial importer preview stage.

Original comment by thadguidry on 4 Nov 2011 at 3:53

GoogleCodeExporter commented 9 years ago
Does cutting the character from someplace like a Character Map utility and 
pasting it into the field work?

It sounds like what you're really after is the ability to use some type of 
quoting/escaping notation for your separator characters.

Original comment by tfmorris on 4 Nov 2011 at 4:00

GoogleCodeExporter commented 9 years ago
No, cutting and pasting the character did not work for me on Ubuntu or on 
Windows 7.  I even tried Alt + numeric keypad 0 1 in the separator input box on 
the importer preview for CSV/TSV/separator files, with no luck.  Yes, agreed: 
ideally we should support backslash escaping during the initial importer 
preview.  That would ease data entry over VNC connections to Refine instances, 
since a user on Windows remoting into Refine running on Ubuntu would no longer 
have to send special keyboard command/control characters.

Original comment by thadguidry on 4 Nov 2011 at 4:12

GoogleCodeExporter commented 9 years ago

Original comment by tfmorris on 4 Nov 2011 at 4:50

GoogleCodeExporter commented 9 years ago
FYI Tom, since you've started (thanks!): David says you'll need to do proper 
JavaScript unescaping for this.

Original comment by thadguidry on 4 Nov 2011 at 4:56

GoogleCodeExporter commented 9 years ago
Fixed in r2355.  Escaping and unescaping are now done server-side, with escaped 
strings going over the wire instead of raw characters.  The escaping syntax 
used is Java's, but it's easy to switch to JavaScript's if people prefer that.  
For the vast majority of cases, the two are probably identical.
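
To illustrate the idea, here is a minimal Python sketch of unescaping a separator string that arrives as typed text. It is not the code from r2355: the function name `unescape_separator` and the set of supported escapes are assumptions, covering only the forms mentioned in this thread. Note that \xXX is a JavaScript form that Java's string syntax does not have, which is one of the few places the two differ.

```python
import re

# Simple one-character escapes shared by Java and JavaScript string syntax.
_SIMPLE = {"t": "\t", "n": "\n", "r": "\r", "\\": "\\", "'": "'", '"': '"'}

# \uXXXX (Java and JavaScript), \xXX (JavaScript only), or backslash + one char.
_ESCAPE = re.compile(r"\\(?:u([0-9a-fA-F]{4})|x([0-9a-fA-F]{2})|(.))", re.DOTALL)

def unescape_separator(text):
    """Turn backslash escapes in a separator string into the raw characters."""
    def replace(match):
        unicode_hex, byte_hex, simple = match.groups()
        if unicode_hex is not None:
            return chr(int(unicode_hex, 16))
        if byte_hex is not None:
            return chr(int(byte_hex, 16))
        # Unknown escapes are left untouched rather than guessed at.
        return _SIMPLE.get(simple, match.group(0))
    return _ESCAPE.sub(replace, text)

# The escaped text travels as plain ASCII and is only converted at the end.
separator = unescape_separator(r"\u0001")
print("a\u0001b\u0001c".split(separator))  # ['a', 'b', 'c']
```

Sending the escaped form over the wire keeps control characters out of form fields and request bodies, which is where the raw character was getting lost in the earlier comments.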

Note that for your example, Refine correctly guesses the separator character, 
so you shouldn't even have to type it in.

Original comment by tfmorris on 4 Nov 2011 at 7:08