bmohdz21 / google-refine

Automatically exported from code.google.com/p/google-refine

reinterpret() no longer seems to work as expected #237

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Create a new project from the attached file, using the defaults.
2. Try reinterpret(value,"utf-8") on a column; it gives an error.
3. Screenshot from my system: http://awesomescreenshot.com/0af3msr38

What is the expected output? What do you see instead?

No error should be shown; the value should be returned instead.  The attached 
file was converted to straight ANSI using Notepad++ on Windows 7. (At least it 
said it did!)  I was hoping to use reinterpret to replace the Unicode 160 
non-breaking space with a regular space.  My thought was that I could use 
reinterpret(value,"ansi").  Is that supported?  I can't recall.  Our Quick 
Recipes page does have a link to this: 
http://download.oracle.com/javase/1.5.0/docs/guide/intl/encoding.doc.html

We also need to finish documenting the reinterpret GREL string function, with 
all of its options, on the wiki.  It used to be there, but it is no longer 
fully documented.

Original issue reported on code.google.com by thadguidry on 18 Nov 2010 at 4:41

GoogleCodeExporter commented 9 years ago
I've fixed a few things here:
- project creation will no longer leave the encoding unset (the cause of your 
NullPointerExceptions)
- I've lowered the minimum confidence threshold from 50 to 20 for the character 
set guesser.
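
To illustrate what a minimum confidence threshold does for a character set guesser, here is a minimal Python sketch. The function name `pick_encoding`, the candidate scores, and the 0-100 confidence scale are assumptions made up for illustration; this is not Refine's actual code.

```python
from typing import Optional, Sequence, Tuple


def pick_encoding(
    candidates: Sequence[Tuple[str, int]],
    min_confidence: int,
) -> Optional[str]:
    """Return the highest-confidence encoding name, or None if it is below the threshold.

    `candidates` holds hypothetical (encoding name, confidence 0-100) pairs,
    as a detector might report them.
    """
    if not candidates:
        return None
    name, confidence = max(candidates, key=lambda pair: pair[1])
    return name if confidence >= min_confidence else None


# Made-up guesses for a short, ambiguous file.
guesses = [("ISO-8859-1", 35), ("windows-1252", 30)]

strict = pick_encoding(guesses, min_confidence=50)   # no guess clears 50
lenient = pick_encoding(guesses, min_confidence=20)  # best guess clears 20
```

With a threshold of 50 the made-up guesses are all rejected and no encoding is returned; with 20 the best guess is accepted.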

Ultimately I think we should be allowing the user more control over the 
character encoding.

Note that the reinterpret() function won't actually do what you desire.  You 
want to use something along the lines of value.replace(' ',' ') [Where the 
first literal contains a NBSP]
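
For comparison outside GREL, the same substitution can be sketched in Python, where the non-breaking space can be written as an explicit escape instead of an invisible literal. The helper name `replace_nbsp` and the sample string are made up for illustration.

```python
NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE (Unicode code point 160)


def replace_nbsp(text: str) -> str:
    """Return `text` with every non-breaking space replaced by a regular space."""
    return text.replace(NBSP, " ")


sample = "New\u00a0York"
cleaned = replace_nbsp(sample)
```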

For anyone else who's attempting to reproduce this, if your system's default 
character encoding is UTF-8, as mine is, you won't even get as far as Thad.  
Instead you'll end up with all the non-breaking spaces substituted with the 
replacement character, U+FFFD (because the ISO Latin-1 NBSP byte, 0xA0, is not 
valid UTF-8 on its own).  No amount of re-encoding will save you at that point.
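
The loss described here can be reproduced in a few lines of Python. This is only a sketch of the general decoding behaviour, assuming a file that stores the NBSP as the single Latin-1 byte 0xA0; it does not use Refine's import code.

```python
# "a<NBSP>b" as it would be stored in an ISO Latin-1 file: the NBSP is byte 0xA0.
raw = "a\u00a0b".encode("latin-1")

# Reading those bytes as UTF-8: 0xA0 is not a valid sequence on its own,
# so a lenient decoder substitutes U+FFFD REPLACEMENT CHARACTER.
decoded = raw.decode("utf-8", errors="replace")

# Trying to undo the damage by re-encoding does not bring the NBSP back,
# because the original byte value was discarded during decoding.
roundtrip = decoded.encode("utf-8").decode("latin-1")
```

After the lossy decode, `decoded` contains U+FFFD where the NBSP was, and `roundtrip` contains the three Latin-1 characters for the bytes of U+FFFD rather than the original NBSP.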

Original comment by tfmorris on 26 Nov 2010 at 10:20

GoogleCodeExporter commented 9 years ago
Fixed in rev 1931.

Original comment by tfmorris on 26 Nov 2010 at 10:25

GoogleCodeExporter commented 9 years ago
Issue 164 has been merged into this issue.

Original comment by tfmorris on 27 Nov 2010 at 12:38

GoogleCodeExporter commented 9 years ago
Issue 386 has been merged into this issue.

Original comment by tfmorris on 25 May 2011 at 5:23

GoogleCodeExporter commented 9 years ago

Original comment by tfmorris on 9 Jun 2011 at 7:58

GoogleCodeExporter commented 9 years ago
OK, those supported encodings SHOULD work; however, they do not work with the 
reinterpret() function.  We have that issue logged, but I would really like to 
see it fixed NOW, before we release 2.5.

A simple test such as:

"Can we fix this?!?".toString().escape("html")  WORKS :)

"Can we fix this?!?".toString().reinterpret("utf-8") Error: reinterpret: 
encoding 'utf-8' is not available or recognized.  FAILS :(

reinterpret("Can we fix this?!?","utf-8")  Error: reinterpret: encoding 'utf-8' 
is not available or recognized.  FAILS :(

$10 PayPal bucks from me for the person who fixes this first!

Original comment by thadguidry on 21 Oct 2011 at 10:09

GoogleCodeExporter commented 9 years ago
All three examples work without error on my Ubuntu system when testing against 
SVN trunk.

What O/S are you using?  Does the problem affect only utf-8 or all encodings?  
Have you tried any variations such as "UTF-8" or "UTF8"?

I'll boot Windows to check it there after I've finished up some other stuff.

Original comment by tfmorris on 21 Oct 2011 at 10:53

GoogleCodeExporter commented 9 years ago
Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation.  All rights reserved.

C:\Windows\system32>java -version
java version "1.7.0"
Java(TM) SE Runtime Environment (build 1.7.0-b147)
Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode)

Yes, tried "ascii", "us-ascii", "US-ASCII", "UTF8", "utf8", "UTF-8", "utf-8", "latin-1", "LATIN-1", "BIG5", "big5"

Original comment by thadguidry on 21 Oct 2011 at 11:11

GoogleCodeExporter commented 9 years ago
Windows 7 64 bit

Original comment by thadguidry on 21 Oct 2011 at 11:14

GoogleCodeExporter commented 9 years ago
Both cases labelled as failing also work on Windows XP with a Sun Java 1.6.0 
JVM.

There are only a couple of things which come to mind as possibilities:

1. It's a Java 7 specific bug, although that seems like a pretty big thing for 
them to have broken and not noticed.

2. The encoding stored in the project is messed up (perhaps an old project from 
back when the encoding could be null due to a bug).  There are two character 
encodings involved in this operation, the source encoding and the destination 
encoding, so it might not be the "utf-8" which is the problem.
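
To make the role of the two encodings concrete, here is a minimal Python sketch of a reinterpret-style operation: encode the string with the source encoding, then decode the resulting bytes with the target encoding. The function signature, the error messages, and the handling of an unset source encoding are assumptions for illustration and are not taken from Refine's source.

```python
import codecs
from typing import Optional


def reinterpret(text: str, target_encoding: str, source_encoding: Optional[str]) -> str:
    """Encode `text` with the source encoding, then decode the bytes with the target one."""
    if source_encoding is None:
        # Hypothetical stand-in for a project whose encoding was never set.
        raise ValueError("source encoding is not set")
    for name in (source_encoding, target_encoding):
        try:
            codecs.lookup(name)
        except LookupError:
            raise ValueError(f"encoding {name!r} is not available or recognized") from None
    return text.encode(source_encoding).decode(target_encoding, errors="replace")


# An e-acute saved as UTF-8 but read as Latin-1 shows up as two characters.
mojibake = "\u00c3\u00a9"
repaired = reinterpret(mojibake, "utf-8", "latin-1")
```

In this sketch a failure can come from either name, so an error raised while asking for "utf-8" does not by itself show that "utf-8" is the encoding at fault.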

I suggest that we move the discussion someplace other than this bug report (the 
dev list?) since I'm pretty convinced it's not a regression of this bug fix.

Original comment by tfmorris on 26 Oct 2011 at 11:43

GoogleCodeExporter commented 9 years ago
Agreed, push this up to the dev list so we can talk and test the crap out of 
this.  It is really bugging me.  I do have my JAVA_HOME path set to the 1.6.24 
version, btw.

Original comment by thadguidry on 26 Oct 2011 at 11:48

GoogleCodeExporter commented 9 years ago
I'm fairly convinced that the underlying problem in comment 6 is that the 
project's encoding isn't set properly.  I've created new issue 486 to track 
this.

Original comment by tfmorris on 18 Nov 2011 at 11:35