geocow / google-refine

Automatically exported from code.google.com/p/google-refine

Sort text with accents #202

Closed: GoogleCodeExporter closed this issue 8 years ago

GoogleCodeExporter commented 8 years ago
When sorting rows, values that start with an accented character are sorted 
incorrectly: they are placed after all the unaccented letters. Example:

[[Turquie]]
[[Tuvalu]]
[[Ukraine]]
[[Uruguay]]
[[Vanuatu]]
[[Venezuela]]
[[Vietnam]]
[[Égypte]]      <= oops! should be sorted with E
[[Équateur (pays)|Équateur]]
[[États-Unis]]
[[Îles Marshall]]

Original issue reported on code.google.com by leblanc....@gmail.com on 14 Nov 2010 at 4:56

GoogleCodeExporter commented 8 years ago
Would the proposed behaviour be correct for all languages which use a 
diacritic, or any other type of ancillary glyph?  I'm not so sure it would be.

The current behaviour of Refine takes the safe approach, but, as you point out, 
at the cost of usability.

It should be possible to work around this by creating a new column based on the 
existing one, and converting all characters in the new column to ASCII-only.  You 
could then sort on that column.
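
As a rough illustration of what "converting to ASCII-only" could mean, here is a minimal Java sketch. It is not a Refine expression, and the class and method names (AsciiFoldSketch, stripDiacritics) are made up for this example. It decomposes each character and drops the combining marks, so it only handles letters that have a canonical decomposition; characters such as ø or ß would pass through unchanged.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of Refine: builds an accent-free sort key.
final class AsciiFoldSketch {

    // Decompose each character (NFD), then remove the combining marks that
    // the decomposition split off from their base letters.
    static String stripDiacritics(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "\u00C9" is E with acute accent, "\u00CE" is I with circumflex.
        System.out.println(stripDiacritics("\u00C9gypte"));        // Egypte
        System.out.println(stripDiacritics("\u00CEles Marshall")); // Iles Marshall
    }
}
```

Sorting on a column of such keys would place the E-acute entries with the other E entries, at the cost of maintaining an extra column.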

Original comment by iainsproat on 15 Nov 2010 at 4:12

GoogleCodeExporter commented 8 years ago
Collating sequences are locale specific, not language specific, but diacritic 
folding is a pretty basic need for many locales, the same way case folding is.  
Fortunately, it's super easy in Java to do the right thing by using the 
built-in collators and their associated CollationKeys.

http://download.oracle.com/javase/1.4.2/docs/api/java/text/CollationKey.html
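
A minimal sketch of that approach, assuming a list of plain strings and an explicitly chosen locale. The class and method names (CollatorSortSketch, sortWithCollator) are illustrative and are not taken from the Refine code base. Collation keys are computed once per string, which is cheaper than calling Collator.compare repeatedly when sorting many values.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical example class, not Refine code.
final class CollatorSortSketch {

    // Sort strings using the collating sequence of the given locale.
    static List<String> sortWithCollator(List<String> values, Locale locale) {
        Collator collator = Collator.getInstance(locale);

        // Precompute one CollationKey per string.
        List<CollationKey> keys = new ArrayList<>();
        for (String v : values) {
            keys.add(collator.getCollationKey(v));
        }

        // CollationKey implements Comparable<CollationKey>.
        Collections.sort(keys);

        List<String> sorted = new ArrayList<>();
        for (CollationKey k : keys) {
            sorted.add(k.getSourceString());
        }
        return sorted;
    }

    public static void main(String[] args) {
        // "\u00C9" is E with acute accent.
        List<String> countries = Arrays.asList(
                "Ukraine", "\u00C9gypte", "Turquie", "\u00C9tats-Unis", "Vanuatu");
        // With a French collator the E-acute entries come before Turquie.
        System.out.println(sortWithCollator(countries, Locale.FRENCH));
    }
}
```

Calling Collator.getInstance() with no argument would use the JVM's default locale instead of a fixed one.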

Original comment by tfmorris on 16 Nov 2010 at 5:42

GoogleCodeExporter commented 8 years ago
I hadn't noticed that this had been marked as an enhancement request.  Sorting 
is a pretty basic capability and Refine should be supporting a wider audience 
than just U.S. English, so I feel not having basic sorting capabilities is a 
bug.

Perhaps fancy options are an enhancement, but Refine should do basic sorting 
using the system's default locale out of the box.

Original comment by tfmorris on 5 Dec 2010 at 8:47

GoogleCodeExporter commented 8 years ago
Refine will now use the collating sequence associated with the system's default 
locale to collate strings.  It also does a normalizing decomposition of 
characters so that é will collate the same regardless of whether it is encoded 
as a single precomposed character or as a base character followed by a 
non-spacing (combining) accent.

Default collating strength is Collator.SECONDARY which maps roughly to the 
previous case-insensitive setting.  Setting the case sensitive bit changes this 
to Collator.IDENTICAL.  See the Java docs for more information on these 
settings and other possible options.
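
For readers who want to see what these settings do, here is a small sketch using the standard java.text.Collator API. It is an illustration, not the actual Refine change: the class name, the newCollator helper and the caseSensitive flag are assumptions, and a fixed locale is passed in only to keep the example reproducible.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical example class, not Refine code.
final class CollatorStrengthSketch {

    // Build a collator with canonical decomposition enabled and a strength
    // chosen from a case-sensitivity flag, as described in the comment above.
    static Collator newCollator(Locale locale, boolean caseSensitive) {
        Collator collator = Collator.getInstance(locale);
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        collator.setStrength(caseSensitive ? Collator.IDENTICAL : Collator.SECONDARY);
        return collator;
    }

    public static void main(String[] args) {
        Collator loose = newCollator(Locale.FRENCH, false);
        Collator strict = newCollator(Locale.FRENCH, true);

        String precomposed = "\u00E9"; // e-acute as a single character
        String combining = "e\u0301";  // 'e' followed by a combining acute accent

        // Both encodings of e-acute compare as equal (result 0).
        System.out.println(loose.compare(precomposed, combining));
        // SECONDARY ignores case differences (result 0).
        System.out.println(loose.compare("egypte", "Egypte"));
        // SECONDARY still distinguishes accents (non-zero result).
        System.out.println(loose.compare("e", precomposed) != 0);
        // IDENTICAL distinguishes case (non-zero result).
        System.out.println(strict.compare("egypte", "Egypte") != 0);
    }
}
```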

Original comment by tfmorris on 12 Dec 2010 at 6:21

GoogleCodeExporter commented 8 years ago

Original comment by tfmorris on 9 Jun 2011 at 7:58