clcron / guava-libraries

Automatically exported from code.google.com/p/guava-libraries
Apache License 2.0

Ascii.caseInsensitiveEquivalence() #580

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
The utility class Predicates contains a few String/CharSequence-related 
Predicate factory methods. The one I most often end up writing myself on top 
of the Predicate API is equalToIgnoreCase. Would it be possible to add this 
method to the Predicates class? Its implementation would use String's 
equalsIgnoreCase method.

Original issue reported on code.google.com by ogregoire on 25 Mar 2011 at 2:19
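
For readers following along, the requested predicate can be sketched as below. This is only an illustration under stated assumptions: it uses java.util.function.Predicate as a stand-in for Guava's Predicate, and the class name IgnoreCasePredicates is hypothetical. equalToIgnoreCase is the name proposed in the issue, not an existing Guava method.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical holder class; not part of Guava.
final class IgnoreCasePredicates {
    private IgnoreCasePredicates() {}

    // The factory method requested in the issue, delegating to String.equalsIgnoreCase.
    static Predicate<String> equalToIgnoreCase(String target) {
        Objects.requireNonNull(target, "target");
        // String.equalsIgnoreCase(null) returns false, so null inputs simply do not match.
        return input -> target.equalsIgnoreCase(input);
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Guava", "GUAVA", "guice", null);
        List<String> matches = names.stream()
                .filter(equalToIgnoreCase("guava"))
                .collect(Collectors.toList());
        System.out.println(matches); // [Guava, GUAVA]
    }
}
```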

GoogleCodeExporter commented 9 years ago
It would be better to generalize and create a Predicates.compareTo(Comparator) 
and then do Predicates.compareTo(String.CASE_INSENSITIVE_ORDER).

Original comment by amer...@gmail.com on 25 Mar 2011 at 4:17
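
One possible reading of this generalization is sketched below. The method name comparesEqualTo, the explicit target parameter, and the class ComparatorPredicates are all assumptions made for illustration; the comment only names Predicates.compareTo(Comparator), and java.util.function.Predicate again stands in for Guava's Predicate.

```java
import java.util.Comparator;
import java.util.Objects;
import java.util.function.Predicate;

// Hypothetical holder class; not part of Guava.
final class ComparatorPredicates {
    private ComparatorPredicates() {}

    // Matches any input that the given comparator considers equal to the target.
    static <T> Predicate<T> comparesEqualTo(Comparator<? super T> comparator, T target) {
        Objects.requireNonNull(comparator, "comparator");
        // Null handling is left to the comparator; String.CASE_INSENSITIVE_ORDER
        // throws NullPointerException for null arguments.
        return input -> comparator.compare(input, target) == 0;
    }

    public static void main(String[] args) {
        Predicate<String> isGuava = comparesEqualTo(String.CASE_INSENSITIVE_ORDER, "guava");
        System.out.println(isGuava.test("GuAvA")); // true
        System.out.println(isGuava.test("guice")); // false
    }
}
```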

GoogleCodeExporter commented 9 years ago
I agree with you on the general case of comparables, but I think that 
equalsIgnoreCase is unique enough to have its own predicate, and would be used 
enough on its own. For instance, I've worked on seven or eight projects since 
I started using Guava, and only one of them did not need such an 
implementation.

Original comment by ogregoire on 25 Mar 2011 at 6:07

GoogleCodeExporter commented 9 years ago
I don't know yet that this is something we will want to provide. But if we do, 
it would almost certainly be in our Ascii class, so its ASCII-only nature is as 
clear as we can make it. We don't want to become an internationalization 
library.

Perhaps Ascii.caseInsensitive() would return an Equivalence<String> or 
Equivalence<CharSequence>, from which I'd hope it would be easy to get a 
Predicate, through some hard-to-name method like 
Ascii.caseInsensitive().<something>("targetstring").

Incidentally, I happen to feel that case-insensitive comparison is *way* 
overused. I don't dispute that you probably do have hard requirements for it, 
it's just that I've seen it used more often when there was no such requirement 
than when there was.

Original comment by kevinb@google.com on 26 Mar 2011 at 5:12

GoogleCodeExporter commented 9 years ago
I understand you don't want Guava to find equivalences between the following 
French words: "côté" (side), "cote" (rating), "coté" (rated) and "côte" 
(coast). This is what I call internationalization.

But I still expect this method to return true when it compares "côté" and 
"CÔTÉ" as does "côté".equalsIgnoreCase("CÔTÉ"), even if these strings use 
characters that are not ASCII.

I hope we agree on the term "internationalization" over here.

Original comment by ogregoire on 29 Mar 2011 at 5:47

GoogleCodeExporter commented 9 years ago
Hrm.  It is possible that String.equalsIgnoreCase() really is safe, and I was 
simply overgeneralizing the problems with String.CASE_INSENSITIVE_ORDER as 
applying to it as well.

Still, it's i18n-sensitive enough that I just don't trust JDK libraries for it 
one bit; I'd always recommend putting your faith in ICU4J instead.  Which has a 
handy com.ibm.icu.util.CaseInsensitiveString class, btw.  It's not just that 
ICU4J is more correct and more closely maintained, it's that it's easier for 
users to stay on the newest version of it even when they can't move to the next 
JDK release for whatever reason.

Original comment by kevinb@google.com on 29 Mar 2011 at 11:30

GoogleCodeExporter commented 9 years ago
Isn't it possible then to create the method and explain your concerns in the 
documentation? Or explain how the method is implemented (in this case on top of 
equalsIgnoreCase)? Or even explicitly point to libraries like ICU4J in the doc 
for cases where finer control is needed?

Guava is not an internationalization library and that is not its scope, I 
fully agree on that, but isn't it a library that makes using Java smoother? 
I read "basic string processing" on the front page; I do think this is a 
rather basic use case in the Java world.

I'm suggesting this method here to give other programmers something more 
standardized; I have already implemented my own version of the 
equalToIgnoreCase Predicate and use it quite regularly.

Original comment by ogregoire on 30 Mar 2011 at 7:38

GoogleCodeExporter commented 9 years ago
Our i18n experts here confirmed that String.equalsIgnoreCase() is not 
i18n-smart.  If we're lucky they'll come here and explain why (I invited them 
to).  Suffice it to say I'm convinced we want to only support Ascii, or 
nothing, in Guava.

Original comment by kevinb@google.com on 30 Apr 2011 at 2:25

GoogleCodeExporter commented 9 years ago
Kevin's statement from comment 3 is spot on: "Case-insensitive comparison is 
*way* overused."

If the strings that you'd like to compare are both in ASCII, it's simple. 
You'll want something like Ascii.caseInsensitive().
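
A minimal sketch of what ASCII-only matching means in practice, using only the JDK. The class and method names here are hypothetical and are not the Guava API discussed in this thread. Only the letters A-Z and a-z are folded, and every other character must match exactly.

```java
// Hypothetical helper; illustrates ASCII-only case-insensitive equality.
final class AsciiCaseInsensitive {
    private AsciiCaseInsensitive() {}

    static boolean equalsIgnoreAsciiCase(CharSequence a, CharSequence b) {
        if (a.length() != b.length()) {
            return false;
        }
        for (int i = 0; i < a.length(); i++) {
            if (toLowerAscii(a.charAt(i)) != toLowerAscii(b.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Lower-cases 'A'..'Z' only; all other chars, including non-ASCII letters, pass through.
    private static char toLowerAscii(char c) {
        return (c >= 'A' && c <= 'Z') ? (char) (c + ('a' - 'A')) : c;
    }

    public static void main(String[] args) {
        System.out.println(equalsIgnoreAsciiCase("Content-Type", "content-type")); // true
        // "côté" vs "CÔTÉ": the accented letters are not ASCII, so they are not folded.
        System.out.println(equalsIgnoreAsciiCase("c\u00f4t\u00e9", "C\u00d4T\u00c9")); // false
    }
}
```

The second call shows the deliberate limitation: anything outside ASCII is compared exactly, which keeps the behaviour predictable and easy to document.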

If your strings can contain Unicode characters outside the ASCII range and you 
opt for "String.equalsIgnoreCase", then you're probably using the wrong method 
for what you intend to do. Instead, you'll probably want to use one of the 
Unicode normalization forms.

Unicode normalization is to Unicode what case-insensitive string comparison is 
to ASCII.

See: http://unicode.org/reports/tr15/

--

String matching is a spectrum from very loose matching to very exact matching.

 - In the ASCII world, the concepts are quite straightforward.
   - Often: Exact string match.
   - Often: Case-insensitive string match.
   - Rare: Application-specific solutions, such as a case-insensitive string match that also ignores the difference between "-" and " ", or ignores a trailing newline or period.

 - In the Unicode world, things get quite complex.
   - The Turkish "i" characters. Turkish has an upper-case "İ" with a dot, a lower-case "i" with a dot, an upper-case "I" without a dot, and a lower-case "ı" without a dot. That's the major reason why case mapping is language-sensitive. When you convert a string containing an "i" to upper case, in English you get an "I" and in Turkish you get an upper-case "İ" with a dot. Not the same characters.
   - A character which looks exactly like another character can be composed of an entirely different code point sequence. E.g. you can either use the single character "LATIN SMALL LETTER C WITH CEDILLA" (http://www.fileformat.info/info/unicode/char/e7/index.htm) for a French "ç", or you can compose it from a "c" followed by a combining cedilla. One string might use one of these forms and the other string the other form, because it was entered into the system by a different person / process. When you say you want to compare "case insensitively", you usually mean to ignore such Unicode composition differences as well. But how do you do that? You need to use a Unicode normalizer, not just a mechanism to ignore case. The same applies to the "Ô" in "CÔTÉ". It can be composed of different code point sequences, and unless you apply Unicode normalization before matching, you won't get the results that you probably expect.
   - You don't just have a single space character, you have a whole series of whitespace characters. When you compare strings, should two strings be considered different just because they use different space characters?
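
The first two points above can be reproduced with the JDK alone. The sketch below is illustrative: the class and helper names are hypothetical, and it uses java.text.Normalizer rather than ICU4J.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical demo class for the composition and Turkish "i" pitfalls.
final class UnicodeMatchingDemo {
    private UnicodeMatchingDemo() {}

    // Compares two strings after bringing both into normalization form NFC.
    static boolean equalAfterNfc(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    // Upper-cases using the case-mapping rules of the given language tag, e.g. "tr" or "en".
    static String upperIn(String languageTag, String s) {
        return s.toUpperCase(Locale.forLanguageTag(languageTag));
    }

    public static void main(String[] args) {
        String precomposed = "\u00e7"; // LATIN SMALL LETTER C WITH CEDILLA
        String decomposed = "c\u0327"; // "c" followed by COMBINING CEDILLA

        // Same visible character, different code point sequences.
        System.out.println(precomposed.equalsIgnoreCase(decomposed)); // false
        System.out.println(equalAfterNfc(precomposed, decomposed));   // true

        // Case mapping is language-sensitive: Turkish maps "i" to dotted capital I (U+0130).
        System.out.println(upperIn("en", "i").equals("I"));      // true
        System.out.println(upperIn("tr", "i").equals("\u0130")); // true
    }
}
```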

--

Providing a "case insensitive" string comparison function in Guava would be a 
disservice to the community, unless it's for ASCII inputs only and very clearly 
documented as such. Developers would use such a function without understanding 
the ramifications and obviously write buggy applications.

If your data is not limited to ASCII, you'll probably want to use 
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/Normalizer.html instead.

And my guess is that most developers actually want to use a Normalizer instance 
for something like normalization form "NFC" + case folding + ignoring the 
differences between the Turkish i characters.

Original comment by staudac...@google.com on 30 Apr 2011 at 3:21

GoogleCodeExporter commented 9 years ago
Extending what Andy said, the best choice would be Unicode's NFKC_CF 
(NFKC_Casefold) mapping, which does case folding and NFC normalization, but 
also NFKC normalization, so it maps full-width and half-width Katakana 
together, for example.

See also

Unicode Normalization -
http://unicode.org/reports/tr15/

Section 3.13 Default Case Algorithms in the Unicode Standard -
http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf#G33992

Original comment by markda...@google.com on 30 Apr 2011 at 4:19
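
The JDK has no built-in NFKC_CF, so the sketch below only approximates it: NFKC normalization, then locale-independent lower-casing, then NFKC again. Lower-casing is not full Unicode case folding (for instance, it does not map "ß" and "SS" together), so treat this as a rough illustration. The class and method names are hypothetical. ICU4J offers a dedicated NFKC_Casefold normalizer for real use.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper: builds a loose matching key that approximates NFKC_CF.
final class LooseMatchKey {
    private LooseMatchKey() {}

    static String looseKey(String s) {
        String nfkc = Normalizer.normalize(s, Normalizer.Form.NFKC);
        // Locale.ROOT avoids language-specific mappings such as the Turkish "i" rules.
        String lowered = nfkc.toLowerCase(Locale.ROOT);
        // Normalize again because case mapping can produce non-normalized output.
        return Normalizer.normalize(lowered, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Half-width KA (U+FF76) and full-width KA (U+30AB) get the same key.
        System.out.println(looseKey("\uFF76").equals(looseKey("\u30AB"))); // true
        // Precomposed "CÔTÉ" and decomposed "côté" get the same key.
        System.out.println(looseKey("C\u00d4T\u00c9").equals(looseKey("co\u0302te\u0301"))); // true
    }
}
```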

GoogleCodeExporter commented 9 years ago

Original comment by kevinb@google.com on 13 Jul 2011 at 6:18

GoogleCodeExporter commented 9 years ago

Original comment by kevinb@google.com on 16 Jul 2011 at 7:53

GoogleCodeExporter commented 9 years ago

Original comment by fry@google.com on 10 Dec 2011 at 4:04

GoogleCodeExporter commented 9 years ago
Someone can reopen this if they have new arguments justifying the need.

Original comment by kevinb@google.com on 16 Feb 2012 at 7:06

GoogleCodeExporter commented 9 years ago
This issue has been migrated to GitHub.

It can be found at https://github.com/google/guava/issues/<id>

Original comment by cgdecker@google.com on 1 Nov 2014 at 4:14

GoogleCodeExporter commented 9 years ago

Original comment by cgdecker@google.com on 3 Nov 2014 at 9:09