Open jjesusfilho opened 3 years ago
Could you please paste the output of a call to stringi::stri_info()
here? Thank you.
I wonder if strength=2 is what you might need:
stringi::stri_detect_coll(c("Mario", "mario", "Mário", "mário"), "mario", strength = 2L,case_level = TRUE, locale="pt_BR")
## [1] TRUE TRUE FALSE FALSE
stringi::stri_detect_coll(c("Mario", "mario", "Mário", "mário"), "mario", strength = 2L,case_level = TRUE)
## [1] TRUE TRUE FALSE FALSE
Could you please paste the output of a call to
stringi::stri_info()
here? Thank you.
$Unicode.version [1] "10.0"
$ICU.version [1] "61.1"
$Locale $Locale$Language [1] "pt"
$Locale$Country [1] "BR"
$Locale$Variant [1] ""
$Locale$Name [1] "pt_BR"
$Charset.internal [1] "UTF-8" "UTF-16"
$Charset.native $Charset.native$Name.friendly [1] "UTF-8"
$Charset.native$Name.ICU [1] "UTF-8"
$Charset.native$Name.UTR22 [1] NA
$Charset.native$Name.IBM [1] "ibm-1208"
$Charset.native$Name.WINDOWS [1] "windows-65001"
$Charset.native$Name.JAVA [1] "UTF-8"
$Charset.native$Name.IANA [1] "UTF-8"
$Charset.native$Name.MIME [1] "UTF-8"
$Charset.native$ASCII.subset [1] TRUE
$Charset.native$Unicode.1to1 [1] NA
$Charset.native$CharSize.8bit [1] FALSE
$Charset.native$CharSize.min [1] 1
$Charset.native$CharSize.max [1] 3
$ICU.system [1] FALSE
$ICU.UTF8 [1] FALSE
I wonder if strength=2 is what you might need:
stringi::stri_detect_coll(c("Mario", "mario", "Mário", "mário"), "mario", strength = 2L,case_level = TRUE, locale="pt_BR") ## [1] TRUE TRUE FALSE FALSE stringi::stri_detect_coll(c("Mario", "mario", "Mário", "mário"), "mario", strength = 2L,case_level = TRUE) ## [1] TRUE TRUE FALSE FALSE
Actually, what I want is to ignore accents, but not case. According to the ICU documentation mentioned above, I should achieve that by setting strength to level 1 and case level to On. I also work with PostgreSQL and it works accordingly, but with stringi, I can only get the same result by setting locale to '"", regardless of case_level's setting.
stringi::stri_detect_coll(c("Mário","mário"), "mario", strength=1L, locale = '')
[1] FALSE TRUE
First of all, thanks, there was a bug; locale=""
should mean locale=NULL
, i.e., your own locale, pt_BR
.
instead, locale=''
meant locale="POSIX"
- that is why it worked as expected (and perhaps this is what postgresql uses, hence the correct results). I would recommend setting locale="POSIX" explicitly then.
stringi::stri_detect_coll(c("Mario", "mario", "Mário", "mário"), "mario", strength = 1L, locale="POSIX")
## [1] FALSE TRUE FALSE TRUE
Yes, but I still believe that it can get a bit confusing because, in order to ignore accents but not case, in stringi, you have to set the locale, but not the case_level, when this is not the way PostgreSQL works and probably other languages, i.e, you must have the case level turned on along with the primary strength. This is also what the ICU documentation says:
"The case level is independent from the strength of comparison. It is possible to have a collator set to
primary strength with the case level turned on. This provides for comparison that takes into account the
case differences, while at the same time ignoring accents and tertiary differences other than case"
Hmmm.... interestingly, a collator-based string comparison honours the above rule...
> stringi::stri_cmp_equiv(c("Mario", "mario", "Mário", "mário"), "mario", case_level=TRUE, strength=2L)
[1] FALSE TRUE FALSE FALSE
> stringi::stri_cmp_equiv(c("Mario", "mario", "Mário", "mário"), "mario", case_level=TRUE, strength=1L)
[1] FALSE TRUE FALSE TRUE
> stringi::stri_cmp_equiv(c("Mario", "mario", "Mário", "mário"), "mario", strength=1L)
[1] TRUE TRUE TRUE TRUE
> stringi::stri_cmp_equiv(c("Mario", "mario", "Mário", "mário"), "mario", strength=2L)
[1] TRUE TRUE FALSE FALSE
I was trying hard to figure out why usearch
returns a different result below, but with no success. A bug in ICU?
stringi::stri_detect_coll(c("Mario", "mario", "Mário", "mário"), "mario", case_level=TRUE, strength=1L)
## [1] TRUE TRUE TRUE TRUE
stringi::stri_cmp_equiv(c("Mario", "mario", "Mário", "mário"), "mario", case_level=TRUE, strength=1L)
## [1] FALSE TRUE FALSE TRUE
(note to self): ICU 69.1 gives the results as above. @TODO: create a minimal reproducible example outside of stringi
[note to self]
Yes, this is reproducible outside of stringi:
/* g++ -std=c++11 icu_test_bug_ucol_caselevel.cpp -licui18n -licuuc -licudata && ./a.out */
#include <unicode/ustring.h>
#include <unicode/rbbi.h>
#include <unicode/coll.h>
#include <unicode/ucol.h>
#include <unicode/stsearch.h>
#include <cstdio>
using namespace icu;
#define WITH_CHECK_STATUS(f) \
status = U_ZERO_ERROR; \
f; \
if (U_FAILURE(status)) {printf("error %s!\n", u_errorName(status));return 1;}
int main()
{
const char* haystacks[] = {
"mario", "Mario", "MARIO", "MÁRIO", "Mário", "mário", "dario"
};
const char* needle = "mario";
UErrorCode status;
UCollator* col;
UStringSearch* matcher;
int v;
printf("U_ICU_VERSION=%s\n", U_ICU_VERSION);
WITH_CHECK_STATUS(col = ucol_open("pt_BR", &status))
WITH_CHECK_STATUS(ucol_setAttribute(col, UCOL_STRENGTH, UCOL_PRIMARY, &status))
WITH_CHECK_STATUS(ucol_setAttribute(col, UCOL_CASE_LEVEL, UCOL_ON, &status))
UnicodeString _needle = UnicodeString::fromUTF8(needle);
UnicodeString haystack = UnicodeString::fromUTF8("whatever");
WITH_CHECK_STATUS(matcher = usearch_openFromCollator(
_needle.getBuffer(),
_needle.length(),
haystack.getBuffer(),
haystack.length(),
col, NULL, &status
))
for (int i=0; i<sizeof(haystacks)/sizeof(haystacks[0]); ++i) {
printf("%s vs. %s: ", needle, haystacks[i]);
haystack = UnicodeString::fromUTF8(haystacks[i]);
WITH_CHECK_STATUS(
usearch_setText(matcher, haystack.getBuffer(), haystack.length(), &status)
)
usearch_reset(matcher);
WITH_CHECK_STATUS(v = ((int)USEARCH_DONE!=usearch_first(matcher, &status)))
printf(" usearch=%d", v);
WITH_CHECK_STATUS(v = (int)UCOL_EQUAL==ucol_strcollUTF8(col,
haystacks[i], -1, needle, -1, &status))
printf(" ucol=%d", v);
RuleBasedCollator* rbc = RuleBasedCollator::rbcFromUCollator(col);
WITH_CHECK_STATUS(StringSearch matcher2(needle, haystack, rbc, NULL, status))
WITH_CHECK_STATUS(v = ((int)USEARCH_DONE!=matcher2.first(status)))
printf(" StringSearch=%d", v);
printf("\n");
}
ucol_close(col);
usearch_close(matcher);
return 0;
}
yields:
U_ICU_VERSION=67.1
mario vs. mario: usearch=1 ucol=1 StringSearch=1
mario vs. Mario: usearch=1 ucol=0 StringSearch=1
mario vs. MARIO: usearch=1 ucol=0 StringSearch=1
mario vs. MÁRIO: usearch=1 ucol=0 StringSearch=1
mario vs. Mário: usearch=1 ucol=0 StringSearch=1
mario vs. mário: usearch=1 ucol=1 StringSearch=1
mario vs. dario: usearch=0 ucol=0 StringSearch=0
All right, it turns out that this issue has already been reported. It is ICU-related.
Thank you very much for taking your time to work this out. I have followed your thread posts and tried to find the issue on ICU's bug page, but couldn't figure it out.
According to the Unicode Technical Standard #35, when case Level is set to On and strength set to primary, ICU should ignore accents but take case into account. This is not the behavior I find in
stringi
package. To achieve that, I have to set locale = "", which is not what I expected. Examples:This is ok:
This should return FALSE:
In order to achieve that I need to set locale to "" on
stringi
:Here I respect case but don't add accent and get the expected result: