gagolews / stringi

Fast and portable character string processing in R (with the Unicode ICU)
https://stringi.gagolewski.com/
Other
304 stars 44 forks source link

Collation: case_level doesn't seem to work properly to ignore accents but take case into account (ICU StringSearch and UCollator; UCOL_STRENGTH=UCOL_PRIMARY, UCOL_CASE_LEVEL=UCOL_ON) #415

Open jjesusfilho opened 3 years ago

jjesusfilho commented 3 years ago

According to the Unicode Technical Standard #35, when case Level is set to On and strength set to primary, ICU should ignore accents but take case into account. This is not the behavior I find in stringi package. To achieve that, I have to set locale = "", which is not what I expected. Examples:

This is ok:

stringi::stri_detect_coll("Mário", "mario", strength = 1L)
[1] TRUE

This should return FALSE:

stringi::stri_detect_coll("Mário", "mario", strength = 1L,case_level = TRUE)
[1] TRUE

In order to achieve that I need to set locale to "" on stringi:

stringi::stri_detect_coll("Mário", "mario", strength = 1L, locale = "")
[1] FALSE

Here I respect case but don't add accent and get the expected result:

stringi::stri_detect_coll("Mário", "Mario", strength = 1L, locale = "")
[1] TRUE
gagolews commented 3 years ago

Could you please paste the output of a call to stringi::stri_info() here? Thank you.

gagolews commented 3 years ago

I wonder if strength=2 is what you might need:

stringi::stri_detect_coll(c("Mario", "mario", "Mário", "mário"), "mario", strength = 2L,case_level = TRUE, locale="pt_BR")
## [1]  TRUE  TRUE FALSE FALSE
stringi::stri_detect_coll(c("Mario", "mario", "Mário", "mário"), "mario", strength = 2L,case_level = TRUE)
## [1]  TRUE  TRUE FALSE FALSE
jjesusfilho commented 3 years ago

Could you please paste the output of a call to stringi::stri_info() here? Thank you.

$Unicode.version [1] "10.0"

$ICU.version [1] "61.1"

$Locale $Locale$Language [1] "pt"

$Locale$Country [1] "BR"

$Locale$Variant [1] ""

$Locale$Name [1] "pt_BR"

$Charset.internal [1] "UTF-8" "UTF-16"

$Charset.native $Charset.native$Name.friendly [1] "UTF-8"

$Charset.native$Name.ICU [1] "UTF-8"

$Charset.native$Name.UTR22 [1] NA

$Charset.native$Name.IBM [1] "ibm-1208"

$Charset.native$Name.WINDOWS [1] "windows-65001"

$Charset.native$Name.JAVA [1] "UTF-8"

$Charset.native$Name.IANA [1] "UTF-8"

$Charset.native$Name.MIME [1] "UTF-8"

$Charset.native$ASCII.subset [1] TRUE

$Charset.native$Unicode.1to1 [1] NA

$Charset.native$CharSize.8bit [1] FALSE

$Charset.native$CharSize.min [1] 1

$Charset.native$CharSize.max [1] 3

$ICU.system [1] FALSE

$ICU.UTF8 [1] FALSE

jjesusfilho commented 3 years ago

I wonder if strength=2 is what you might need:

stringi::stri_detect_coll(c("Mario", "mario", "Mário", "mário"), "mario", strength = 2L,case_level = TRUE, locale="pt_BR")
## [1]  TRUE  TRUE FALSE FALSE
stringi::stri_detect_coll(c("Mario", "mario", "Mário", "mário"), "mario", strength = 2L,case_level = TRUE)
## [1]  TRUE  TRUE FALSE FALSE

Actually, what I want is to ignore accents, but not case. According to the ICU documentation mentioned above, I should achieve that by setting strength to level 1 and case level to On. I also work with PostgreSQL and it works accordingly, but with stringi, I can only get the same result by setting locale to '"", regardless of case_level's setting.

stringi::stri_detect_coll(c("Mário","mário"), "mario", strength=1L,  locale = '')
[1] FALSE  TRUE
gagolews commented 3 years ago

First of all, thanks, there was a bug; locale="" should mean locale=NULL, i.e., your own locale, pt_BR.

gagolews commented 3 years ago

instead, locale='' meant locale="POSIX" - that is why it worked as expected (and perhaps this is what postgresql uses, hence the correct results). I would recommend setting locale="POSIX" explicitly then.

stringi::stri_detect_coll(c("Mario", "mario", "Mário", "mário"), "mario", strength = 1L, locale="POSIX")
## [1] FALSE  TRUE FALSE  TRUE
jjesusfilho commented 3 years ago

Yes, but I still believe that it can get a bit confusing because, in order to ignore accents but not case, in stringi, you have to set the locale, but not the case_level, when this is not the way PostgreSQL works and probably other languages, i.e, you must have the case level turned on along with the primary strength. This is also what the ICU documentation says:

"The case level is independent from the strength of comparison. It is possible to have a collator set to 
 primary strength with the case level turned on. This provides for comparison that takes into account the 
 case differences, while at the same time ignoring accents and tertiary differences other than case"
gagolews commented 3 years ago

Hmmm.... interestingly, a collator-based string comparison honours the above rule...

> stringi::stri_cmp_equiv(c("Mario", "mario", "Mário", "mário"), "mario", case_level=TRUE, strength=2L)
[1] FALSE  TRUE FALSE FALSE
> stringi::stri_cmp_equiv(c("Mario", "mario", "Mário", "mário"), "mario", case_level=TRUE, strength=1L)
[1] FALSE  TRUE FALSE  TRUE
> stringi::stri_cmp_equiv(c("Mario", "mario", "Mário", "mário"), "mario", strength=1L)
[1] TRUE TRUE TRUE TRUE
> stringi::stri_cmp_equiv(c("Mario", "mario", "Mário", "mário"), "mario", strength=2L)
[1]  TRUE  TRUE FALSE FALSE
gagolews commented 3 years ago

I was trying hard to figure out why usearch returns a different result below, but with no success. A bug in ICU?

stringi::stri_detect_coll(c("Mario", "mario", "Mário", "mário"), "mario", case_level=TRUE, strength=1L)
## [1] TRUE TRUE TRUE TRUE
stringi::stri_cmp_equiv(c("Mario", "mario", "Mário", "mário"), "mario", case_level=TRUE, strength=1L)
## [1] FALSE  TRUE FALSE  TRUE
gagolews commented 3 years ago

(note to self): ICU 69.1 gives the results as above. @TODO: create a minimal reproducible example outside of stringi

gagolews commented 3 years ago

[note to self]

Yes, this is reproducible outside of stringi:

/* g++ -std=c++11 icu_test_bug_ucol_caselevel.cpp -licui18n -licuuc -licudata  && ./a.out */

#include <unicode/ustring.h>
#include <unicode/rbbi.h>
#include <unicode/coll.h>
#include <unicode/ucol.h>
#include <unicode/stsearch.h>
#include <cstdio>
using namespace icu;

#define WITH_CHECK_STATUS(f) \
    status = U_ZERO_ERROR; \
    f; \
    if (U_FAILURE(status)) {printf("error %s!\n", u_errorName(status));return 1;}

int main()
{
    const char* haystacks[] = {
        "mario", "Mario", "MARIO", "MÁRIO", "Mário", "mário", "dario"
    };

    const char* needle = "mario";

    UErrorCode status;
    UCollator* col;
    UStringSearch* matcher;
    int v;

    printf("U_ICU_VERSION=%s\n", U_ICU_VERSION);

    WITH_CHECK_STATUS(col = ucol_open("pt_BR", &status))

    WITH_CHECK_STATUS(ucol_setAttribute(col, UCOL_STRENGTH, UCOL_PRIMARY, &status))
    WITH_CHECK_STATUS(ucol_setAttribute(col, UCOL_CASE_LEVEL, UCOL_ON, &status))

    UnicodeString _needle = UnicodeString::fromUTF8(needle);
    UnicodeString haystack = UnicodeString::fromUTF8("whatever");

    WITH_CHECK_STATUS(matcher = usearch_openFromCollator(
        _needle.getBuffer(),
        _needle.length(),
        haystack.getBuffer(),
        haystack.length(),
        col, NULL, &status
    ))

    for (int i=0; i<sizeof(haystacks)/sizeof(haystacks[0]); ++i) {
        printf("%s vs. %s: ", needle, haystacks[i]);
        haystack = UnicodeString::fromUTF8(haystacks[i]);

        WITH_CHECK_STATUS(
            usearch_setText(matcher, haystack.getBuffer(), haystack.length(), &status)
        )

        usearch_reset(matcher);

        WITH_CHECK_STATUS(v = ((int)USEARCH_DONE!=usearch_first(matcher, &status)))
        printf(" usearch=%d", v);

        WITH_CHECK_STATUS(v = (int)UCOL_EQUAL==ucol_strcollUTF8(col,
            haystacks[i], -1, needle, -1, &status))
        printf(" ucol=%d", v);

        RuleBasedCollator* rbc = RuleBasedCollator::rbcFromUCollator(col);
        WITH_CHECK_STATUS(StringSearch matcher2(needle, haystack, rbc, NULL, status))
        WITH_CHECK_STATUS(v = ((int)USEARCH_DONE!=matcher2.first(status)))
        printf(" StringSearch=%d", v);

        printf("\n");
    }

    ucol_close(col);
    usearch_close(matcher);

    return 0;
}

yields:

U_ICU_VERSION=67.1
mario vs. mario:  usearch=1 ucol=1 StringSearch=1
mario vs. Mario:  usearch=1 ucol=0 StringSearch=1
mario vs. MARIO:  usearch=1 ucol=0 StringSearch=1
mario vs. MÁRIO:  usearch=1 ucol=0 StringSearch=1
mario vs. Mário:  usearch=1 ucol=0 StringSearch=1
mario vs. mário:  usearch=1 ucol=1 StringSearch=1
mario vs. dario:  usearch=0 ucol=0 StringSearch=0
gagolews commented 3 years ago

All right, it turns out that this issue has already been reported. It is ICU-related.

https://unicode-org.atlassian.net/browse/ICU-21338

jjesusfilho commented 3 years ago

Thank you very much for taking your time to work this out. I have followed your thread posts and tried to find the issue on ICU's bug page, but couldn't figure it out.