UCL / frrant


Feature/search diacriticals #431

Closed acholyn closed 6 months ago

acholyn commented 7 months ago

closes #377 and #362

Plain text versions of text objects have the non-ASCII Unicode characters stripped out, so a text search for "plain text" should return results matching "Õh Ṭọ hάvë pļāĭñ těxt", as diacritics and similar marks are ignored.

acholyn commented 6 months ago

From what I could tell, searches for Greek characters are converted to their transliterated versions, e.g. α becomes a. Is that not helpful? Do they enter Greek characters in the search?

tcouch commented 6 months ago

Richard's original comment in #377 mentions using Greek characters in the search. As I said, we could solve it by converting the search term using unidecode as well, but I feel the function I suggested above, which just strips out diacritics without converting to ASCII, would be safer.
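To illustrate the second option, here is a minimal sketch that normalises both the stored text and the search term with the same diacritic-stripping step. The names `strip_diacritics` and `matches` are hypothetical and not taken from the frrant codebase, and the plain substring check only stands in for whatever search backend is really used.

import unicodedata

def strip_diacritics(text: str) -> str:
    """Drop combining marks and casefold, leaving base letters in their own script."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()

def matches(search_term: str, content: str) -> bool:
    # Normalise both sides identically so accents and case do not affect the match.
    # (Hypothetical helper: a substring test stands in for the real search.)
    return strip_diacritics(search_term) in strip_diacritics(content)

print(matches("plain text", "Õh Ṭọ hάvë pļāĭñ těxt"))  # True
print(matches("λογος", "ὁ Λόγος"))  # True: Greek letters are kept, only the accents go

Because nothing is transliterated to ASCII, a Greek search term still matches Greek content, which is the case unidecode alone would break.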

acholyn commented 6 months ago

I just realised unidecode converts everything to ASCII, which means searches for Greek characters wouldn't work. Either we'd have to convert search terms to ASCII as well, or use another method such as this one to remove diacritics:

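import unicodedata

def remove_combining_fluent(string: str) -> str:
    # Decompose each character into its base character plus combining marks.
    normalized = unicodedata.normalize('NFD', string)
    # Keep only the non-combining characters, then casefold for caseless matching.
    return ''.join(
        char for char in normalized if not unicodedata.combining(char)
    ).casefold()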

Also, we need to write a migration to recreate plain content for all the relevant text fields using the new approach.

What does NFD mean in this code?

tcouch commented 6 months ago

Normal Form D. From the documentation for unicodedata:

...there are two normal forms: normal form C and normal form D. Normal form D (NFD) is also known as canonical decomposition, and translates each character into its decomposed form. Normal form C (NFC) first applies a canonical decomposition, then composes pre-combined characters again.

So, it splits a character with a diacritic into a base character followed by one or more combining marks. The rest of that function then discards the combining characters, leaving only the base characters.
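A small sketch of what that looks like for a single character, using only the standard-library unicodedata module. The sample character (é, written as an escape so the source encoding doesn't matter) is just an illustration.

import unicodedata

precomposed = "\u00e9"  # é as one code point

# NFD splits the precomposed character into a base letter plus a combining mark.
decomposed = unicodedata.normalize("NFD", precomposed)
print([unicodedata.name(c) for c in decomposed])
# ['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']

# combining() returns 0 for base characters and a non-zero class for combining marks.
print([unicodedata.combining(c) for c in decomposed])  # [0, 230]

# NFC goes the other way and recomposes the pair into a single code point.
recomposed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(recomposed))  # 2 1

Filtering on a non-zero combining() value is what drops the accent and keeps the plain e.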