General Search - name with accent

cdewaele commented 3 years ago

In the webtrees 2.0.10 demo, there are individuals whose names include accents, for example:

Augustus the Younger, Duke of Brunswick-Lüneburg
Ferdinand Albert II, Duke of Brunswick-Wolfenbüttel

On the 'General search' page, when we search for Luneburg (without '¨') in the Families records, we find married individuals whose names include Lüneburg. But when we search for Luneburg (without '¨') in the Individuals records, we find nothing.

This is due to: https://github.com/fisharebest/webtrees/blob/708e66987f7d6eed5675fea31f4074192e368cac/app/Services/SearchService.php#L1105

Perhaps one way to fix the problem would be to add something like this before mb_stripos?

$rule = 'NFD; [:Nonspacing Mark:] Remove; NFC';
$myTrans = Transliterator::create($rule);
$gedcom = $myTrans->transliterate($gedcom);
$search_term = $myTrans->transliterate($search_term);
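To make the proposal concrete, here is a minimal sketch of an accent-insensitive substring test built on the same transliterator rule. The helper names stripAccents and containsIgnoringAccents are hypothetical and are not part of webtrees; the sketch assumes the PHP intl and mbstring extensions are available.

```php
<?php
/**
 * Remove combining accents (ü -> u, É -> E). Needs the PHP intl extension.
 * Letters without a canonical decomposition (ß, ø, æ) are left unchanged.
 * Hypothetical helper, not part of webtrees.
 */
function stripAccents(string $text): string
{
    static $transliterator = null;

    if ($transliterator === null) {
        // Same rule as in the proposal: decompose, drop the marks, recompose.
        $transliterator = Transliterator::create('NFD; [:Nonspacing Mark:] Remove; NFC');
    }

    // Fall back to the unmodified text if the transliterator is unavailable.
    if ($transliterator === null) {
        return $text;
    }

    $result = $transliterator->transliterate($text);

    return $result === false ? $text : $result;
}

/**
 * Case-insensitive, accent-insensitive "contains" test.
 * Both sides are folded, so "Émile" and "Emile" match each other.
 */
function containsIgnoringAccents(string $haystack, string $needle): bool
{
    return mb_stripos(stripAccents($haystack), stripAccents($needle)) !== false;
}
```

With this folding, searching for Luneburg would match Lüneburg, but Mueller would still not match Müller, because the rule only removes marks and never expands a letter into two.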

ric2016 commented 3 years ago

See also #3584 for a somewhat related issue (a search where there is a difference between individual and family search).

The fix proposed there would probably fix this issue as well, at least for names. For searches targeting other parts of the GEDCOM, rawGedcomFilter may actually have to be adjusted as proposed here.

fisharebest commented 3 years ago

What is the expectation of a user whose language uses accents?

Searching for u will find ü and u?
Searching for ü will find ü but not u?
Searching for ü will find ü and ue?

cdewaele commented 3 years ago

The expectation is: searching for u will find ü and u.

For example, in French some first names begin with É, and some people will write E instead of É:

searching for Emile will find Émile and Emile
searching for Émile will also find Émile and Emile

hartenthaler commented 3 years ago

For German I would expect:

Searching for u will find u, but finding ü is acceptable
Searching for 'ü' will find ü but not u, but finding u is acceptable
Searching for ü will find ü and ue
Searching for ue will find ue and ü
... the same for a, A, o, O and U
Searching for ss will find ß and vice versa
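The German rules listed above (ü and ue match each other, ß and ss match each other) could be sketched as a folding step applied to both the search term and the text. This is only an illustration: the function names are hypothetical, the input is assumed to be in precomposed (NFC) form, and plain u is deliberately not matched against ü here, which the list above describes as acceptable but not required.

```php
<?php
/**
 * Fold text into a German-style search key: lower-case it, then expand
 * umlauts and sharp s into their two-letter spellings.
 * Hypothetical helper; assumes NFC (precomposed) UTF-8 input.
 */
function germanSearchKey(string $text): string
{
    $lower = mb_strtolower($text, 'UTF-8');

    // strtr() with an array replaces whole byte sequences, so it is safe
    // for multi-byte UTF-8 characters such as ü and ß.
    return strtr($lower, [
        'ä' => 'ae',
        'ö' => 'oe',
        'ü' => 'ue',
        'ß' => 'ss',
    ]);
}

/**
 * Substring test on the folded keys, so "ue" finds "ü" and vice versa.
 */
function germanContains(string $haystack, string $needle): bool
{
    return mb_strpos(germanSearchKey($haystack), germanSearchKey($needle)) !== false;
}
```

Because both sides pass through the same key function, the match is symmetric: Mueller finds Müller and Müller finds Mueller. A side effect of this simple approach is that a genuine "ue" (as in "Manuel") also matches a search for "ü".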

ric2016 commented 3 years ago

It seems that this requires a locale-aware variant of stripos etc. (for cases where a search isn't executed directly as a database query using a specific collation).

We have a collator already in I18N.php (self::$collator), but unfortunately there aren't any methods for substring matching (these have been proposed but haven't been included yet).

So we may have to implement a locale-aware variant of stripos ourselves, based on $collator->compare (see here for a similar suggestion). This probably won't be very efficient though.
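A rough sketch of such a collator-based stripos, assuming the intl extension. The function name collatorStripos is hypothetical. It tries every start offset and every window length, because text that compares equal under a collation may be longer or shorter than the search term, so the cost grows quadratically with the length of the text, which is the efficiency concern mentioned above.

```php
<?php
/**
 * Locale-aware substring search built only on Collator::compare().
 * Returns the character offset of the first match, or false if none.
 * Hypothetical helper, not part of webtrees or the intl extension.
 *
 * @return int|false
 */
function collatorStripos(Collator $collator, string $haystack, string $needle)
{
    $haystackLength = mb_strlen($haystack);

    if ($needle === '') {
        return 0;
    }

    for ($start = 0; $start < $haystackLength; $start++) {
        // The matching text need not have the same length as the needle,
        // so every window length from this offset is tried.
        for ($length = 1; $length <= $haystackLength - $start; $length++) {
            $candidate = mb_substr($haystack, $start, $length);

            if ($collator->compare($candidate, $needle) === 0) {
                return $start;
            }
        }
    }

    return false;
}

// Usage: at PRIMARY strength the collator ignores accent and case differences.
$collator = new Collator('en');
$collator->setStrength(Collator::PRIMARY);

$position = collatorStripos($collator, 'Duke of Brunswick-Lüneburg', 'luneburg');
```

Whether a German tailoring such as the locale 'de@collation=phonebook' would also make ü compare equal to ue at this strength is an assumption that would need to be checked against the ICU version in use.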

fisharebest commented 3 years ago

We need to search/filter twice.

1) using SQL to find records
2) using PHP to exclude matches in certain non-genealogy fields.

I guess a better solution might be to add a new column to the database containing just the searchable text. We can remove accents and convert to lower case before storing it in the database.

This would make 1) faster and eliminate 2)
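As an illustration of what could be stored in such a column, the sketch below builds a folded search text from a raw GEDCOM record. Everything here is an assumption made for the example: the function name buildSearchText, the list of excluded tags (CHAN, _UID, RIN) and the folding rules are hypothetical and are not taken from webtrees. It requires the intl extension.

```php
<?php
/**
 * Build the text for a hypothetical "searchable text" column:
 * keep only the values of GEDCOM lines, skip excluded tags together with
 * their subordinate lines, then remove accents and convert to lower case.
 *
 * @param string[] $excludedTags Hypothetical list of non-genealogy tags.
 */
function buildSearchText(string $gedcom, array $excludedTags = ['CHAN', '_UID', 'RIN']): string
{
    $values    = [];
    $skipBelow = null; // level of the excluded tag currently being skipped

    foreach (preg_split('/\R/u', $gedcom) as $line) {
        // level, optional @XREF@, tag, optional value
        if (preg_match('/^(\d+) (?:@[^@]+@ )?(\w+)(?: (.*))?$/u', $line, $match) !== 1) {
            continue;
        }

        $level = (int) $match[1];

        // Still inside an excluded structure: skip its subordinate lines.
        if ($skipBelow !== null && $level > $skipBelow) {
            continue;
        }
        $skipBelow = null;

        if (in_array($match[2], $excludedTags, true)) {
            $skipBelow = $level;
            continue;
        }

        if (($match[3] ?? '') !== '') {
            $values[] = $match[3];
        }
    }

    $text = implode(' ', $values);

    // Remove accents; keep the text unchanged if intl cannot do it.
    $transliterator = Transliterator::create('NFD; [:Nonspacing Mark:] Remove; NFC');
    $folded         = $transliterator === null ? false : $transliterator->transliterate($text);

    return mb_strtolower($folded === false ? $text : $folded, 'UTF-8');
}
```

The search term would need the same folding before it is compared with the column, otherwise a search for Émile would no longer find the stored text emile.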

The search uses the MySQL locale - so for German we should match ue with ü (but not for other languages).

However,

I am not sure it is working properly
the PHP filtering will not match it

Norwegian-Sardines commented 3 years ago

Greg said: "The search uses the MySQL locale - so for German we should match ue with ü (but not for other languages)."

I'm not sure what this actually means for me! I use English in my browser (others could use their native language), but I may enter a name with Norwegian, German, Danish, or Swedish letters (ø, ö, å, æ, ß, etc.), each of which has its own English equivalent. A person may have migrated to an English-speaking place and used these equivalents once they began to reside there. For example, the Norwegian first name Bård could be written in English as Baard or Bard. The German surname Müller could become Muller or Mueller in English.

So suppose we entered all of the various spellings of accented names in the database, since we are in the habit of entering the name as it was found in the source. Would we get the right data returned if we searched for Bard or Muller in English, but the names were entered as Bård or Baard, or Müller or Mueller?

fisharebest commented 3 years ago

"I'm not sure what this actually means"

It means that searching with "German collation rules" should give different results from searching with "International collation rules", e.g.

mysql> SELECT 'ü' = 'ue' COLLATE utf8mb4_german2_ci;
+----------------------------------------+
| 'ü' = 'ue' COLLATE utf8mb4_german2_ci  |
+----------------------------------------+
|                                      1 |
+----------------------------------------+
1 row in set (0.00 sec)

mysql> SELECT 'ü' = 'ue' COLLATE utf8mb4_unicode_ci;
+----------------------------------------+
| 'ü' = 'ue' COLLATE utf8mb4_unicode_ci  |
+----------------------------------------+
|                                      0 |
+----------------------------------------+
1 row in set (0.00 sec)

fisharebest / webtrees

General Search - name with accent #3646