Closed mpbraendle closed 4 months ago
There looks to be two issues here:
eprint__rindex.word
is utf8_bin
if this is utf8_general_ci
then if you search for Bzdusek
then you get a result even of the eprint__rindex.word
value is set to bzdušek
.bzdušek
for eprint__rindex.word
it tries to filter on bzdu
and ek
as the only indexed term is bzdušek
and the filter match exactly (not partial) then no results are found.So 1 can be fixed by hand but some investigation is needed to ensure this table is created differently in future to use utf8_general_ci
collation rather than utf8_bin
, so this does not need to be manually fixed on new repositories. utf8_general_ci
collation is already used for the eprint__rindex
table but not for the word
amongst other fields it contains.
2 I expect will require a code change because there must be something this treats accented characters as a separator rather than a valid character that may appear in a value for eprint__rindex.word
.
I have debugged to some part.
Point 1 should be irrelevant - Index::Tokenizer::apply_mapping should take care that only ASCII characters are stored in the word . For author names, it is called from MetaField::Name::get_search_conditions (for searching) and MetaField::Name::get_index_codes_basic. Due to some reason, the encoding seems not to be correct and apply_mapping doesn't transliterate. Also, apply_mapping is able to transliterate multiple characters into one and also has special rules. I doubt that utf8_general_ci will do that.
Point 2 is not clear, if apply_mapping works as expected from MetaField::Name::get_index_codes_basic, that shouldn't be relevant, neither.
You are correct, that a form of utf8_bin is used currently, in our case it is utf8mb3_bin:
show full columns from eprint__rindex; +----------+--------------+-------------+------+-----+---------+-------+---------------------------------+---------+ | Field | Type | Collation | Null | Key | Default | Extra | Privileges | Comment | +----------+--------------+-------------+------+-----+---------+-------+---------------------------------+---------+ | eprintid | int(11) | NULL | NO | PRI | 0 | | select,insert,update,references | | | field | varchar(64) | utf8mb3_bin | NO | PRI | | | select,insert,update,references | | | word | varchar(128) | utf8mb3_bin | NO | PRI | | | select,insert,update,references | | +----------+--------------+-------------+------+-----+---------+-------+---------------------------------+---------+
There is a lot I still have to learn about EPrints, and indeed Unicode. Am wondering why the need to have character mapping in the first place. Is it helpful to know that diacritic(accent) insensitive matching/comparison is possible with Perl Core / Perl's Standard Modules? https://www.perl.com/pub/2012/06/perlunicook-case--and-accent-insensitive-comparison.html/
Here a follow-up:
After two full days of debugging and trying out many variants and getting more gray hair, we think it is a problem how the hash $EPrints::Index::FREETEXT_CHAR_MAPPING in Index/Tokenizer.pm is addressed. This behaves completely erratically, sometimes š is translated to s, sometimes not. It is as sometimes the hash would not exist. This problem is observed when characters with UTF codepoint > 0x00ff are used (non-Ascii chars). It might be that a “use 5.8.0” might remedy this (not tried out) by using the old Unicode implementation of perl.
However, we applied a solution now that we also use in cfg.d/optional_filename_sanitise.pl to transliterate file names and in several import plugins, which is much simpler and failsafe: Text::Unidecode
This library separates the upper and lower bytes of an UTF8 char and then adresses the transliteration tables, which are arrays, not hashes, by the respective integer value of the UTF8 bytes. Since the transliteration tables are very extensive, maintaining $EPrints::Index::FREETEXT_CHAR_MAPPING is not necessary at all. Also, it is possible to override the Text::Unidecode transliteration tables if one needs to. See https://metacpan.org/pod/Text::Unidecode Also, I see that it’s part of the EPrints 3.3 package (but has been removed with EPrints 3.4).
I think Text::Unidecode got removed from being packaged with EPrints because its was felt liable to be updated on a regular basis so was not an ideal module to package within the codebase. Looking at https://metacpan.org/dist/Text-Unidecode/changes it does not look like it is actually updated that much, so that concern may be moot. However, the general practice of packaging Perl modules in the codebase that are available in CPAN, seems like the wrong thing to do IMO. A better option is to work on ensuring https://github.com/eprints/eprints3.4/blob/HEAD/cpan_modules.pl is kept up to date, which is being look at in #373.
I agree that maintaining EPrints' own $EPrints::Index::FREETEXT_CHAR_MAPPING
seems like a time-consuming exercise, which is duplicating effort that someone else has already done a more comprehensive job. I am not sure if this is something that can be dealt with for the next release so I am going to punt this into be looked into in the next release. Unless you can provided specific advice on how to amend the code suitably to move to using Text::Unidecode
directly in the core codebase.
I have taken a closer look at this an inserting a unidecode
should negate the need for much of the $EPrints::Index::FREETEXT_CHAR_MAPPING
. However, it feels like it is best to leave this in for now even though theoretically it should have minimal effect, as anything it would have remapped will have already been done by by Text::Unidecode::unidecode
.
We have detected an indexing problem with perl_lib/EPrints/Index/Tokenizer.pm
Characters which are above the ASCII table (UTF-8 code point > 0x00ff) are not translated correctly for creating the words in the reverse index, although they are listed in the $EPrints::Index::FREETEXT_CHAR_MAPPING map.
The reverse index (eprint__rindex) for one of the author names having a special character is now a mixture of both versions, e.g. Bzdušek vs. Bzdusek. If we reindex one of the older records, the reverse index entry it is reverted from Bzdusek to Bzdušek.
If we search with Bzdušek, the records are not found.
We assume that this exists since we upgraded to RHEL 8 and perl 5.26.3
BTW: The Tokenizer code for EPrints 3.3 and EPrints 3.4 is quite different: https://github.com/eprints/eprints/blob/3.3/perl_lib/EPrints/Index/Tokenizer.pm https://github.com/eprints/eprints3.4/blob/master/perl_lib/EPrints/Index/Tokenizer.pm
We have tried both versions, to no avail.
Have others observed similar problems with perl 5.26 or higher? As far as I have seen from perl documentation, Unicode support has changed (e.g. :encoding has been deprecated and removed).