eprints / eprints3.4

EPrints 3.4 core and releases
http://www.eprints.org/uk/index.php/eprints-3-4/
GNU Lesser General Public License v3.0
31 stars 28 forks source link

Index/Tokenizer problem (RHEL 8, perl 5.26) #374

Closed mpbraendle closed 4 months ago

mpbraendle commented 8 months ago

We have detected an indexing problem with perl_lib/EPrints/Index/Tokenizer.pm

Characters which are above the ASCII table (UTF-8 code point > 0x00ff) are not translated correctly for creating the words in the reverse index, although they are listed in the $EPrints::Index::FREETEXT_CHAR_MAPPING map.

The reverse index (eprint__rindex) for one of the author names having a special character is now a mixture of both versions, e.g. Bzdušek vs. Bzdusek. If we reindex one of the older records, the reverse index entry it is reverted from Bzdusek to Bzdušek.

If we search with Bzdušek, the records are not found.

We assume that this exists since we upgraded to RHEL 8 and perl 5.26.3

BTW: The Tokenizer code for EPrints 3.3 and EPrints 3.4 is quite different: https://github.com/eprints/eprints/blob/3.3/perl_lib/EPrints/Index/Tokenizer.pm https://github.com/eprints/eprints3.4/blob/master/perl_lib/EPrints/Index/Tokenizer.pm

We have tried both versions, to no avail.

Have others observed similar problems with perl 5.26 or higher? As far as I have seen from perl documentation, Unicode support has changed (e.g. :encoding has been deprecated and removed).

drn05r commented 8 months ago

There looks to be two issues here:

  1. The collation for eprint__rindex.word is utf8_bin if this is utf8_general_ci then if you search for Bzdusek then you get a result even of the eprint__rindex.word value is set to bzdušek.
  2. If the search term contains accented characters rather than filtering onbzdušek for eprint__rindex.word it tries to filter on bzdu and ek as the only indexed term is bzdušek and the filter match exactly (not partial) then no results are found.

So 1 can be fixed by hand but some investigation is needed to ensure this table is created differently in future to use utf8_general_ci collation rather than utf8_bin, so this does not need to be manually fixed on new repositories. utf8_general_ci collation is already used for the eprint__rindex table but not for the word amongst other fields it contains.

2 I expect will require a code change because there must be something this treats accented characters as a separator rather than a valid character that may appear in a value for eprint__rindex.word.

mpbraendle commented 8 months ago

I have debugged to some part.

Point 1 should be irrelevant - Index::Tokenizer::apply_mapping should take care that only ASCII characters are stored in the word . For author names, it is called from MetaField::Name::get_search_conditions (for searching) and MetaField::Name::get_index_codes_basic. Due to some reason, the encoding seems not to be correct and apply_mapping doesn't transliterate. Also, apply_mapping is able to transliterate multiple characters into one and also has special rules. I doubt that utf8_general_ci will do that.

Point 2 is not clear, if apply_mapping works as expected from MetaField::Name::get_index_codes_basic, that shouldn't be relevant, neither.

You are correct, that a form of utf8_bin is used currently, in our case it is utf8mb3_bin:

show full columns from eprint__rindex; +----------+--------------+-------------+------+-----+---------+-------+---------------------------------+---------+ | Field | Type | Collation | Null | Key | Default | Extra | Privileges | Comment | +----------+--------------+-------------+------+-----+---------+-------+---------------------------------+---------+ | eprintid | int(11) | NULL | NO | PRI | 0 | | select,insert,update,references | | | field | varchar(64) | utf8mb3_bin | NO | PRI | | | select,insert,update,references | | | word | varchar(128) | utf8mb3_bin | NO | PRI | | | select,insert,update,references | | +----------+--------------+-------------+------+-----+---------+-------+---------------------------------+---------+

ajmetz commented 5 months ago

There is a lot I still have to learn about EPrints, and indeed Unicode. Am wondering why the need to have character mapping in the first place. Is it helpful to know that diacritic(accent) insensitive matching/comparison is possible with Perl Core / Perl's Standard Modules? https://www.perl.com/pub/2012/06/perlunicook-case--and-accent-insensitive-comparison.html/

mpbraendle commented 5 months ago

Here a follow-up:

After two full days of debugging and trying out many variants and getting more gray hair, we think it is a problem how the hash $EPrints::Index::FREETEXT_CHAR_MAPPING in Index/Tokenizer.pm is addressed. This behaves completely erratically, sometimes š is translated to s, sometimes not. It is as sometimes the hash would not exist. This problem is observed when characters with UTF codepoint > 0x00ff are used (non-Ascii chars). It might be that a “use 5.8.0” might remedy this (not tried out) by using the old Unicode implementation of perl.

However, we applied a solution now that we also use in cfg.d/optional_filename_sanitise.pl to transliterate file names and in several import plugins, which is much simpler and failsafe: Text::Unidecode

This library separates the upper and lower bytes of an UTF8 char and then adresses the transliteration tables, which are arrays, not hashes, by the respective integer value of the UTF8 bytes. Since the transliteration tables are very extensive, maintaining $EPrints::Index::FREETEXT_CHAR_MAPPING is not necessary at all. Also, it is possible to override the Text::Unidecode transliteration tables if one needs to. See https://metacpan.org/pod/Text::Unidecode Also, I see that it’s part of the EPrints 3.3 package (but has been removed with EPrints 3.4).

drn05r commented 4 months ago

I think Text::Unidecode got removed from being packaged with EPrints because its was felt liable to be updated on a regular basis so was not an ideal module to package within the codebase. Looking at https://metacpan.org/dist/Text-Unidecode/changes it does not look like it is actually updated that much, so that concern may be moot. However, the general practice of packaging Perl modules in the codebase that are available in CPAN, seems like the wrong thing to do IMO. A better option is to work on ensuring https://github.com/eprints/eprints3.4/blob/HEAD/cpan_modules.pl is kept up to date, which is being look at in #373.

I agree that maintaining EPrints' own $EPrints::Index::FREETEXT_CHAR_MAPPING seems like a time-consuming exercise, which is duplicating effort that someone else has already done a more comprehensive job. I am not sure if this is something that can be dealt with for the next release so I am going to punt this into be looked into in the next release. Unless you can provided specific advice on how to amend the code suitably to move to using Text::Unidecode directly in the core codebase.

drn05r commented 4 months ago

I have taken a closer look at this an inserting a unidecode should negate the need for much of the $EPrints::Index::FREETEXT_CHAR_MAPPING. However, it feels like it is best to leave this in for now even though theoretically it should have minimal effect, as anything it would have remapped will have already been done by by Text::Unidecode::unidecode.