How best to extend indexing for different languages and scripts on AndBible

AndBible / and-bible

AndBible: Bible Study

https://andbible.org

GNU General Public License v3.0


How best to extend indexing for different languages and scripts on AndBible #3273

Open MarkLee196 opened 3 months ago

MarkLee196 commented 3 months ago

Texts in different languages and scripts sometimes need different handling, whether for rendering or for indexing, and the open source nature of AndBible means that it is better suited than many apps to provide this. AndBible already has significant capacity for rendering less well supported scripts, with features such as customizable fonts. For indexing, however, the existing documentation on Lucene indexing is limited, so it is not easy to say what is already possible, and features such as customizable Lucene indexes are not yet implemented. Thoughts and ideas on this matter are invited below.

MarkLee196 commented 3 months ago

I have confirmed by testing that changing the language in the configuration file can in some cases change the analyzer used by Lucene to build the index. Though in most cases the same analyzer seemed to be used, the configuration "LANG=zh-Hans" (Chinese) produced a notably different index. It seems that Lucene was using the 'zh' (Zhongwen, i.e. Chinese) part of the language code to choose which analyzer to use, rather than the script part 'Hans' (Hanzi Simplified, i.e. simplified CJKV characters). It would be useful for module producers to know which values for LANG call specific analyzers.
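To illustrate the behaviour described above, here is a minimal Java sketch of how a lookup keyed only on the primary language subtag would treat "zh-Hans" and "zh-Hant" identically. `Locale.forLanguageTag`, `getLanguage` and `getScript` are standard Java; the lookup table and the analyzer names in it are hypothetical placeholders and are not taken from JSword or Lucene.

```java
import java.util.Locale;
import java.util.Map;

class LangTagDemo {
    // Hypothetical lookup table keyed on the primary language subtag only.
    // The values are placeholder names, not analyzers that JSword registers.
    static final Map<String, String> ANALYZER_BY_LANGUAGE = Map.of(
            "zh", "chinese-analyzer",
            "en", "english-analyzer");

    // Picks an analyzer name from the language subtag; the script subtag is ignored.
    static String pickAnalyzer(String langTag) {
        Locale locale = Locale.forLanguageTag(langTag);
        return ANALYZER_BY_LANGUAGE.getOrDefault(locale.getLanguage(), "default-analyzer");
    }

    // Returns the script subtag (for example "Hans"), or "" if the tag has none.
    static String script(String langTag) {
        return Locale.forLanguageTag(langTag).getScript();
    }

    public static void main(String[] args) {
        for (String tag : new String[] {"zh-Hans", "zh-Hant", "en", "vi-Hani"}) {
            System.out.println(tag
                    + " -> language=" + Locale.forLanguageTag(tag).getLanguage()
                    + ", script=" + script(tag)
                    + ", analyzer=" + pickAnalyzer(tag));
        }
    }
}
```

If the real selection works along these lines, a module whose text is in a script that needs special handling but whose language code has no dedicated entry would silently fall back to the default analyzer.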

MarkLee196 commented 3 months ago

There are at least 3 ways that it would be possible to implement customized indexing: (1) by implementing the use of custom Lucene indexes for modules; (2) by the use of custom Lucene analyzers, see for example here; (3) by implementing the use of non-Lucene searches (this is what diatheke, the command line frontend to SWORD, has as an option). Of course, implementing more than one way to customize indexes and searching would be better than just one.
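As a rough illustration of option (3), the sketch below searches without any index by scanning every verse for the query string. It is not how diatheke or JSword implement their searches; the class name, the verse map and the NFC normalization step are all assumptions made for the example.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LinearScanSearch {
    // Normalize to NFC so that composed and decomposed forms compare equal.
    static String normalize(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Returns the keys of all verses whose text contains the query string.
    // No tokenizer is involved, so nothing is dropped before matching.
    static List<String> search(Map<String, String> verses, String query) {
        String needle = normalize(query);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : verses.entrySet()) {
            if (normalize(entry.getValue()).contains(needle)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, String> verses = new LinkedHashMap<>();
        verses.put("Gen 1:1", "In the beginning God created the heaven and the earth.");
        verses.put("John 1:1", "In the beginning was the Word.");
        System.out.println(search(verses, "Word"));
    }
}
```

A scan like this is slower than an index lookup on large modules, but it sidesteps analyzer behaviour entirely, which is why it may be useful as a fallback for scripts that the analyzers handle poorly.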

MarkLee196 commented 3 months ago

At present, the issue with Lucene indexes on AndBible that I have not yet been able to solve by using different values for LANG in the configuration files is the handling of surrogate characters, i.e. characters at U+10000 and above, which UTF-16 encodes as surrogate pairs. It seems that the analyzers are unable to deal with these properly and remove part of the string. In some cases searching for a string that includes surrogate characters does not match, as what is stored in the index is partial; in other cases, when the search string includes surrogate characters, extra false matches are produced. This suggests that the solution is for the analyzer to use the appropriate tokenizer.
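The failure mode described above can be reproduced in plain Java, independent of Lucene. The sketch below contrasts a tokenizer that walks UTF-16 code units with one that walks code points. It is only an illustration of why a character above U+FFFF can end up split or partially stored; it is an assumption, not a confirmed diagnosis, that the analyzers in use behave like the naive version. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointTokens {
    // Naive approach: one token per UTF-16 code unit.
    // A character above U+FFFF is split into two lone surrogates.
    static List<String> byCodeUnit(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(String.valueOf(text.charAt(i)));
        }
        return tokens;
    }

    // Code-point-aware approach: one token per Unicode code point,
    // so each surrogate pair stays together as a single token.
    static List<String> byCodePoint(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int codePoint = text.codePointAt(i);
            tokens.add(new String(Character.toChars(codePoint)));
            i += Character.charCount(codePoint);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // U+20000 (CJK Extension B) followed by U+4E00 (a BMP character).
        String text = new String(Character.toChars(0x20000)) + "\u4E00";
        System.out.println("code units:  " + byCodeUnit(text).size());
        System.out.println("code points: " + byCodePoint(text).size());
    }
}
```

With the naive version, a lone surrogate in the index cannot match the full character in a query, and two different characters that share a high surrogate could produce the same partial token, which would be consistent with both the missed matches and the false matches reported above.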

tuomas2 commented 1 month ago

Maybe upgrading Lucene (https://github.com/AndBible/jsword/pull/16) could help?