forTEXT / catma

Computer Assisted Text Markup and Analysis
https://www.catma.de
GNU General Public License v3.0

Analyze: No word segmentation/incorrect tokenization for Chinese #332

Open pcdi opened 1 year ago

pcdi commented 1 year ago

Describe the bug The analyze module does not perform correct word segmentation for Chinese texts. Because Chinese does not use whitespace to separate words, CATMA treats only punctuation marks as word breaks. A query like wild="人人" therefore returns matches only if the characters are surrounded by whitespace or punctuation, which is rarely the case in Chinese texts. Queries like wild="%人人%" instead return whole subclauses or phrases sandwiched between two punctuation marks, which is also not ideal: the match is not a "word containing the query" but a "sentence containing the query".

This issue also extends to the other analysis tools, such as KWIC, where the left/right contexts consist of whole sentences instead of a few words before/after the match.

To Reproduce Steps to reproduce the behavior:

  1. Import Chinese text
  2. Go to Analyze
  3. Run Queries
  4. Run KWIC

Expected behavior CATMA should not use whitespace as the word-boundary delimiter for CJK scripts as it does for Latin script. It could either use a proper Chinese segmentation tool, or fall back to a single-character approach in which each Chinese character is treated as its own word.
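To illustrate the single-character fallback, here is a minimal tokenizer sketch (not CATMA's actual implementation; the character class covers only the two main CJK Unified Ideographs blocks, and real code would need the extension blocks too):

```python
import re

# Each CJK ideograph becomes its own token; runs of non-space, non-CJK
# characters (e.g. Latin words) are kept together as whitespace-delimited tokens.
TOKEN = re.compile(r"[\u4e00-\u9fff\u3400-\u4dbf]|[^\s\u4e00-\u9fff\u3400-\u4dbf]+")

def tokenize(text):
    return TOKEN.findall(text)

print(tokenize("人人生而自由"))  # ['人', '人', '生', '而', '自', '由']
print(tokenize("CATMA 人人"))    # ['CATMA', '人', '人']
```

With tokens like these, wild="%人人%" would match a character-level window around the query rather than a whole punctuation-delimited clause, and KWIC contexts would shrink to a few characters per side.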

Information about your environment

Additional context A mature segmentation tool for Chinese would be, for example, Jieba.

maltem-za commented 1 year ago

@pcdi Just a quick note to say thank you for your recent submissions here! We're currently quite distracted by the release of CATMA 7 and associated tasks, but hopefully we can take a look at these issues in the not-too-distant future.