Open pcdi opened 1 year ago
@pcdi Just a quick note to say thank you for your recent submissions here! We're currently quite distracted by the release of CATMA 7 and associated tasks, but hopefully we can take a look at these issues in the not-too-distant future.
Describe the bug The analyze module does not segment Chinese texts correctly. Because Chinese does not separate words with whitespace, CATMA effectively treats only punctuation marks as word breaks. So a query like
wild="人人"
only returns matches that are surrounded by whitespace or punctuation, which is not normally the case in Chinese texts. Queries like
wild="%人人%"
then return whole clauses or phrases that are sandwiched between two punctuation marks. This is also not ideal, as the match is not a "word containing the query" but rather a "sentence containing the query". The issue also extends to the other analysis tools, such as KWIC, where the left/right contexts consist of whole sentences instead of a couple of words before/after the match.
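To make the behaviour described above concrete, here is a minimal Python sketch. It is not CATMA's actual code: `naive_tokens` and the `BREAKS` pattern are hypothetical stand-ins for any tokenizer that breaks only on whitespace and punctuation, and the sample sentence is chosen purely for illustration.

```python
import re

# Hypothetical stand-in for a tokenizer that breaks only on whitespace and
# punctuation (ASCII plus a few common CJK marks). Not CATMA's implementation.
BREAKS = re.compile(r"[\s,.!?;:,。!?;:、]+")

def naive_tokens(text):
    """Split text on whitespace/punctuation and drop empty pieces."""
    return [t for t in BREAKS.split(text) if t]

text = "人人生而自由,在尊严和权利上一律平等。"
tokens = naive_tokens(text)

# An exact "word" match finds nothing, because 人人 is never a token by itself.
exact = [t for t in tokens if t == "人人"]

# A "contains" match returns the whole clause between two punctuation marks.
containing = [t for t in tokens if "人人" in t]
```

With this kind of tokenization every clause is a single "word", which would also explain why KWIC contexts built from such tokens span whole clauses.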
To Reproduce Steps to reproduce the behavior:
Expected behavior CATMA should not take whitespace as the word boundary delimiter for CJK scripts as it does with Latin script. Either it could use a proper Chinese segmentation tool, or it could use a single-character approach, where each Chinese character is treated as its own word.
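The single-character approach could look roughly like the following Python sketch. It is only an illustration under stated assumptions: the names `tokenize`, `find_phrase` and `kwic` are hypothetical, and the character ranges cover just the CJK Unified Ideographs block and Extension A, so a real implementation would likely need a broader definition of "CJK character".

```python
import re

# Assumption: only CJK Unified Ideographs (U+4E00-U+9FFF) and Extension A
# (U+3400-U+4DBF) are treated as single-character words.
CJK = "\u3400-\u4dbf\u4e00-\u9fff"

# One token per CJK character, or a run of non-CJK word characters.
TOKEN = re.compile(rf"[{CJK}]|[^\W{CJK}]+")

def tokenize(text):
    """Return tokens; whitespace and punctuation are skipped."""
    return TOKEN.findall(text)

def find_phrase(tokens, query_tokens):
    """Return start indices where query_tokens occurs as a token sequence."""
    n = len(query_tokens)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == query_tokens]

def kwic(tokens, start, length, window=3):
    """Return (left context, match, right context) as token lists."""
    left = tokens[max(0, start - window):start]
    right = tokens[start + length:start + length + window]
    return left, tokens[start:start + length], right

tokens = tokenize("CATMA 说人人生而自由。")
hits = find_phrase(tokens, tokenize("人人"))
context = kwic(tokens, hits[0], 2, window=2)
```

Here a query such as 人人 is matched as a sequence of two single-character tokens, so the KWIC context is a few characters on each side rather than a whole sentence. A dictionary-based segmenter would give linguistically better word boundaries, at the cost of an extra dependency.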
Information about your environment
Additional context A mature segmentation tool for Chinese would be, for example, Jieba.