apertium / lttoolbox

Finite state compiler, processor and helper tools used by apertium
http://wiki.apertium.org/wiki/Lttoolbox
GNU General Public License v2.0

Unicode lemma of tags-item in TSX file does not work #77

Closed: Fred-Git-Hub closed this 3 years ago

Fred-Git-Hub commented 4 years ago

https://github.com/apertium/lttoolbox/blob/0285babcb7ad1bb86c9b7d88c7b1db90de96b6c9/lttoolbox/pattern_list.cc#L127

result.push_back(int((unsigned char) lemma[i])); should be result.push_back(int((wchar_t) lemma[i]));

Otherwise, a non-ASCII (Unicode) lemma in a tags-item element of a TSX file will not work.
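
To illustrate why the cast matters, here is a minimal sketch, assuming the lemma is held in a std::wstring. The helper names encode_narrow and encode_wide are hypothetical and do not exist in lttoolbox; they only isolate the two casts. Casting a wide character to unsigned char keeps just its low 8 bits, so a character such as U+C544 collapses to 0x44 (the code of ASCII 'D') and can no longer match the lemma in the input.

```cpp
#include <string>
#include <vector>

// Hypothetical helper mirroring the original cast: every wide character
// is truncated to its low 8 bits before being stored as an int.
std::vector<int> encode_narrow(const std::wstring& lemma) {
    std::vector<int> result;
    for (wchar_t c : lemma) {
        result.push_back(int((unsigned char) c));
    }
    return result;
}

// Hypothetical helper mirroring the proposed cast: the full wide
// character value is preserved.
std::vector<int> encode_wide(const std::wstring& lemma) {
    std::vector<int> result;
    for (wchar_t c : lemma) {
        result.push_back(int((wchar_t) c));
    }
    return result;
}
```

For a pure ASCII lemma both helpers return the same values, which would explain why the problem only shows up with lemmas outside the 8-bit range.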

[Test case] unicode.tsx

<?xml version="1.0" encoding="UTF-8"?>
<tagger name="unicode">
   <tagset>
      <def-label name="unicode" closed="true">
         <tags-item lemma="아" tags="noun"/>
      </def-label>
   </tagset>
</tagger>

With the unsigned char cast:

$ echo "^아/아<noun>$" | apertium-filter-ambiguity unicode.tsx
Warning: There is not coarse tag for the fine tag '아<noun>'
         This is because of an incomplete tagset definition or a dictionary error
^아/아<noun>$

With the wchar_t cast:

$ echo "^아/아<noun>$" | apertium-filter-ambiguity unicode.tsx
^아/아<noun>$

The same is true for apertium-tagger.

This issue is copied from https://github.com/apertium/lttoolbox/commit/76287d2f2e495d626be7200ea85f7dd712adbb84#commitcomment-36208453

unhammer commented 4 years ago

Thank you for the report and fix, great that you included a reproducible test case :-) It looks good to me, but I think someone who knows more about the tagger should take a look too.

ftyers commented 4 years ago

@unhammer, I think that's basically only @sanmarf and @jimregan. I'd say it looks like a pretty straightforward fix to apply.

jimregan commented 4 years ago

LGTM

TinoDidriksen commented 3 years ago

Is this still relevant, and should it be applied?

Fred-Git-Hub commented 3 years ago

I expected the core development team to review this part of the code and fix the bug at an appropriate point in the release cycle.

sanmarf commented 3 years ago

Hi,

Before going ahead, install apertium-tagger-training-tools and use apertium-tagger-readwords to make sure it works.

Regards, Felipe

unhammer commented 3 years ago

@TinoDidriksen the test case is still reproducible, and the patch still changes the output in the way we want. Jim said it looked good, so I applied it.