andreaslillebo closed 5 years ago
I am not familiar with Ruby, but in the case of Python 2, the problem is that it does not handle UTF-8 by default. Even if you assign a UTF-8 string to a variable, it is stored as a sequence of separate bytes rather than as characters. Thus, you need to encode and decode the strings to make sure they are UTF-8 before sending them to FreeLing. So, the line text = '2£' will work on Python 3, but in Python 2 you would need to do something like text = '2£'.decode('utf8')
Not sure if Ruby needs something similar...
However, as you say, it is quite likely that Ruby support in SWIG is not as complete with regard to UTF and wstrings, hence the need to hack the generated API.
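To make the byte-versus-character distinction above concrete, here is a minimal Python 3 sketch. It does not use FreeLing at all; the helper name decodes_alone is made up for illustration. It only shows that the text '2£' is two characters but three bytes in UTF-8, and that the two bytes of '£' are not valid UTF-8 when taken one at a time.

```python
# Python 3 sketch: text (str) versus its UTF-8 encoding (bytes).
text = "2£"                  # two code points: U+0032 and U+00A3
raw = text.encode("utf-8")   # three bytes: 0x32, 0xC2, 0xA3

print(len(text), len(raw))   # number of code points vs. number of bytes
print([hex(b) for b in raw])

# Decoding the whole byte sequence recovers the original text.
assert raw.decode("utf-8") == text


def decodes_alone(byte_value):
    """Hypothetical helper: True if a single byte is valid UTF-8 by itself."""
    try:
        bytes([byte_value]).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Splitting at byte boundaries yields pieces that are not valid UTF-8:
# 0xC2 is a lead byte missing its continuation byte, and 0xA3 is a
# continuation byte with no lead byte in front of it.
for b in raw:
    print(hex(b), decodes_alone(b))
```

If a binding hands the text over byte by byte instead of character by character, this is the kind of split that would produce invalid code points on the receiving side.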
Thanks for the info.
I'll try and see if I can find a way to make it work.
Running the following code:
Results in the following output:
When the input text contains any multi-byte character, and get_form or get_lemma is called on the instance of Freeling::Word referencing the multi-byte character, it throws a 'utf8::invalid_code_point' error.

It seems like each byte (8 bits) of the multi-byte character is treated as a separate character, as the sentence in the above example contains 3 "words":

0xC2 and 0xA3 are indeed invalid in UTF-8 when each is taken on its own.

Also worth noting is the result of outputting only the tag for each word in the above example:
Which prints out:
According to the user manual (https://talp-upc.gitbook.io/freeling-4-1-user-manual/tagsets/tagset-en), Fz corresponds to: