Ruby API: utf8::invalid_code_point

andreaslillebo commented 5 years ago

Running the following code:

require './freeling'

Freeling::Util.init_locale('default')

FLDIR = '/usr/local/share/freeling'.freeze
LANG = 'en'.freeze

tokenizer = Freeling::Tokenizer.new("#{FLDIR}/#{LANG}/tokenizer.dat")
splitter = Freeling::Splitter.new("#{FLDIR}/#{LANG}/splitter.dat")
session = splitter.open_session

maco_options = Freeling::Maco_options.new(LANG)
maco_options.set_data_files(
  '',
  "#{FLDIR}/common/punct.dat",
  "#{FLDIR}/#{LANG}/dicc.src",
  "#{FLDIR}/#{LANG}/afixos.dat",
  '',
  "#{FLDIR}/#{LANG}/locucions.dat",
  "#{FLDIR}/#{LANG}/np.dat",
  "#{FLDIR}/#{LANG}/quantities.dat",
  "#{FLDIR}/#{LANG}/probabilitats.dat"
)

morphological_analyzer = Freeling::Maco.new(maco_options)

# activate the morpho modules to be used in the next call
morphological_analyzer.set_active_options(
  false, # umap - User Map
  true,  # num -  Number Detection
  true,  # pun -  Punctuation Detection
  true,  # dat -  Dates Detection
  true,  # dic -  Dictionary Search (also splits words)
  true,  # aff -  Affix (?)
  false, # comp - Compounds (?)
  true,  # rtk -  Retokenization (?)
  true,  # mw -   Multiword Recognition
  true,  # ner -  Named Entity Recognition
  true,  # qt -   Quantity Recognition
  true   # prb -  Probability Assignment and Unknown Word Guesser
)

# create tagger and sense annotator
tagger = Freeling::Hmm_tagger.new("#{FLDIR}/#{LANG}/tagger.dat", true, 2)
sense_labeler = Freeling::Senses.new("#{FLDIR}/#{LANG}/senses.dat")

text = '2£'

tokens = tokenizer.tokenize(text)
sentences = splitter.split(session, tokens, true)
sentences = morphological_analyzer.analyze(sentences)
sentences = tagger.analyze(sentences)
sentences = sense_labeler.analyze(sentences)

sentences.each do |sentence|
  sentence.each do |word|
    puts word.get_form + ' ' + word.get_lemma + ' ' + word.get_tag + ' ' + word.get_senses_string
  end
  puts ' '
end

splitter.close_session(session)

Results in the following output:

2 2 Z
terminate called after throwing an instance of 'utf8::invalid_code_point'
  what():  Invalid code point
Aborted (core dumped)

When the input text contains any multi-byte character, and get_form or get_lemma is called on the instance of Freeling::Word referencing the multi-byte character, it throws a 'utf8::invalid_code_point' error.

It seems like each byte (8 bits) of the multi-byte character is treated as a separate character, as the sentence in the above example contains 3 "words":

text.bytes.map(&:chr)
=> ["2", "\xC2", "\xA3"]

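Taken on their own, 0xC2 and 0xA3 are indeed not valid UTF-8: 0xC2 is a lead byte that requires a continuation byte, and 0xA3 is a continuation byte with no lead byte. Only together do they encode £ (U+00A3).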
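The byte-versus-character mismatch can be checked in plain Ruby, without FreeLing. The following is a minimal sketch that only uses core String and Integer methods; the variable names are illustrative, and it says nothing about what the SWIG wrapper does internally:

```ruby
# "2£" written with an escape so the example does not depend on the source file encoding
text = "2\u00A3"

chars = text.chars   # characters as Ruby sees them: "2" and "£"
bytes = text.bytes   # underlying bytes: [50, 194, 163], i.e. 0x32, 0xC2, 0xA3

# Treat every byte as if it were a complete UTF-8 string and check its validity
validity = bytes.map { |b| b.chr.force_encoding(Encoding::UTF_8).valid_encoding? }

# Reassembling the bytes and tagging them as UTF-8 restores the original text
rejoined = bytes.pack('C*').force_encoding(Encoding::UTF_8)

puts chars.length      # 2
puts bytes.length      # 3
p validity             # [true, false, false]
puts rejoined == text  # true
```

Only the ASCII byte for "2" is valid on its own, which is consistent with the first word printing correctly and the following two aborting.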

Also worth noting is the output when only the tag of each word is printed in the above example:

puts word.get_tag

This prints:

Z
Fz
Fz

According to the user manual (https://talp-upc.gitbook.io/freeling-4-1-user-manual/tagsets/tagset-en), Fz corresponds to:

Fz | pos:punctuation;   type:other

Running FreeLing from the command line works without problems. The same goes for the python3 API.
The manual (https://talp-upc.gitbook.io/freeling-4-1-user-manual/installation/calling-freeling-library-from-languages-other-than-c++/apis-linux#apis-requirements) mentions: "You can use Python2, but if you have encoding problems, we told you so...". Perhaps this is related.
The hack in #81 makes changes to the setter method of what seems to be an iterator over multi-byte character strings. Perhaps this is what it broke (though I've tried removing the whole setter method, and it recompiles and runs without any other errors).
These lines: https://github.com/TALP-UPC/FreeLing/blob/94667d48939cb610584cebad63d5c09349c0d53f/APIs/ruby/freeling_rubyAPI.i#L46-L101 seem to be focused on some string to wide-string conversion. Perhaps those lines are an attempt to solve this very issue?

lluisp commented 5 years ago

I am not familiar with Ruby, but in the case of Python 2, the problem is that it does not handle UTF-8 by default. Even if you assign a UTF-8 string to a variable, it will be stored as separate bytes. Thus, you need to encode and decode the strings to make sure they are UTF-8 before sending them to FreeLing. So, the line text = '2£' will work in Python 3, but in Python 2 you would need to do something like text = '2£'.decode('utf8')

Not sure if Ruby needs something similar...
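For comparison, the closest Ruby counterpart to Python 2's decode is retagging a byte string with force_encoding. The sketch below is an assumption-laden illustration: ensure_utf8 is a hypothetical helper, not part of FreeLing, and nothing in this thread confirms that the Ruby bindings fail because of the encoding tag on the input string:

```ruby
# Hypothetical helper (not part of FreeLing): return a copy of str tagged as UTF-8,
# raising if its bytes do not form valid UTF-8.
def ensure_utf8(str)
  utf8 = str.dup.force_encoding(Encoding::UTF_8)
  raise ArgumentError, 'input is not valid UTF-8' unless utf8.valid_encoding?
  utf8
end

# The bytes of "2£" tagged as binary (ASCII-8BIT), e.g. as read from a binary stream
raw = "2\xC2\xA3".b

text = ensure_utf8(raw)

puts raw.encoding    # ASCII-8BIT
puts text.encoding   # UTF-8
puts text.length     # 2
```

Since Ruby 2.0, string literals in a source file are UTF-8 by default, so a literal like '2£' normally needs no such step; the helper only matters for input that arrives tagged with another encoding.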

However, as you say, it is quite likely that Ruby support in SWIG is not as complete with regard to UTF-8 and wstrings, hence the need to hack the generated API.

andreaslillebo commented 5 years ago

Thanks for the info.

I'll try and see if I can find a way to make it work.

TALP-UPC / FreeLing

Ruby API: utf8::invalid_code_point #83