TALP-UPC / FreeLing

FreeLing project source code

Ruby API: utf8::invalid_code_point #83

Closed andreaslillebo closed 5 years ago

andreaslillebo commented 5 years ago

Running the following code:

require './freeling'

Freeling::Util.init_locale('default')

FLDIR = '/usr/local/share/freeling'.freeze
LANG = 'en'.freeze

tokenizer = Freeling::Tokenizer.new("#{FLDIR}/#{LANG}/tokenizer.dat")
splitter = Freeling::Splitter.new("#{FLDIR}/#{LANG}/splitter.dat")
session = splitter.open_session

maco_options = Freeling::Maco_options.new(LANG)
maco_options.set_data_files(
  '',
  "#{FLDIR}/common/punct.dat",
  "#{FLDIR}/#{LANG}/dicc.src",
  "#{FLDIR}/#{LANG}/afixos.dat",
  '',
  "#{FLDIR}/#{LANG}/locucions.dat",
  "#{FLDIR}/#{LANG}/np.dat",
  "#{FLDIR}/#{LANG}/quantities.dat",
  "#{FLDIR}/#{LANG}/probabilitats.dat"
)

morphological_analyzer = Freeling::Maco.new(maco_options)

# activate morpho modules to be used in next call
morphological_analyzer.set_active_options(
  false, # umap - User Map
  true,  # num -  Number Detection
  true,  # pun -  Punctuation Detection
  true,  # dat -  Dates Detection
  true,  # dic -  Dictionary Search (also splits words)
  true,  # aff -  Affix (?)
  false, # comp - Compounds (?)
  true,  # rtk -  Retokenization (?)
  true,  # mw -   Multiword Recognition
  true,  # ner -  Named Entity Recognition
  true,  # qt -   Quantity Recognition
  true   # prb -  Probability Assignment and Unknown Word Guesser
)

# create tagger and sense annotator
tagger = Freeling::Hmm_tagger.new("#{FLDIR}/#{LANG}/tagger.dat", true, 2)
sense_labeler = Freeling::Senses.new("#{FLDIR}/#{LANG}/senses.dat")

text = '2£'

tokens = tokenizer.tokenize(text)
sentences = splitter.split(session, tokens, true)
sentences = morphological_analyzer.analyze(sentences)
sentences = tagger.analyze(sentences)
sentences = sense_labeler.analyze(sentences)

sentences.each do |sentence|
  sentence.each do |word|
    puts word.get_form + ' ' + word.get_lemma + ' ' + word.get_tag + ' ' + word.get_senses_string
  end
  puts ' '
end

splitter.close_session(session)

Results in the following output:

2 2 Z
terminate called after throwing an instance of 'utf8::invalid_code_point'
  what():  Invalid code point
Aborted (core dumped)

When the input text contains any multi-byte character, and get_form or get_lemma is called on the instance of Freeling::Word referencing the multi-byte character, it throws a 'utf8::invalid_code_point' error.

It seems like each byte (8 bits) of the multi-byte character is treated as a separate character, as the sentence in the above example contains 3 "words":

text.bytes.map(&:chr)
=> ["2", "\xC2", "\xA3"]

Taken on their own, neither 0xC2 nor 0xA3 is a valid UTF-8 sequence; only together do they encode the single character £.
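
The byte-level view can be checked in plain Ruby, without FreeLing. The sketch below compares the characters and the bytes of the same string and tests whether each byte is valid UTF-8 when taken in isolation. It assumes the script is saved as UTF-8; the variable names are only illustrative.

```ruby
# encoding: utf-8
text = '2£'

chars = text.chars   # characters as Ruby sees them: "2" and "£"
bytes = text.bytes   # raw bytes of the UTF-8 encoding: 0x32, 0xC2, 0xA3

# Re-tag each single byte as UTF-8 and ask Ruby whether it is valid alone.
valid_alone = bytes.map do |b|
  b.chr.force_encoding(Encoding::UTF_8).valid_encoding?
end

# The two trailing bytes together decode back to the pound sign.
pound = [0xC2, 0xA3].pack('C*').force_encoding(Encoding::UTF_8)

p chars        # 2 characters
p bytes        # 3 bytes
p valid_alone  # only the ASCII byte is valid by itself
p pound
```

Two characters but three bytes matches the three "words" seen above, which suggests the split happens per byte somewhere between Ruby and the C++ library.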

Also worth noting: if I only output the tag for each word in the above example:

puts word.get_tag

Which prints out:

Z
Fz
Fz

According to the user manual (https://talp-upc.gitbook.io/freeling-4-1-user-manual/tagsets/tagset-en), Fz corresponds to:

Fz | pos:punctuation;   type:other

lluisp commented 5 years ago

I am not familiar with Ruby, but in Python 2 the problem is that plain string literals are byte strings rather than Unicode strings. Even if you assign UTF-8 text to a variable, it is stored as separate bytes, so you need to decode it into a Unicode string before sending it to FreeLing. The line text = '2£' works as is in Python 3, but in Python 2 you would need something like text = '2£'.decode('utf8')

Not sure if Ruby needs something similar...
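
For reference, Ruby strings (1.9 and later) carry an encoding tag, so the closest counterpart to the Python 2 decode step is to check that tag and re-tag the bytes as UTF-8 before passing the string on. The sketch below uses only core String methods. The ensure_utf8 helper is hypothetical and not part of FreeLing, and whether this changes what the SWIG-generated binding does is not confirmed here.

```ruby
# Hypothetical helper (not part of FreeLing): return a copy of str tagged as
# UTF-8, raising if its bytes do not form valid UTF-8.
def ensure_utf8(str)
  utf8 = str.dup.force_encoding(Encoding::UTF_8)
  raise ArgumentError, 'input is not valid UTF-8' unless utf8.valid_encoding?
  utf8
end

raw  = "2\xC2\xA3".b      # same bytes as '2£', but tagged as binary
text = ensure_utf8(raw)   # same bytes, now tagged as UTF-8

puts raw.length           # counted per byte
puts text.length          # counted per character
puts text.encoding
```

If the string is already tagged as UTF-8 and valid, which is the default for literals in a UTF-8 source file, this step changes nothing and the problem more likely sits in the binding itself.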

However, as you say, it is quite likely that Ruby support in SWIG is less complete when it comes to UTF-8 and wstrings, hence the need to hack the generated API.

andreaslillebo commented 5 years ago

Thanks for the info.

I'll try and see if I can find a way to make it work.