LanguageMachines / frog

Frog is an integration of memory-based natural language processing (NLP) modules developed for Dutch. All NLP modules are based on Timbl, the Tilburg memory-based learning software package.
https://languagemachines.github.io/frog
GNU General Public License v3.0

fix language handling on plain text #62

Closed kosloot closed 5 years ago

kosloot commented 5 years ago

When provided with plain text in different languages, frog doesn't correctly ignore lines in a non-default language. Ucto DOES detect the language, but this information is (sometimes?) ignored.

E.g. an input file like:

Een regel in het Nederlands.

This isn't Dutch

Dit is geen Engels

ucto happily assigns nld and eng to these sentences. But Frog processes all lines, even when the option --languages=nld is used, which of course doesn't work out well for the English sentence.

The correct behavior would be to skip these lines, probably with minimal tabbed output, and in case of FoLiA output just add the sentence with the correct 'lang' tag.
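
A minimal sketch of the skipping behavior described above, for the tabbed-output case only. It is not Frog's or ucto's actual code: the function names `mark_sentences` and `tabbed_lines`, the three-column layout, and the `processed`/`skipped` markers are all hypothetical, and the language codes are assumed to have been assigned already by the tokenizer.

```python
from typing import Iterable, List, Set, Tuple

# (text, detected language) pairs, as a tokenizer might deliver them.
# The language codes follow the example in this issue.
SAMPLE: List[Tuple[str, str]] = [
    ("Een regel in het Nederlands.", "nld"),
    ("This isn't Dutch", "eng"),
    ("Dit is geen Engels", "nld"),
]


def mark_sentences(
    sentences: Iterable[Tuple[str, str]],
    allowed: Set[str],
) -> List[Tuple[str, str, bool]]:
    """Pair every sentence with a flag: True if its language is allowed."""
    return [(text, lang, lang in allowed) for text, lang in sentences]


def tabbed_lines(rows: Iterable[Tuple[str, str, bool]]) -> List[str]:
    """Build one tab-separated line per sentence (hypothetical layout).

    Skipped sentences are kept in the output, so nothing is silently
    dropped, but they are marked instead of being analysed.
    """
    lines = []
    for text, lang, process in rows:
        status = "processed" if process else "skipped"
        lines.append(f"{text}\t{lang}\t{status}")
    return lines


# Equivalent in spirit to running with --languages=nld
for line in tabbed_lines(mark_sentences(SAMPLE, {"nld"})):
    print(line)
```

The point of the sketch is that the language check happens before any analysis, and that a skipped sentence still yields a minimal output line carrying its detected language.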

kosloot commented 5 years ago

This is (at least partly) resolved now. But on Macs some things still seem to fail. Examining...

kosloot commented 5 years ago

The problems on Mac are caused by https://github.com/LanguageMachines/ucto/issues/62. For now we avoid them by testing on larger sentences. Everything is fixed now on the new_datastructure branch: https://github.com/LanguageMachines/frog/commit/a31599884d0e04316a8b40b34ce8f543c29de487

kosloot commented 5 years ago

The new_datastructure branch is merged into master, so this is done as well.