LanguageMachines / frog

Frog is an integration of memory-based natural language processing (NLP) modules developed for Dutch. All NLP modules are based on Timbl, the Tilburg memory-based learning software package.
https://languagemachines.github.io/frog
GNU General Public License v3.0

Token annotation error for XML output with non-standard rules #82

Closed: marijnschraagen closed this issue 4 years ago

marijnschraagen commented 5 years ago

Maybe related to https://github.com/LanguageMachines/frog/issues/80?

When using XML output with non-standard rules there is a token-annotation error. Command: frog -t myfile.txt -X myresult.xml --language=nld-vnn

Output:

frog 0.19 (c) CLTS, ILK 1998 - 2019
CLST  - Centre for Language and Speech Technology,Radboud University
ILK   - Induction of Linguistic Knowledge Research Group,Tilburg University
based on [ucto 0.19, libfolia 2.4, timbl 6.4.14, ticcutils 0.23, mbt 3.5]
removing old debug files using: 'find frog.*.debug -mtime +1 -exec rm {} \;'
frog-:config read from: /usr/local/share/frog/nld-vnn/frog.cfg
frog-:Missing [[mbma]] section in config file.
frog-:Disabled the Morhological analyzer.
frog-:Missing [[IOB]] section in config file.
frog-:Disabled the IOB Chunker.
frog-:Missing [[NER]] section in config file.
frog-:Disabled the NER.
frog-:Missing [[mwu]] section in config file.
frog-:Disabled the Multi Word Unit.
frog-:Also disabled the parser.
frog-mblem-:Initiating lemmmmatizer...
ucto: textcat configured from: /usr/local/share/ucto/textcat.cfg
frog-tok-:Language List =[nld-vnn]
ucto: No useful settingsfile(s) could be found (initiating from language list: [nld-vnn])
frog-tagger-tagger-:reading subsets from /usr/local/share/frog/nld-vnn//babsub.cgn
frog-tagger-tagger-:reading constraints from /usr/local/share/frog/nld-vnn//babconstraints.cgn
frog-:Thu Sep 12 19:09:35 2019 Initialization done.
frog-:Thu Sep 12 19:09:35 2019 Frogging myfile.txt
[first sentence processed ok, removed here]

Word(class='WORD-COMPOUND',generate_id='myfile.txt.p.1.s.1',
set='tokconfig-nld-vnn',space='no') creation failed: DeclarationError:
Set 'tokconfig-nld-vnn' is used but has no declaration for token-annotation

The regular column-based output works without any problems.

proycon commented 5 years ago

I can indeed replicate this. It seems related to LanguageMachines/ucto#72.

kosloot commented 5 years ago

Well... The problem here is that frog uses the 'language' nld-vnn, which refers to the configuration in /usr/local/share/frog/nld-vnn/. Ucto is then initialized from /usr/local/share/frog/nld-vnn/frog.cfg using:

[[tokenizer]]
rulesFile=tokconfig-nld-historical

So for ucto the language is nld-historical.

This is confusing for us as well as for the software.
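To make the mismatch concrete, here is a minimal sketch (not part of frog or ucto) that reads the rulesFile entry from the [[tokenizer]] section of a frog.cfg-style text and compares it with the name one would expect from the --language value. The "tokconfig-" + language naming is an assumption inferred from the error message above, and the function names are hypothetical.

```python
import re


def tokenizer_rules_file(cfg_text):
    """Return the rulesFile value from the [[tokenizer]] section, or None."""
    section = None
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Section headers in this config style look like [[name]].
        header = re.fullmatch(r"\[\[(.+?)\]\]", line)
        if header:
            section = header.group(1)
            continue
        if section == "tokenizer" and "=" in line:
            key, value = line.split("=", 1)
            if key.strip() == "rulesFile":
                return value.strip()
    return None


def check_language(cfg_text, language):
    """Compare the configured rules file with the name implied by the language.

    Assumption: the expected name is 'tokconfig-' + language, as suggested
    by the set name in the error message.
    """
    rules = tokenizer_rules_file(cfg_text)
    expected = "tokconfig-" + language
    return rules, expected, rules == expected


example_cfg = """[[tokenizer]]
rulesFile=tokconfig-nld-historical
"""

rules, expected, same = check_language(example_cfg, "nld-vnn")
print(f"configured: {rules}, implied by --language: {expected}, match: {same}")
```

For the configuration quoted above, the two names differ, which is the situation that leads to the DeclarationError.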

Running Frog as follows seems to work, so it might be a quick workaround: frog -c /usr/local/share/frog/nld-vnn/frog.cfg -X uit.xml -t txt
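For anyone scripting this workaround, a small sketch of building the invocation with an explicit -c instead of --language could look like the following. It uses only the flags that appear in this thread (-c, -X, -t); the helper name is hypothetical, and the file names are the ones from the original report.

```python
import shlex


def frog_workaround_command(config_path, input_path, xml_out):
    """Build a frog command line that names the config file explicitly."""
    return ["frog", "-c", config_path, "-X", xml_out, "-t", input_path]


cmd = frog_workaround_command(
    "/usr/local/share/frog/nld-vnn/frog.cfg",
    "myfile.txt",
    "myresult.xml",
)
# Print the command for inspection; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

The sketch only prints the command, so it can be checked before frog is actually run.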

As a matter of fact, I am inclined to think that this is an abuse of the --language parameter. It is meant to give frog a hint about the languages to detect, and NOT to tell which configuration to use.

When using --language, frog should ignore the rulesFile information from the frog config file. This was the case until @proycon "fixed" it in #80. That was probably putting the cart before the horse.

We need to rethink this.

kosloot commented 4 years ago

fixed according to #80