proycon closed this 4 years ago
It seems they don't explicitly set a tokeniser (either by language or by config) in https://github.com/GreekPerspective/glem/blob/master/glem/pretrained_models/herodotus/frog.cfg.template, so I assume it defaults to Dutch? I see the sets are left as is too, which is not good.
Setting an explicit tokconfig-generic in glem's frog.cfg seems to solve this.
In general, it would be helpful to have a clear ucto-only proof of the problem. As it stands, it might also be a glem or a Frog issue.
But: not having a tokenizer config in Frog AT ALL should be signaled at Frog's startup, just as it is for the parser:
20191101:111809:556:Missing [[parser]] section in config file.
20191101:111809:556:Disabled the parser.
I am surprised not to see that in the log above.
The missing [[tokenizer]] section should put the tokenizer in passthru mode, NOT Dutch.
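To check a config file for this independently of Frog, something like the following could be used. It is only a rough sketch, not how Frog itself parses its configuration: it assumes module sections are marked by a `[[name]]` header on a line of their own, as in the log message above, and the helper names `find_sections` and `missing_sections` are made up for illustration.

```python
import re

# Assumption: a module section starts with a header such as "[[tokenizer]]"
# on its own line. This is NOT Frog's actual config parser.
SECTION_RE = re.compile(r"^\s*\[\[([\w-]+)\]\]\s*$")


def find_sections(config_text):
    """Return the set of [[section]] names found in the config text."""
    sections = set()
    for line in config_text.splitlines():
        match = SECTION_RE.match(line)
        if match:
            sections.add(match.group(1))
    return sections


def missing_sections(config_text, expected=("tokenizer", "parser")):
    """Return the expected section names that are absent, in the given order."""
    present = find_sections(config_text)
    return [name for name in expected if name not in present]


if __name__ == "__main__":
    # Hypothetical config: only section headers matter for this check.
    sample = "[[tagger]]\nsome_key=some_value\n[[mblem]]\n"
    for name in missing_sections(sample):
        print(f"Missing [[{name}]] section in config file.")
```

Running it on a config without a [[tokenizer]] section reports that section as missing, which is the kind of startup message one would expect to see in the log.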
Could you test by explicitly setting --skip=t on the frog command line?
Ok, so this seems to be a Frog problem: the passthru mode does not seem to be set correctly when the [[tokenizer]] section is missing.
Running with --skip=t (which does set passthru) DOES work.
Quite spooky.
Solved in Ucto.
Input: