gchrupala / morfette

Supervised learning of morphology
BSD 2-Clause "Simplified" License

Better error messages for incorrectly formatted input #21

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
From Djame:

Hi Grzegorz,
I've been able to reproduce this bug, which occurs in two cases:
1°  There are fewer than 3 fields of data per line. Typical case: a punctuation leaf
that lacks a lemma in a treebank, (PONCT .) instead of (PONCT .@.), so if one rewrites
the leaf to get word lemma pos, there will be only two fields, and bang.
2°  There are more than 3 fields. Typical case: the (X (SYM @)) line in the PTB,
which is lemmatized as @^@. A script that works for French and Italian
(tr "@" "^" | tr '^' '\t') will generate 4 fields:
@^@ SYM -> ^^^ SYM -> \t\t\tSYM
and bang, morfette crashes (it took me a whole night to track this down in my own data).
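
To make the second case concrete, here is a minimal Python sketch that emulates the two `tr` calls on the rewritten leaf. The function name `rewrite_leaf` is hypothetical, and the space between the leaf and the POS tag is an assumption taken from the example line above; this is not code from morfette itself.

```python
# Emulates `tr "@" "^" | tr '^' '\t'` applied to one rewritten leaf.
# rewrite_leaf is a hypothetical helper, not part of morfette.
def rewrite_leaf(leaf: str) -> str:
    # First every "@" becomes "^", then every "^" becomes a tab,
    # so a literal "@" token is destroyed along with the separator.
    return leaf.replace("@", "^").replace("^", "\t")


line = rewrite_leaf("@^@ SYM")
fields = line.split("\t")

print(repr(line))
print(len(fields))  # 4 fields instead of the expected 3
```

The word and the lemma are both the separator character here, so they vanish and only empty fields are left in front of the tag.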
Solutions:
1) The best: make morfette's error message more explicit (e.g. display the offending
line and some context).
2) Run a checker script on the input first:
http://pauillac.inria.fr/~seddah/check.pl
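
For illustration, a checker along these lines could look like the Python sketch below. It is not the linked check.pl, and it makes assumptions: one token per line, exactly 3 tab-separated fields (word, lemma, POS), and blank lines as sentence separators. The names `EXPECTED_FIELDS`, `find_bad_lines` and `report` are hypothetical.

```python
EXPECTED_FIELDS = 3  # assumed layout: word, lemma, POS


def find_bad_lines(lines, expected=EXPECTED_FIELDS):
    """Yield (line_number, reason) for each malformed line."""
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # assumed: blank lines separate sentences
        fields = line.split("\t")
        if len(fields) != expected:
            yield number, f"expected {expected} fields, found {len(fields)}"
        elif any(not field.strip() for field in fields):
            yield number, "empty field"


def report(lines, context=2):
    """Return messages showing each bad line with some surrounding context."""
    lines = list(lines)
    messages = []
    for number, reason in find_bad_lines(lines):
        messages.append(f"line {number}: {reason}")
        start = max(0, number - 1 - context)
        end = min(len(lines), number + context)
        for index in range(start, end):
            marker = ">>" if index == number - 1 else "  "
            messages.append(f"{marker} {index + 1}: {lines[index].rstrip()!r}")
    return messages


sample = ["le\tle\tDET\n", ".\tPONCT\n", "\n", "\t\t\tSYM\n"]
print("\n".join(report(sample)))
```

On a real file, one would pass an open file handle to `report` instead of `sample`. Printing the line number and a few neighbouring lines is the kind of context that solution 1) asks morfette to give directly.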

Original issue reported on code.google.com by pitekus on 19 Dec 2011 at 3:26