UAlbertaALTLab / itwewina

Replaced by https://github.com/UAlbertaALTLab/cree-intelligent-dictionary
GNU General Public License v3.0

Check validity of dictionary when starting up #94

Open eddieantonio opened 5 years ago

eddieantonio commented 5 years ago

Ensure mistakes like #93 don't happen again.

Basically, do a few sanity checks before starting the app:

EDIT: I'm pretty sure I can validate a lot of these things by creating an XML schema, and using a schema validator, but... that might be more effort than it's worth.

Print warnings on start up that are LOUDLY logged somewhere.
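A minimal sketch of what such a startup check could look like, short of a full XML schema. The element names (`e` for an entry, `lg/l` for its lemma), the `pos` attribute, and the logger name are assumptions made for illustration, not the confirmed layout of the dictionary source; the individual checks are placeholders as well.

```python
import logging
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical logger name; route it wherever warnings should be LOUD.
logger = logging.getLogger("dictionary.validation")


def check_dictionary(xml_text: str) -> List[str]:
    """Return a list of human-readable problems found in the dictionary XML."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as error:
        return [f"dictionary is not well-formed XML: {error}"]

    problems = []
    entries = root.findall("e")  # assumed entry element
    if not entries:
        problems.append("dictionary contains no <e> entries")

    for index, entry in enumerate(entries, start=1):
        lemma = entry.find("lg/l")  # assumed lemma element
        if lemma is None or not (lemma.text or "").strip():
            problems.append(f"entry {index}: missing or empty lemma")
        elif not lemma.get("pos"):  # assumed part-of-speech attribute
            problems.append(
                f"entry {index} ({lemma.text.strip()}): lemma has no pos attribute"
            )
    return problems


def warn_about_dictionary(xml_text: str) -> bool:
    """Log every problem as a warning; return True if the dictionary looks sane."""
    problems = check_dictionary(xml_text)
    for problem in problems:
        logger.warning("DICTIONARY CHECK FAILED: %s", problem)
    return not problems
```

Calling `warn_about_dictionary(...)` once during startup, before the app serves requests, would surface these problems without preventing the app from starting.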

aarppe commented 5 years ago

I had noticed this in a few cases. Currently, the reason is that some of the comparative matches/mismatches between CW and MD result from the descriptive analysis allowing for two inflected forms belonging to different parts of speech (and lemmas with different parts of speech). Typically, we have a non-base inflected-form lexical entry in MD, which matches a base-form lexical entry in CW. In such a case one would need two different lexical entries. E.g.

MD: atos
MD: Have him do something for you.
CW: atos+N+AN+Sg
CW: atos
CW: arrow
CW: atos COMP:lemma

This should be resolvable, but may require some thinking on how to produce the appropriate POS and LC info for the MD entries (which don't have CW-style LC:s, so I'd have to extract that by linking the lemma from the correct FST analysis of the MD entry with the LC in CW for that lemma).

aarppe commented 5 years ago

With some new scripting, dictionary entries end up being matched only if they belong to the same part of speech (by exclusion of 'conjugation' class in MD vs. CW comparisons); the conjugated/inflected forms are now output as separate MD-only dictionary entries. So the 'atos' issue above is no longer an 'issue'.
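For illustration only, the matching rule can be sketched roughly as follows. This is not the actual script: the dictionary keys (`headword`, `pos`, `gloss`) are hypothetical, and the `V` tag on the MD entry is an assumption made for the example (the `N` tag comes from the CW analysis `atos+N+AN+Sg`).

```python
from typing import Dict, List, Tuple

Entry = Dict[str, str]


def match_by_pos(
    md_entries: List[Entry], cw_entries: List[Entry]
) -> Tuple[List[Tuple[Entry, Entry]], List[Entry]]:
    """Pair MD and CW entries only when headword AND part of speech agree.

    Returns (matched pairs, MD-only entries).
    """
    # Index CW entries by (headword, pos) for direct lookup.
    cw_by_key: Dict[Tuple[str, str], List[Entry]] = {}
    for cw in cw_entries:
        cw_by_key.setdefault((cw["headword"], cw["pos"]), []).append(cw)

    matched: List[Tuple[Entry, Entry]] = []
    md_only: List[Entry] = []
    for md in md_entries:
        candidates = cw_by_key.get((md["headword"], md["pos"]), [])
        if candidates:
            matched.extend((md, cw) for cw in candidates)
        else:
            # Same spelling but different part of speech: keep as MD-only.
            md_only.append(md)
    return matched, md_only


# The 'atos' case: same spelling, different parts of speech, so no match.
md = [{"headword": "atos", "pos": "V", "gloss": "Have him do something for you."}]
cw = [{"headword": "atos", "pos": "N", "gloss": "arrow"}]
matched, md_only = match_by_pos(md, cw)
```

Here `matched` is empty and the MD entry lands in `md_only`, which mirrors the behaviour described above.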

aarppe commented 5 years ago

Checking validity of XML source and delivering warnings of bad structure to appropriate location/email is a very desirable feature.