Open dchaplinsky opened 9 years ago
Hi there. I have a question about the 'lemma form' field in mappings.csv. What is it for? Why does 'verb' have only 'inf' there, with no gender or mood?
That's the bare minimum of tags that a lemma should have.
Check this out: https://github.com/dchaplinsky/LT2OpenCorpora/blob/master/convert.py#L142
I'll also forward you an email from Marianna where she gave me that list.
By the way, what exactly are anomalies? Is there a place where they are described? Are they just wrong word forms or tagsets?
Anything you can think of. Basically, it's more like fishing in muddy water.
As I said, one such example is to count the number of wordforms for each lemma and check the breakdown by POS. If an adj usually has (say) 5 or 6 wordforms, those rare cases where there are 2 forms or 17 should be investigated, etc.
Check commits d7187fb2b8fd661812658c55dec4547a755d6110 and 85d6f17e906b04ad35bd88ccd9d0ff9faaad4522 for another example.
At the moment we are already doing this kind of thing, but it'd be nice to collect more.
Check how the conversion script uses blinker to report some of the stats, and how it uses the --debug flag.
Anomaly detection can include, for example, histograms of wordform counts for different POS. E.g., in 10000 cases VERBs have 14 wordforms, in 1000 cases 16, and in 5 cases 35. So we can easily spot suspicious things and report them to the dictionary author.
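To make the histogram idea concrete, here is a minimal sketch. It is not taken from convert.py: the function names `wordform_histograms` and `rare_counts`, the `(lemma, pos, wordforms)` input shape, and the 1% threshold are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def wordform_histograms(entries):
    """Build {pos: Counter({number_of_wordforms: number_of_lemmas})}.

    `entries` is an iterable of (lemma, pos, wordforms) tuples; this input
    shape is hypothetical, not the converter's real data model.
    """
    histograms = defaultdict(Counter)
    for _lemma, pos, wordforms in entries:
        histograms[pos][len(wordforms)] += 1
    return histograms


def rare_counts(histograms, max_share=0.01):
    """Yield (pos, form_count, lemma_count) for histogram buckets whose
    share of lemmas within that POS is below `max_share` (assumed cutoff)."""
    for pos, hist in histograms.items():
        total = sum(hist.values())
        for form_count, lemma_count in sorted(hist.items()):
            if lemma_count / total < max_share:
                yield pos, form_count, lemma_count


# The VERB numbers from the comment above, entered directly as a histogram.
verb_example = {"VERB": Counter({14: 10000, 16: 1000, 35: 5})}
for pos, form_count, lemma_count in rare_counts(verb_example):
    print(f"{pos}: {lemma_count} lemmas with {form_count} wordforms")
```

With those numbers only the 35-wordform bucket falls under the 1% cutoff, so that is the one that would be reported to the dictionary author. A relative threshold is just one option; a fixed minimum count per bucket would work as well.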