Gathering stats to find anomalies in dict

dchaplinsky / LT2OpenCorpora

Python script to convert a Ukrainian morphological dictionary to OpenCorpora format. The script runs well under PyPy and also collects some stats/insights/anomalies in the dicts. Use at your own risk.

MIT License


Gathering stats to find anomalies in dict #6

Open dchaplinsky opened 9 years ago

dchaplinsky commented 9 years ago

At the moment we are already doing something like this, but it'd be nice to collect more.

Check how the conversion script uses blinker to report some of the stats, and have a look at the --debug flag.

Anomaly detection can include, for example, histograms of wordform counts for each POS. E.g., in 10000 cases a VERB has 14 wordforms, in 1000 cases 16, and in 5 cases 35. That way we can easily spot suspicious things and report them to the dictionary author.
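A minimal sketch of what such a histogram could look like. It assumes the dictionary has already been parsed into `(lemma, pos, wordforms)` tuples; that tuple layout, the function name `wordform_histograms`, and the toy data are all made up for illustration and are not taken from the conversion script.

```python
from collections import Counter, defaultdict


def wordform_histograms(lemmas):
    """Map each POS to a Counter of {number_of_wordforms: number_of_lemmas}.

    `lemmas` is a hypothetical iterable of (lemma, pos, wordforms) tuples.
    """
    histograms = defaultdict(Counter)
    for _lemma, pos, wordforms in lemmas:
        histograms[pos][len(wordforms)] += 1
    return histograms


# Toy data, only to show the shape of the output.
sample = [
    ("читати", "verb", ["читати", "читаю", "читаєш"]),
    ("писати", "verb", ["писати", "пишу", "пишеш"]),
    ("йти", "verb", ["йти"]),
    ("стіл", "noun", ["стіл", "стола"]),
]

hist = wordform_histograms(sample)
for pos in sorted(hist):
    for n_forms, n_lemmas in sorted(hist[pos].items()):
        print(f"{pos}: {n_lemmas} lemma(s) with {n_forms} wordform(s)")
```

Counts that show up only a handful of times for a given POS are the ones worth a closer look.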

igor-tytyk commented 9 years ago

Hi there. I have a question about the 'lemma form' field in mappings.csv. What is it for? Why does 'verb' have only 'inf' there, and no gender or mood?

dchaplinsky commented 9 years ago

That's the bare minimum of tags that a lemma should have.

Check this out: https://github.com/dchaplinsky/LT2OpenCorpora/blob/master/convert.py#L142

I'll also forward you an email from Marianna where she gave me that list.
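To illustrate the idea of a "bare minimum" tag set, here is a rough sketch of such a check. It is not the code from convert.py: the `REQUIRED_LEMMA_TAGS` table and the `missing_lemma_tags` helper are hypothetical, and only the verb/'inf' pairing comes from this thread (the noun entry is invented for the example).

```python
# Hypothetical minimum tag sets per POS. Only verb -> "inf" is mentioned in
# this thread; the noun entry is made up for illustration.
REQUIRED_LEMMA_TAGS = {
    "verb": {"inf"},
    "noun": {"nom", "sing"},
}


def missing_lemma_tags(pos, lemma_tags, required=REQUIRED_LEMMA_TAGS):
    """Return the required tags that the lemma's tag set lacks."""
    return required.get(pos, set()) - set(lemma_tags)


# A verb lemma may carry extra tags; it only has to include "inf".
print(missing_lemma_tags("verb", ["verb", "inf", "imperf"]))
# A noun lemma tagged only "nom" lacks "sing" under the made-up table above.
print(missing_lemma_tags("noun", ["noun", "nom"]))
```

So extra tags such as gender or mood are not forbidden on a lemma, they are just not part of the minimum.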

igor-tytyk commented 9 years ago

By the way, what exactly are anomalies? Is there a place where they are described? Are they just wrong word forms or tagsets?

dchaplinsky commented 9 years ago

Anything you can think of. Basically, it's more like fishing in muddy water.

dchaplinsky commented 9 years ago

As I said, one such example is to count the number of wordforms for each lemma and check the breakdown by POS. If an adj usually has (say) 5 or 6 wordforms, then the rare cases where there are 2 or 17 forms should be investigated, etc.
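Roughly, in code, under the assumption that lemmas are available as `(lemma, pos, wordforms)` tuples. The `suspicious_lemmas` name, the `min_share` threshold, and the toy data are hypothetical; a real cutoff would have to be tuned against the actual dictionary.

```python
from collections import Counter, defaultdict


def suspicious_lemmas(lemmas, min_share=0.01):
    """Return (lemma, pos, n_forms) for lemmas whose wordform count is rare
    within their POS, i.e. shared by less than `min_share` of that POS."""
    lemmas = list(lemmas)
    counts = defaultdict(Counter)
    for _lemma, pos, wordforms in lemmas:
        counts[pos][len(wordforms)] += 1

    flagged = []
    for lemma, pos, wordforms in lemmas:
        n_forms = len(wordforms)
        total = sum(counts[pos].values())
        if counts[pos][n_forms] / total < min_share:
            flagged.append((lemma, pos, n_forms))
    return flagged


# Toy data: 100 adjectives with 5 or 6 forms, plus two outliers.
sample = [(f"adj{i}", "adj", ["f"] * 5) for i in range(60)]
sample += [(f"adj{i}", "adj", ["f"] * 6) for i in range(60, 100)]
sample += [("short_adj", "adj", ["f"] * 2), ("long_adj", "adj", ["f"] * 17)]

flagged = suspicious_lemmas(sample, min_share=0.05)
print(flagged)
```

A relative threshold is used here instead of an absolute count so that the same setting works for both large and small POS groups.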

dchaplinsky commented 9 years ago

Check commits d7187fb2b8fd661812658c55dec4547a755d6110 and 85d6f17e906b04ad35bd88ccd9d0ff9faaad4522 for another example.