Hopefully @antalvdb can shed some light on this?
This paper from 2007 has the basic performance estimates for POS tagging, morphological analysis, and dependency parsing. The dependency parsing scores were computed with a predecessor of the CoNLL dependency parsing evaluator. The paper does not specify scores for the lemmatizer (but see this paper), the shallow parser / XP chunker (current score on test data: 91.3 precision, 92.5 recall, 91.9 F-score) or the named entity recognizer (current score on test data: overall F-score 82.1, persons 81.7, locations 90.9, organizations 75.1). The latter scores have not been published yet.
Thank you for the links to the papers and the scores. That already provides some information.
I'm trying to compare frog to the model accuracies reported for the udpipe models (https://ufal.mff.cuni.cz/udpipe/users-manual#universal_dependencies_20_models_performance), which were built on either the UD_Dutch or the UD_Dutch-LassySmall corpus (see http://universaldependencies.org/treebanks/nl-comparison.html for details on these 2 corpora). As the numbers reported in the papers are largely driven by the corpus used, I wonder if accuracy-related metrics (precision/recall/F-score/UAS/LAS) are available for a model that was also trained on these corpora from Universal Dependencies? Or is it wishful thinking that someone would have done this?
Unfortunately we do not have the time to do these types of comparative evaluations. We always welcome anyone willing to put in time to do these types of exercises, and are happy to assist where possible.
Frog's parser is described in more detail here. The memory-based parser emulates the Alpino parser. Its inference (constraint satisfaction inference) is fast, but produces parses that are less accurate than those of Alpino. We trade accuracy for (predictable, relatively high) speed.
Thank you for the input. I completely understand that you don't have time for this; it's not a small task. I'm basically asking because I recently wrote an R wrapper around UDPipe (https://github.com/bnosac/udpipe) and I'm now investigating how UDPipe compares to other similar parsers, e.g. the Alpino parser or frog for Dutch, OpenNLP, or the Python pattern NLP package.
I recently made a comparison between UDPipe & spaCy (https://github.com/jwijffels/udpipe-spacy-comparison), but I would like to add e.g. Frog as well as the Python pattern library, Alpino and OpenNLP. Do you know if such research has already been done, so that maybe I can take a shortcut in this analysis?
As far as I know, it has not been done and would be a very welcome study.
Last week, I got in contact with Gertjan van Noord & Gosse Bouma, as they had a paper on evaluating Alpino versus Parsey/Parseysaurus on the UD_Lassy-Small treebank: http://aclweb.org/anthology/W17-0403. I received the Alpino output in CoNLL-U format, which allowed a comparison to UDPipe. There were some nitty-gritty details in the evaluation, but it already gave an indication of accuracy.
All it takes to make a comparison is the annotation of some text for which we know the gold-standard annotation, provided in CoNLL-U output; after that, the evaluation script used by the CoNLL 2017 shared task, available at https://github.com/ufal/conll2017/blob/master/evaluation_script/conll17_ud_eval.py, can be used. The tricky part is getting the annotation result into CoNLL-U format :)
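To make the idea concrete, here is a minimal sketch of what the dependency part of such an evaluation boils down to, assuming the gold and system files share identical tokenization. The official conll17_ud_eval.py additionally aligns differing tokenizations and reports many more metrics, so it is the script to use in practice; this is only an illustration.

```python
# Minimal UAS/LAS sketch over two CoNLL-U files with identical tokenization.
# Illustration only; the official conll17_ud_eval.py also handles tokenization
# and sentence-segmentation mismatches and reports many more metrics.

def read_conllu(path):
    """Yield (HEAD, DEPREL) pairs for every syntactic word in a CoNLL-U file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            cols = line.split("\t")
            # skip multiword token ranges (e.g. 1-2) and empty nodes (e.g. 1.1)
            if "-" in cols[0] or "." in cols[0]:
                continue
            yield cols[6], cols[7]

def attachment_scores(gold_path, system_path):
    gold = list(read_conllu(gold_path))
    system = list(read_conllu(system_path))
    assert len(gold) == len(system), "tokenization differs; use conll17_ud_eval.py"
    uas = sum(g[0] == s[0] for g, s in zip(gold, system)) / len(gold)
    las = sum(g == s for g, s in zip(gold, system)) / len(gold)
    return uas, las

# Example with hypothetical file names:
# print(attachment_scores("nl-ud-test.conllu", "frog-output.conllu"))
```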
As this is not a real issue, I'm closing it.
Hello, this is not an issue, just a question, basically about the accuracy of the different NLP tasks. I'm interested in comparing different types of NLP annotators and their accuracy. How well does frog do in terms of accuracy on tokenisation, part-of-speech tagging, lemmatisation, morphological feature annotation and dependency parsing? Are there numbers available which are comparable to the CoNLL 2017 shared task (for example by training frog on Dutch data from Universal Dependencies, outputting the results and scoring them with the evaluation script used by the CoNLL 2017 shared task, available at https://github.com/ufal/conll2017/blob/master/evaluation_script/conll17_ud_eval.py)? Are such numbers available?
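For concreteness, the kind of scoring step I have in mind would look something like this (just a sketch: the file names are placeholders, and the script's exact positional arguments and flags should be verified with `--help`):

```python
# Sketch: scoring a system CoNLL-U file against gold data with the
# CoNLL 2017 evaluation script. The argument order (gold, then system)
# and the --verbose flag are assumptions; check the script's --help.
import subprocess

gold = "nl-ud-test.conllu"       # gold-standard UD test data (placeholder name)
system = "frog-output.conllu"    # system output converted to CoNLL-U (placeholder name)

result = subprocess.run(
    ["python", "conll17_ud_eval.py", "--verbose", gold, system],
    capture_output=True, text=True, check=True,
)
print(result.stdout)  # per-metric precision/recall/F1 (tokenization, UPOS, lemmas, UAS, LAS, ...)
```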