explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License
29.65k stars 4.36k forks

question on accuracy metrics #1856

Closed jwijffels closed 6 years ago

jwijffels commented 6 years ago

Hello, I've got a question regarding accuracy metrics reported at https://spacy.io/models

Let's take as an example the reported accuracy for the French model, built on the UD Sequoia Corpus, available at https://spacy.io/models/fr. It mentions pos: 94.52, uas: 87.16, las: 84.43. How are these calculated?

I'm particularly interested in the comparison to the UDPipe baseline models reported at http://ufal.mff.cuni.cz/udpipe/users-manual#udpipe_accuracy, which are built on the same Sequoia corpus: upos: 96.8%, uas: 88.7%, las: 87.4%.
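
For reference on what the parsing metrics measure: UAS (unlabelled attachment score) is the fraction of tokens whose predicted head is correct, and LAS (labelled attachment score) additionally requires the dependency label to match. A minimal sketch, assuming gold tokenisation so the two token sequences align one-to-one (the tuple representation here is mine, not spaCy's API):

```python
def uas_las(gold, pred):
    """Compute (UAS, LAS) over two aligned lists of (head_index, dep_label)
    tuples, one per token. Assumes gold tokenisation, so both lists have
    the same length and the i-th entries describe the same token."""
    assert len(gold) == len(pred)
    n = len(gold)
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, pred))  # head only
    correct_both = sum(g == p for g, p in zip(gold, pred))         # head + label
    return correct_heads / n, correct_both / n

# Toy example: the last token gets the right head but the wrong label,
# so UAS = 3/3 and LAS = 2/3.
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod")]
```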

honnibal commented 6 years ago

Overall there's still a fair bit to do before we can publish a full table of numbers that allow direct comparison using the CoNLL methodology. The current numbers allow different spaCy models trained on the same corpus to be compared against each other, and do provide some indication of the current performance ballpark.

jwijffels commented 6 years ago

Thank you for the input on the datasets used and the POS evaluation. That's clear.

To be clear about the LAS and UAS F1 scores that spaCy provides: on which of the following two inputs is the score computed?

  1. gold tokenised input plus POS and morphology predicted by the tagger
  2. gold tokenised input plus the gold POS and morphology from the development set of the Sequoia Corpus

honnibal commented 6 years ago

If I understand the question correctly, I think the answer is 1.

The parser doesn't currently use any POS or morphology features. Even if it did, we'd definitely use the automatically predicted ones, not the gold-standard.

Note that the English evaluation results refer to automatically tokenized text. It's just that I don't have this set up for the UD treebanks yet.
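
As background on why evaluating on automatically tokenized text changes the metric: once tokenisation itself is predicted, gold and system token counts can differ, so plain accuracy is no longer well-defined and scores are instead reported as F1 over tokens whose character spans exactly match a gold token. A minimal sketch of that idea (the span representation is mine, not taken from any evaluation script):

```python
def token_f1(gold_spans, pred_spans):
    """F1 over exactly matching token spans.

    Each span is a (start_char, end_char) pair into the raw text. A
    predicted token counts as correct only if its span coincides with
    a gold token's span."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)                 # exact span matches
    if not pred or not gold:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: gold splits "bar baz" into two tokens, the system merges
# them, so only the first token matches exactly.
gold = [(0, 3), (4, 7), (8, 11)]
pred = [(0, 3), (4, 11)]
```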

jwijffels commented 6 years ago

Many thanks for the inputs. This is clear now!

jwijffels commented 6 years ago

FYI, I have tried to compare UDPipe with spaCy by evaluating how well both models perform on the Universal Dependencies test sets. For this I converted spaCy's output to CoNLL-U format in order to apply the evaluation methodology explained at http://universaldependencies.org/conll17/evaluation.html. The results are available at https://github.com/jwijffels/udpipe-spacy-comparison. Feel free to provide feedback on these numbers.
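
The core of such a conversion is writing each parsed sentence as ten tab-separated CoNLL-U columns, one token per line, with a blank line ending the sentence. A minimal sketch; the dict keys below are a hypothetical stand-in mirroring the attributes a spaCy token exposes, not spaCy's actual API, and unused columns are left as `_`:

```python
def sent_to_conllu(sent):
    """Render one parsed sentence as a CoNLL-U block.

    `sent` is a list of dicts with keys text, lemma, pos, head, dep
    (a hypothetical shape, not spaCy's Token object). `head` is a
    1-based index into the sentence, 0 for the root, as CoNLL-U requires."""
    lines = []
    for i, tok in enumerate(sent, start=1):
        cols = [
            str(i),            # 1 ID
            tok["text"],       # 2 FORM
            tok["lemma"],      # 3 LEMMA
            tok["pos"],        # 4 UPOS
            "_",               # 5 XPOS (not filled in this sketch)
            "_",               # 6 FEATS (not filled in this sketch)
            str(tok["head"]),  # 7 HEAD
            tok["dep"],        # 8 DEPREL
            "_",               # 9 DEPS
            "_",               # 10 MISC
        ]
        lines.append("\t".join(cols))
    return "\n".join(lines) + "\n\n"  # blank line terminates the sentence

# Toy two-token sentence: "Le chat", determiner attached to the root noun.
sent = [
    {"text": "Le", "lemma": "le", "pos": "DET", "head": 2, "dep": "det"},
    {"text": "chat", "lemma": "chat", "pos": "NOUN", "head": 0, "dep": "root"},
]
```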

lock[bot] commented 6 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.