question on accuracy metrics

jwijffels commented 6 years ago

Hello, I've got a question regarding accuracy metrics reported at https://spacy.io/models

Let's take as an example the reported accuracy for the French model built on the UD Sequoia Corpus available at https://spacy.io/models/fr It mentions pos: 94.52 uas: 87.16 las: 84.43 How are these calculated? Namely

Are they based on the dev or test data of UD_French-Sequoia
Is that POS number based on the known gold tokenisation or based on raw text followed by some form of alignment
Are the UAS and LAS numbers based on the known gold tokenisation and morphology or is there something else happening?
Are also F1 measures available as explained at http://universaldependencies.org/conll17/evaluation.html

I'm particularly interested in understanding the comparison to UDPipe baseline models which are reported here: http://ufal.mff.cuni.cz/udpipe/users-manual#udpipe_accuracy and which are built on the same Sequioa corpus upos: 96.8% uas: 88.7% las: 87.4%

honnibal commented 6 years ago

Results for the UD treebanks are currently reported using the development set. I've avoided running test set comparisons until we're using the UD evaluation scripts. Until we're using their scripts, the numbers aren't directly comparable anyway --- so there's no gain from peeking at the test set.
The POS number is based on gold tokenization, without gold sentence segmentation. Pseudo-paragraphs were created by concatenating 10 consecutive sentences. This is unideal, as the order of the sentences in the treebank are randomised --- so we're concatenating unrelated sentences.
The POS accuracy refers to joint prediction of the tag and morphological features. This creates an overly sparse learning problem, so it's not ideal. In future we plan to predict the morphological features separately, using binary classifiers with shared CNN weights.
The F1 measures are calculated using the same metric definitions as the CoNLL evaluation. Specifically, we use the same alignment procedure and exclude punctuation tokens from the evaluation. However, the scores were not produced using their evaluation software --- so minor differences may result.

Overall there's still a fair bit to do before we can publish a full table of numbers that allow direct comparison using the CoNLL methodology. The current numbers allow different spaCy models trained on the same corpus to be compared against each other, and do provide some indication of the current performance ballpark.

jwijffels commented 6 years ago

Thank you for the input on the datasets used and the POS evaluation. That's clear.

To be clear about the LAS and UAS F1 scores that spacy provides. Based on which of the following 2 is the score computed?

gold tokenised input + predictions from the tagger for the POS and morphology
gold tokenised input and known gold input regarding the POS and morphology in the development set of the Sequoia Corpus.

honnibal commented 6 years ago

If I understand the question correctly, I think the answer is 1.

The parser doesn't currently use any POS or morphology features. Even if it did, we'd definitely use the automatically predicted ones, not the gold-standard.

Note that the English evaluation results refer to automatically tokenized text. It's just that I don't have this set up for the UD treebanks yet.

jwijffels commented 6 years ago

Many thanks for the inputs. This is clear now!

jwijffels commented 6 years ago

FYI. I have tried to compare udpipe with spacy by evaluating how good both models performed on the test set of universal dependencies. For this I converted the output of spacy to conllu format in order to use the methodology explained at http://universaldependencies.org/conll17/evaluation.html to evaluate how good the models are. The results are available at https://github.com/jwijffels/udpipe-spacy-comparison. Feel free to provide feedback on these numbers.

lock[bot] commented 6 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

explosion / spaCy

question on accuracy metrics #1856