Results for the UD treebanks are currently reported using the development set. I've avoided running test set comparisons until we're using the UD evaluation scripts. Until we're using their scripts, the numbers aren't directly comparable anyway --- so there's no gain from peeking at the test set.
The POS number is based on gold tokenization, without gold sentence segmentation. Pseudo-paragraphs were created by concatenating 10 consecutive sentences. This isn't ideal, as the order of the sentences in the treebank is randomised --- so we're concatenating unrelated sentences.
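Concretely, the pseudo-paragraph construction amounts to something like the sketch below (a minimal illustration, not the exact code we use; the treebank file name is just an example):

```python
# Minimal sketch: build pseudo-paragraphs by joining the text of every
# 10 consecutive sentences from a CoNLL-U treebank file.
def read_sentence_texts(conllu_path):
    """Yield the raw text of each sentence from its '# text = ...' comment."""
    with open(conllu_path, encoding="utf8") as f:
        for line in f:
            if line.startswith("# text = "):
                yield line[len("# text = "):].strip()

def make_pseudo_paragraphs(conllu_path, size=10):
    """Concatenate every `size` consecutive sentences into one pseudo-paragraph."""
    texts = list(read_sentence_texts(conllu_path))
    return [" ".join(texts[i:i + size]) for i in range(0, len(texts), size)]

# Example call (file name is illustrative):
paragraphs = make_pseudo_paragraphs("fr_sequoia-ud-dev.conllu")
```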
The POS accuracy refers to joint prediction of the tag and morphological features. This creates an overly sparse learning problem, so it's not ideal. In future we plan to predict the morphological features separately, using binary classifiers with shared CNN weights.
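To illustrate why joint prediction is sparse: every distinct tag-plus-features combination becomes its own class. The snippet below (illustration only, not spaCy internals) counts those joint labels in a CoNLL-U file:

```python
# Illustration only: counting joint TAG|FEATS labels in a CoNLL-U file
# shows why the joint prediction problem is sparse.
from collections import Counter

def joint_labels(conllu_path):
    labels = Counter()
    with open(conllu_path, encoding="utf8") as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue
            cols = line.rstrip("\n").split("\t")
            if "-" in cols[0] or "." in cols[0]:  # skip multi-word and empty tokens
                continue
            upos, feats = cols[3], cols[5]
            labels[upos + "|" + feats] += 1       # e.g. "NOUN|Gender=Fem|Number=Sing"
    return labels

# len(joint_labels("fr_sequoia-ud-train.conllu")) is typically far larger than
# the 17 universal POS tags, which is what makes the problem sparse.
```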
The F1 measures are calculated using the same metric definitions as the CoNLL evaluation. Specifically, we use the same alignment procedure and exclude punctuation tokens from the evaluation. However, the scores were not produced using their evaluation software --- so minor differences may result.
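As a rough picture of what that means, the attachment scores come out of something like the following simplified sketch. The token representation here is made up for illustration, and the real alignment procedure is more involved than exact span matching:

```python
# Simplified sketch of attachment F1 with punctuation excluded.
# Each token is assumed to be (start_char, end_char, head_start_char, deprel, upos).
def attachment_f1(gold_tokens, pred_tokens, labelled=True):
    gold = {(t[0], t[1]): t for t in gold_tokens if t[4] != "PUNCT"}
    pred = {(t[0], t[1]): t for t in pred_tokens if t[4] != "PUNCT"}
    correct = 0
    for span, g in gold.items():
        p = pred.get(span)
        if p is not None and g[2] == p[2] and (not labelled or g[3] == p[3]):
            correct += 1
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# labelled=False gives a UAS-style F1; labelled=True gives a LAS-style F1.
```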
Overall there's still a fair bit to do before we can publish a full table of numbers that allow direct comparison using the CoNLL methodology. The current numbers allow different spaCy models trained on the same corpus to be compared against each other, and do provide some indication of the current performance ballpark.
Thank you for the input on the datasets used and the POS evaluation. That's clear.
To be clear about the LAS and UAS F1 scores that spaCy provides: which of the following two is the score computed on?
If I understand the question correctly, I think the answer is 1.
The parser doesn't currently use any POS or morphology features. Even if it did, we'd definitely use the automatically predicted ones, not the gold-standard.
Note that the English evaluation results refer to automatically tokenized text. It's just that I don't have this set up for the UD treebanks yet.
Many thanks for the inputs. This is clear now!
FYI, I have tried to compare UDPipe with spaCy by evaluating how well both models perform on the Universal Dependencies test sets.
For this, I converted the spaCy output to CoNLL-U format so that I could use the methodology explained at http://universaldependencies.org/conll17/evaluation.html to evaluate how good the models are.
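The conversion is roughly of the following shape (only a sketch, not the exact code in the repository; the model name and file names are examples):

```python
# Rough sketch: run spaCy over the raw text and write one CoNLL-U line per token
# (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC).
# Model name and file names are illustrative.
import spacy

nlp = spacy.load("fr_core_news_sm")

def doc_to_conllu(doc):
    lines = []
    for sent in doc.sents:
        for i, tok in enumerate(sent, start=1):
            head = 0 if tok.head is tok else tok.head.i - sent.start + 1
            lines.append("\t".join([
                str(i), tok.text, tok.lemma_, tok.pos_, tok.tag_, "_",
                str(head), tok.dep_, "_", "_",
            ]))
        lines.append("")  # blank line terminates the sentence
    return "\n".join(lines) + "\n"

with open("fr_sequoia-ud-test.txt", encoding="utf8") as f:
    doc = nlp(f.read())
with open("spacy_fr_sequoia.conllu", "w", encoding="utf8") as out:
    out.write(doc_to_conllu(doc))
```

The resulting .conllu file can then be scored against the gold test file with the evaluation script linked from that page.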
The results are available at https://github.com/jwijffels/udpipe-spacy-comparison. Feel free to provide feedback on these numbers.
Hello, I've got a question regarding accuracy metrics reported at https://spacy.io/models
Let's take as an example the reported accuracy for the French model built on the UD Sequoia corpus, available at https://spacy.io/models/fr. It mentions pos: 94.52, uas: 87.16, las: 84.43. How are these calculated?
In particular, I'm interested in understanding the comparison to the UDPipe baseline models reported at http://ufal.mff.cuni.cz/udpipe/users-manual#udpipe_accuracy, which are built on the same Sequoia corpus: upos: 96.8%, uas: 88.7%, las: 87.4%.