UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Discrepancy between UD state of the art and dependency parsing state of the art #849

Closed · LifeIsStrange closed this issue 2 years ago

LifeIsStrange commented 2 years ago

I cannot find English leaderboards on your website. The only place I can find results is the de facto standard site https://paperswithcode.com/sota/dependency-parsing-on-universal-dependencies. The current SOTA (from 2019!) is 84% accuracy, which is very bad and not usable by industry. Compare this with the Penn Treebank, which has achieved 96.26% accuracy! There is no contest: Universal Dependencies needs much better accuracy to be relevant. However, I only care about English accuracy (like most of the industry), so please tell me there are more up-to-date leaderboards on UD English accuracy. If not, then it's time for someone to build an XLNet transformer fine-tuned on UD. Please answer, I really need to know.

@myavrum friendly ping

dan-zeman commented 2 years ago

The Penn Treebank is not a dependency treebank. You cannot compare constituent parsing scores with dependency parsing scores, that's like mixing apples and oranges (also, it is unclear what "accuracy" should mean in constituent parsing). Even if you have the same annotation scheme, evaluation of the same model on different datasets typically gives significantly different scores.

Some leaderboards can be found at the websites of the various parsing shared tasks on UD data (for example, here you can see 90.83% from the CoNLL 2018 shared task), but to the best of my knowledge, there is no continuously maintained and updated leaderboard.
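For context, the scores in those shared tasks are unlabeled/labeled attachment scores (UAS/LAS), computed token by token over the dependency trees, not a constituent-style accuracy. A minimal sketch of the metric is below; the function name is just illustrative, and the official conll18_ud_eval.py script from the shared task additionally handles tokenization and segmentation mismatches, which this sketch ignores.

```python
# Minimal sketch of UAS/LAS over aligned (head, deprel) pairs.
def attachment_scores(gold, pred):
    """gold, pred: lists of (head_index, deprel) tuples, aligned token by token."""
    assert len(gold) == len(pred) and gold
    uas_hits = sum(1 for (gh, _), (ph, _) in zip(gold, pred) if gh == ph)
    las_hits = sum(1 for (gh, gd), (ph, pd) in zip(gold, pred) if gh == ph and gd == pd)
    return uas_hits / len(gold), las_hits / len(gold)

# One 3-token sentence with a single mislabeled edge:
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl")]
print(attachment_scores(gold, pred))  # (1.0, 0.666...)
```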

But if you find the Penn Treebank better for your purposes, feel free to stay with it.

amir-zeldes commented 2 years ago

Based on the number cited, I think @LifeIsStrange is asking about scores on the dependency-converted version of PTB, for which Mrini et al. (2019) reported LAS around 96 (so, not constituents, but also not manually annotated UD). However, you should note that:

  1. PTB is single-domain (WSJ) and probably does not generalize well outside of newswire ca. 1990
  2. The conversion is done deterministically using CoreNLP and produces substantially simpler trees than manually annotated UD

For example, CoreNLP const2dep converted data has no non-projective trees, which we know are actually fairly common in English. Also, certain distinctions not captured in PTB trees are lost. For example, proper names like "James A. Talcott" are analyzed as right-to-left compound structures, (((James/compound) A./compound) Talcott/), rather than as flat left-to-right names, which is clearly wrong; and similarly, all nested noun structures are just right-to-left. There are many other issues. In short, the 96% score reflects a much easier problem and less correct, non-manually annotated dependencies.
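To make the name example concrete, here is roughly what the two analyses look like as (token, head, deprel) triples; the exact converted output can vary, so treat this as an illustration rather than the literal CoreNLP result.

```python
# Illustrative only: "James A. Talcott" under the two analyses, with heads as
# 1-based token indices within the name (the name is treated as a standalone
# fragment here, so its head token gets head 0 / "root" for brevity).

# Roughly what const2dep conversion yields: earlier tokens attach rightward
# to the final token via compound.
converted = [("James", 3, "compound"), ("A.", 3, "compound"), ("Talcott", 0, "root")]

# Manually annotated UD: flat, left-headed name.
ud_flat = [("James", 0, "root"), ("A.", 1, "flat"), ("Talcott", 1, "flat")]
```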

As for actual UD English scores using transformers, Trankit gives a reasonable idea of what an out-of-the-box biaffine parser achieves:

https://trankit.readthedocs.io/en/latest/performance.html
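If you want to try that directly, a minimal call looks roughly like this (field names as documented by Trankit; the English model is downloaded on first use):

```python
# Sketch of running Trankit's English pipeline and reading UD heads/deprels.
# Requires `pip install trankit`.
from trankit import Pipeline

p = Pipeline('english')  # XLM-R backbone with English-specific adapters
doc = p('James A. Talcott joined the board.')

for sent in doc['sentences']:
    for tok in sent['tokens']:
        print(tok['id'], tok['text'], tok.get('head'), tok.get('deprel'))
```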

But those numbers reflect older UD corpora and are just with XLM-RoBERTa (so not per-language embeddings). If you're willing to use complex setups (and Mrini et al. is also not simple), our own best score on English is LAS/UAS of 92.16/94.25 on the UD GUM corpus (12 genres, incl. conversation, web, fiction...). That result used some pretty complex model stacking, training on GUM, EWT, OntoNotes and additional multilayer annotations; see here:

https://github.com/gucorpling/amalgum

An older version of the pipeline is described in this paper:

https://aclanthology.org/2020.lrec-1.648.pdf