parsing evaluation metrics

mheilman commented 10 years ago

We need some methods/scripts to evaluate parsing performance. We probably want to do two things: a) replicate previous work that uses parseval so that we can easily compare against previously reported results (see table 3 in http://www.cc.gatech.edu/~jeisenst/papers/ji-acl-2014.pdf), and b) implement a more appropriate metric based on precision/recall of relations between spans, not just precision/recall of (labeled or unlabeled) spans as in parseval. See discussion from @sagae below.

The metrics should report both unlabeled and labeled performance.
The metrics should use the 18 coarse relations from Carlson et al.'s (2001) "Building a Discourse-tagged Corpus in the Framework of Rhetorical Structure Theory."
Discussion from @sagae

Looking at Fig 1 in http://www.isi.edu/~marcu/papers/sigdialbook2002.pdf, there are nine rhetorical relations, represented by the labeled directed arcs (same-unit is just a side effect of the annotation, and not a discourse relation). We really should be looking at precision and recall of the relations represented in these labeled arcs. So we would be looking for:

16 <- 17-26 : example
17-21 <- 22-26 : elaboration-additional
17-18 <- 19-21 : explanation-argumentative
22-25 <- 26 : consequence-s
17 <- 18 : attribution
19-20 <- 21 : attribution
19 <- 20 : elaboration-object-attribute-embedded
22 <- 23 : attribution-embedded
24 <- 25 : purpose

Precision and recall would be computed in the usual way over these relations. Successful identification of a relation requires the correct spans, the correct direction of the arrow, and the correct label. The list doesn't include 22-23 <- 24-25 : same-unit, but the parser does need to get this right to form the 22-25 span, so it's taken into account implicitly, which I think is the right way to handle it.
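
A minimal sketch of how the relation-level metric described above could be computed, not the code in this repository. The `Relation` type, its `target`/`source` field names (the span the arrow points to and the span it comes from), and the `relation_prf` function are hypothetical names introduced for illustration. The gold set is the list of nine relations above; the predicted set is made up to show one label error and one direction error.

```python
from typing import NamedTuple, Set, Tuple

Span = Tuple[int, int]  # (first EDU, last EDU), inclusive


class Relation(NamedTuple):
    # Hypothetical representation of "target <- source : label".
    target: Span
    source: Span
    label: str


def relation_prf(gold: Set[Relation], pred: Set[Relation], labeled: bool = True):
    """Return (precision, recall, F1) over relations.

    A predicted relation is correct only if both spans and the arrow
    direction match; when labeled=True the label must match as well.
    """
    if labeled:
        gold_items, pred_items = set(gold), set(pred)
    else:
        # Drop the label but keep span order, so direction still counts.
        gold_items = {(r.target, r.source) for r in gold}
        pred_items = {(r.target, r.source) for r in pred}
    correct = len(gold_items & pred_items)
    precision = correct / len(pred_items) if pred_items else 0.0
    recall = correct / len(gold_items) if gold_items else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Gold relations copied from the list above.
gold = {
    Relation((16, 16), (17, 26), "example"),
    Relation((17, 21), (22, 26), "elaboration-additional"),
    Relation((17, 18), (19, 21), "explanation-argumentative"),
    Relation((22, 25), (26, 26), "consequence-s"),
    Relation((17, 17), (18, 18), "attribution"),
    Relation((19, 20), (21, 21), "attribution"),
    Relation((19, 19), (20, 20), "elaboration-object-attribute-embedded"),
    Relation((22, 22), (23, 23), "attribution-embedded"),
    Relation((24, 24), (25, 25), "purpose"),
}

# Made-up parser output: one wrong label and one reversed arrow.
wrong_label = Relation((24, 24), (25, 25), "elaboration-additional")
reversed_arrow = Relation((18, 18), (17, 17), "attribution")
pred = (gold - {Relation((24, 24), (25, 25), "purpose"),
                Relation((17, 17), (18, 18), "attribution")}) | {wrong_label, reversed_arrow}

labeled_scores = relation_prf(gold, pred, labeled=True)
unlabeled_scores = relation_prf(gold, pred, labeled=False)
print("labeled   P/R/F1:", labeled_scores)
print("unlabeled P/R/F1:", unlabeled_scores)
```

In this made-up example the wrong label costs a point only in the labeled score, while the reversed arrow is counted as an error in both. Because same-unit is left out of the gold set, it is scored only implicitly, through whether the 22-25 span is formed.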

mheilman commented 10 years ago

Commit 12c5b59 implements the basic functionality for doing parseval, but it's not complete. Some edge cases still need to be dealt with (e.g., same-unit relations). See the TODO comments in the code.

mheilman commented 10 years ago

The paper about the HILDA system (http://dad.uni-bielefeld.de/index.php/dad/article/viewFile/591/1187) says to see Marcu, 2000, 143–144 for a discussion of how PARSEVAL was adapted. (I'm still waiting to get the book from interlibrary loan.)

Marcu, 2000 = The Theory and Practice of Discourse Parsing and Summarization

EducationalTestingService / rstfinder

parsing evaluation metrics #2
