Open mheilman opened 10 years ago
Commit 12c5b59 implements the basic functionality for doing parseval, but it's not complete. Some edge cases still need to be dealt with (e.g., same-unit relations). See the TODO comments in the code.
The paper about the HILDA system (http://dad.uni-bielefeld.de/index.php/dad/article/viewFile/591/1187) says to see Marcu, 2000, 143–144 for a discussion of how PARSEVAL was adapted. (I'm still waiting to get the book from interlibrary loan.)
Marcu, 2000 = The Theory and Practice of Discourse Parsing and Summarization
We need some methods/scripts to evaluate parsing performance. We probably want to do two things: a) replicate previous work that uses parseval so that we can easily report previous results (see table 3 in http://www.cc.gatech.edu/~jeisenst/papers/ji-acl-2014.pdf), and b) implement a more appropriate metric based on precision/recall of relations between spans, not just precision/recall of (labeled or unlabeled) spans as in parseval. See discussion from @sagae below.
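For (a), here is a minimal sketch of generic labeled/unlabeled span precision and recall. It is not Marcu's adaptation of PARSEVAL (those details are in the book), and it does not handle edge cases such as same-unit. The `(start, end, label)` span representation, the function names, and the example labels are all assumptions made for illustration.

```python
from collections import Counter


def precision_recall_f1(gold, predicted):
    """Return (precision, recall, F1) for two collections of hashable items."""
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    # Multiset intersection: an item counts as correct at most as many
    # times as it appears in both the gold and the predicted collection.
    num_correct = sum((gold_counts & pred_counts).values())
    precision = num_correct / sum(pred_counts.values()) if pred_counts else 0.0
    recall = num_correct / sum(gold_counts.values()) if gold_counts else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


def span_scores(gold_spans, pred_spans, labeled=True):
    """Score spans given as (start, end, label) tuples (hypothetical format).

    With labeled=False the label is dropped, so only the span
    boundaries have to match.
    """
    if not labeled:
        gold_spans = [(start, end) for start, end, _ in gold_spans]
        pred_spans = [(start, end) for start, end, _ in pred_spans]
    return precision_recall_f1(gold_spans, pred_spans)


# Made-up example spans over EDU indices; the labels are illustrative only.
gold = [(1, 2, "elaboration"), (3, 4, "contrast"), (1, 4, "background")]
pred = [(1, 2, "elaboration"), (3, 4, "cause"), (2, 4, "background")]

print("labeled:  ", span_scores(gold, pred, labeled=True))
print("unlabeled:", span_scores(gold, pred, labeled=False))
```

In this toy example only one of the three predicted spans matches with its label, while two match on boundaries alone, so the unlabeled scores come out higher than the labeled ones.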
Discussion from @sagae
Looking at Fig 1 in http://www.isi.edu/~marcu/papers/sigdialbook2002.pdf, there are nine rhetorical relations, represented by the labeled directed arcs (same-unit is just a side effect of the annotation, and not a discourse relation). We really should be looking at precision and recall of the relations represented in these labeled arcs. So we would be looking for:

and precision and recall would be computed in the usual way, and successful identification of a relation requires the correct spans, the correct direction of the arrow, and the correct label. The list doesn't include 22-23 <- 24-25 : same-unit, but the parser does need to get this right to form the 22-25 span, so it's taken into account implicitly, which I think is the right way.
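The relation-based metric described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Relation` tuple, the `relation_scores` function, the way direction is stored (as the literal arrow string), and all example relations except the same-unit one are hypothetical, not taken from the code in this repository or from the figure.

```python
from typing import NamedTuple, Tuple


class Relation(NamedTuple):
    """One labeled directed arc between two spans (hypothetical representation)."""
    left: Tuple[int, int]    # (first, last) unit index of the left span
    right: Tuple[int, int]   # (first, last) unit index of the right span
    direction: str           # arrow as written, "<-" or "->"
    label: str               # relation label


def relation_scores(gold, predicted, ignored_labels=("same-unit",)):
    """Return (precision, recall, F1) over relations.

    A predicted relation is correct only if both spans, the arrow
    direction, and the label all match a gold relation. Relations whose
    label is in ignored_labels are left out of the counts.
    """
    gold_set = {r for r in gold if r.label not in ignored_labels}
    pred_set = {r for r in predicted if r.label not in ignored_labels}
    num_correct = len(gold_set & pred_set)
    precision = num_correct / len(pred_set) if pred_set else 0.0
    recall = num_correct / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Made-up relations; only the same-unit arc comes from the discussion above.
gold_relations = [
    Relation((1, 1), (2, 2), "->", "elaboration"),
    Relation((1, 2), (3, 3), "<-", "contrast"),
    Relation((22, 23), (24, 25), "<-", "same-unit"),
]
pred_relations = [
    Relation((1, 1), (2, 2), "->", "elaboration"),
    Relation((1, 2), (3, 3), "->", "contrast"),  # wrong direction
    Relation((22, 23), (24, 25), "<-", "same-unit"),
]

print(relation_scores(gold_relations, pred_relations))
```

Leaving same-unit out of the counts follows the reasoning above: a parser that gets it wrong cannot build the 22-25 span, so any relation that has that span as an argument would fail to match and the error is still penalized.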