First, I wanted to refresh the multi-sentence discussion and drag it into GitHub. I'm assuming and referencing Dan Marcou's proposal here -- would it be ok to post that here too?
Secondly, to continue the discussion, I wanted to flesh out this idea of doing that proposed kind of annotation in our current tool for document-level annotation, Anafora. I'm sure that if we were to sit down and design an editor for coreference we might be able to set up something more fluid, but I wanted to show that most of the functionality we'd want is already in this tool, so it could be a good provisional way of testing things out. (My assumption is that we could annotate over AMRs like this, but that those annotations would be converted to the kind of in-line format proposed in Daniel's multi-sentence document.) The first video shows a very simplistic version of just doing coreference over AMRs (not RED, just coreference):
RED/AMR part1 (3:28)
The next videos expand on this with additional directions that we could consider. I'm framing this using the formalism for RED, which I think is a nice version of "ambitious" annotation, but there are clearly other discourse annotation directions that should be considered. The RED idea is that alongside our annotation of coreference, we should be marking "document level" features like modality, tense, and event or entity status, and that we could even have a second stage marking causal and temporal relationships between events. This video shows how that would work over AMRs:
RED/AMR part2 (3:48)
I also wanted to go further into the idea that with a stand-off tool, we could actually pre-annotate a bunch of these proposed features and just have annotators correct them. I've posted an additional video for that (apologies for the low audio volume in these two):
RED/AMR part3 (3:27)
Finally, just to show how this would work over a hard domain like the biomedical data, I posted an annotation of about half the data in that original multi-sentence document. This is a bit bumbling (this looks like hard data to handle), but I'd hope it shows that even pretty hard domains wouldn't be all that hard to handle with something like Anafora/RED:
RED/AMR part4 (4:36)
This is partly to just keep the conversation going on these, and to present the RED idea. I'll hopefully have a "converter" soon to spit these annotations out into an inline AMR format too; when I do, I'll post some annotations here.
I wanted to put up some discussion points in case we get to multi-sentence issues this week.
The main claim in the videos above is that we should consider a tool designed for multi-sentence annotation, like Anafora.
This would assume we annotate in that tool and then immediately convert to a within-AMR format of some kind.
Annotating eight mentions of the same entity in Anafora requires two button presses and eight clicks. Speed and ease of annotation might matter a lot here.
If people are interested, I may be able to set up accounts and documents for them.
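To make the "annotate stand-off, then convert" step concrete, here is a minimal sketch of what such a conversion could look like. Everything in it is an assumption for illustration: the input shape (coreference chains as lists of sentence-id/variable pairs), the function names, and the `# ::coref` output line are all hypothetical, and this is neither Anafora's export format nor the in-line format from the multi-sentence proposal.

```python
from collections import defaultdict


def chains_to_inline(chains):
    """Map stand-off coreference chains onto per-sentence variable labels.

    `chains` is a list of chains; each chain is a list of
    (sentence_id, amr_variable) mentions. This input shape is hypothetical.
    Returns {sentence_id: {amr_variable: chain_id}}.
    """
    inline = defaultdict(dict)
    for index, chain in enumerate(chains, start=1):
        chain_id = f"c{index}"
        for sentence_id, variable in chain:
            # A variable should belong to at most one identity chain.
            if variable in inline[sentence_id]:
                raise ValueError(
                    f"{sentence_id}.{variable} appears in more than one chain"
                )
            inline[sentence_id][variable] = chain_id
    return dict(inline)


def format_coref_lines(inline):
    """Render one hypothetical '# ::coref' metadata line per sentence."""
    lines = []
    for sentence_id in sorted(inline):
        pairs = " ".join(
            f"{variable}={chain_id}"
            for variable, chain_id in sorted(inline[sentence_id].items())
        )
        lines.append(f"# ::coref {sentence_id} {pairs}")
    return lines


if __name__ == "__main__":
    example_chains = [
        [("s1", "p"), ("s2", "h")],  # the same entity in sentences 1 and 2
        [("s1", "c")],               # a singleton mention
    ]
    for line in format_coref_lines(chains_to_inline(example_chains)):
        print(line)
```

The point is only that the stand-off annotation carries enough information to be rewritten per sentence; whatever in-line format we settle on would replace the made-up `# ::coref` line.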
The secondary claim is that we should talk more about possible richer, related annotations. There is a big overhead to understanding the document and its coreference relations that we would already be paying, so it might make sense to talk about doing other important document-level things.
Importantly, there are 75 DEFT-relevant documents that have (or will have) RED relations (including coreference). It would be an important first step to get those documents in the AMR queue; if so, this is free multi-sentence AMR data.
The theoretical details of the "we should consider things beyond coreference" idea are:
We don't currently encode realicity, generic/actual, polarity, whether something is a real event on a timeline (does a fight event occur whenever we mention firefighters?), etc.
Annotations like "generic vs actual" help a lot in controlling coreference quality.
Event relations -- causality, temporal order and containment, event substructure -- could also be added here.
For many of these (and even coreference itself), many decisions are somewhat deterministic. With RED annotation in Anafora, simple color coding has been a good solution for this, but I'd mainly claim that some kind of pre-annotation capability (however implemented) could be very important.
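As a rough illustration of the pre-annotation idea, the sketch below guesses a polarity value for each AMR instance from an existing `:polarity -` attribute and marks the guess as a suggestion for an annotator to confirm or correct. The triple-based input, the `POS`/`NEG` labels, the `status` field, and the function name are assumptions made for the example, not how RED pre-annotation is actually implemented.

```python
def preannotate_polarity(triples):
    """Suggest a polarity value for every AMR instance.

    `triples` is a list of (source, role, target) tuples, e.g.
    ("f", ":instance", "fight-01") or ("f", ":polarity", "-").
    Returns {variable: {"polarity": ..., "status": "suggested"}}.
    """
    suggestions = {}
    # Every variable introduced by an :instance triple gets a default guess.
    for source, role, _target in triples:
        if role == ":instance":
            suggestions.setdefault(
                source, {"polarity": "POS", "status": "suggested"}
            )
    # An explicit ':polarity -' on a variable flips the guess to negative.
    for source, role, target in triples:
        if role == ":polarity" and target == "-" and source in suggestions:
            suggestions[source]["polarity"] = "NEG"
    return suggestions


if __name__ == "__main__":
    example = [
        ("f", ":instance", "fight-01"),
        ("f", ":polarity", "-"),
        ("f", ":ARG0", "p"),
        ("p", ":instance", "person"),
    ]
    for variable, suggestion in preannotate_polarity(example).items():
        print(variable, suggestion)
```

A tool could then display suggested values differently from confirmed ones (for instance with the color coding mentioned above), so annotators only need to touch the cases where the guess is wrong.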