I've got most of the MS-AMR release ready. All completed files have been checked against the latest snapshot from Ulf and tested for a range of issues: overlapping identity chains, changes in what the chains refer to, and consistency of chains containing "i" with the speaker IDs I've extracted from the ERE data. If anyone has ideas for additional things to test, let me know!
I've been using a working format for a few months, but have been trying to hammer out an easy, interpretable format for the release. A given document would have a simple name like "msamr-dfb-023.gold.xml" and two sections. The first would be a declaration of what the "document" is: a list of the AMRs in the document, with the speaker and post IDs when available:
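To make the idea concrete, here is a minimal sketch of how such a declaration could be generated with Python's standard xml.etree.ElementTree. The element and attribute names (document, sentences, amr, docid, id, order, post, speaker) and all AMR, post, and speaker IDs are placeholders assumed for illustration, not the final format:

```python
import xml.etree.ElementTree as ET

# Hypothetical AMR entries: (amr_id, post_id, speaker_id). All values are made up.
amrs = [
    ("example-doc_0001.1", "p1", "speaker-a"),
    ("example-doc_0001.2", "p1", "speaker-a"),
    ("example-doc_0001.3", "p2", None),  # speaker ID not available
]

def build_declaration(doc_id, amrs):
    """Build an ordered list of AMRs, with post/speaker IDs only when available."""
    doc = ET.Element("document", {"docid": doc_id})
    sentences = ET.SubElement(doc, "sentences")
    for order, (amr_id, post, speaker) in enumerate(amrs, start=1):
        attrs = {"id": amr_id, "order": str(order)}
        if post is not None:
            attrs["post"] = post
        if speaker is not None:
            attrs["speaker"] = speaker
        ET.SubElement(sentences, "amr", attrs)
    return doc

declaration = build_declaration("msamr-dfb-023", amrs)
print(ET.tostring(declaration, encoding="unicode"))
```

Keeping the order explicit as an attribute, rather than relying on element order alone, is one design choice among several; it is shown here only because it makes the document definition easy to verify.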
Any suggestions? We want this to feel as obvious and as easy to understand as possible. Some questions:
We have the post ID, but we don't add any further structure describing the details of threaded discussions, such as when one AMR is re-quoted. Do we need it?
Three very specific wikification errors are still problematic (identity chains having multiple wiki links, where one is definitely wrong):
ldcpreferred bolt-eng-DF-200-192451-5796283_0090.8 (snt. 103 in workset dfb-0248): "Promised_Land" should probably be "-" (it's "Pakistan" in context)
cjconsensus DF-200-192448-618_9851.11 (snt. 11 in workset dfa-wset-56): "Ozzy_Osbourne" should actually be "Sharon_Osbourne" in context.
cjconsensus wb.eng_0009.23 (snt. 23 in workset wb-eng-0009): f / family :wiki "Michael_Jackson" should be f / family :wiki "Jackson_family"
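A check for this kind of error can be sketched as follows: collect the :wiki values attached to the mentions of each identity chain and flag any chain that carries more than one distinct link. The data layout (a dict from chain ID to mention tuples), the chain IDs, and the entity names are assumptions made up for illustration; "-" is treated as "no link" and ignored.

```python
# Hypothetical input: chain ID -> list of (amr_id, variable, wiki_value) mentions.
chains = {
    "chain-1": [("doc.1", "c", "Entity_A"), ("doc.5", "c2", "Entity_B")],
    "chain-2": [("doc.2", "p", "Entity_C"), ("doc.3", "p", "-")],
}

def conflicting_wiki_links(chains):
    """Return {chain_id: sorted wiki values} for chains with more than one distinct link."""
    conflicts = {}
    for chain_id, mentions in chains.items():
        # "-" marks a mention without a wiki link, so it cannot conflict with anything.
        links = {wiki for _, _, wiki in mentions if wiki != "-"}
        if len(links) > 1:
            conflicts[chain_id] = sorted(links)
    return conflicts

print(conflicting_wiki_links(chains))
```

A flagged chain still needs a human decision, since the check cannot tell which of the competing links is the wrong one.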
Current status:
set         documents   amrs
WB                 16    812
DF(LDC)            62   2689
DFB(UCO)           49   2056
DFA(UCO)          139   2163
total             266   7720
MS-AMR format and release
After the document declaration, the identity chains are marked explicitly as links between AMR variables in the document:
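As a rough illustration, a chain could be serialized as below. This is only a sketch: the element names (identity, identchain, mention), the attributes, and the AMR IDs and variables are placeholders assumed here, not a settled schema. The small helper at the end shows how an overlap test (no variable belonging to two chains) might be run on the result.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical chains: each is a list of (amr_id, variable) mentions.
chains = [
    [("example-doc_0001.1", "i"), ("example-doc_0001.2", "i"), ("example-doc_0001.3", "y")],
    [("example-doc_0001.1", "c"), ("example-doc_0001.3", "c2")],
]

def build_identity(chains):
    """Serialize each chain as a group of mentions pointing at AMR variables."""
    identity = ET.Element("identity")
    for n, mentions in enumerate(chains, start=1):
        chain = ET.SubElement(identity, "identchain", {"relationid": f"rel-{n}"})
        for amr_id, variable in mentions:
            ET.SubElement(chain, "mention", {"id": amr_id, "variable": variable})
    return identity

def overlapping_mentions(identity):
    """Return (amr_id, variable) pairs that occur more than once across all chains."""
    counts = Counter((m.get("id"), m.get("variable")) for m in identity.iter("mention"))
    return [key for key, count in counts.items() if count > 1]

identity = build_identity(chains)
print(ET.tostring(identity, encoding="unicode"))
print(overlapping_mentions(identity))
```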
Finally, we can encode set/member and part/whole relations, and any AMR variables they refer to that aren't in the coreference chains:
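One way this could look, again as a sketch under assumptions: each relation endpoint refers either to an existing chain (by a chain ID) or directly to an AMR variable that is not part of any chain. The element names (bridging, setmember, superset, member, partwhole, whole, part), the attribute names, and all IDs are hypothetical.

```python
import xml.etree.ElementTree as ET

def endpoint(parent, tag, ref):
    """Add an endpoint: a chain ID (str) or an (amr_id, variable) pair outside any chain."""
    if isinstance(ref, str):
        return ET.SubElement(parent, tag, {"chain": ref})
    amr_id, variable = ref
    return ET.SubElement(parent, tag, {"id": amr_id, "variable": variable})

def build_bridging(set_members, part_wholes):
    """Serialize set/member and part/whole relations under one parent element."""
    bridging = ET.Element("bridging")
    for superset, members in set_members:
        rel = ET.SubElement(bridging, "setmember")
        endpoint(rel, "superset", superset)
        for member in members:
            endpoint(rel, "member", member)
    for whole, parts in part_wholes:
        rel = ET.SubElement(bridging, "partwhole")
        endpoint(rel, "whole", whole)
        for part in parts:
            endpoint(rel, "part", part)
    return bridging

# Hypothetical data: "rel-N" strings name chains; tuples are variables outside any chain.
set_members = [("rel-1", ["rel-2", ("example-doc_0001.4", "p")])]
part_wholes = [(("example-doc_0001.2", "c"), ["rel-3"])]

bridging = build_bridging(set_members, part_wholes)
print(ET.tostring(bridging, encoding="unicode"))
```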