Open martinpopel opened 2 years ago
Thanks for looking into this - you are right, the MIN annotation was implemented a little last minute, and there seem to be bugs in the implementation.
That said, MIN has several purposes, and I do not think it is useless, so unless I hear a compelling reason to move away from it, I expect we will keep it in the future:
Concretely about the bugs:
the implementation currently only rules out named possessors
So we would have "Bob's Library", but "Library of Bob"?
That said, MIN has several purposes
I acknowledge these are valid purposes. I just think that we could achieve the same (or better) by including the head and all its flat
and conj
children (transitively, so conj->flat grandchild would be included as well).
I acknowledge these are valid purposes.
Yes, I must say I am like you and I also didn't immediately understand these issues, but some users feel very strongly about their usefulness!
I just think that we could achieve the same (or better) by including the head and all its flat and conj children (transitively, so conj->flat grandchild would be included as well).
Unfortunately (and for some applications fortunately) UD analyzed names with regular syntax compositionally, so for the "lookup in table of authorities" scenario (and arguably corresponding prioritization in fuzzy matching), this would not be sufficient. But for non-named entities, the implementation is indeed close to what you are saying.
Hi everybody
I am not sure about the context of this discussion, but being asked, let me make some general points and then get to Amir's questions
like Amir said, the reason for having a MIN is to allow for partial credit when the entire NP is used as a markable;
I don't remember the principles behind the definition of MIN in MUC and ACE, but the MIN in ARRAU is definitely meant to be the head of the NP
indeed, we have been consulting the UD guidelines trying to be as consistent as possible with those
except that, as some of the points you raise make clear, there are a number of cases which require adding on to the traditional notion of head / a few cases, in particular coordination, in which the notion of head defined in UD as we understand it is clearly inappropriate for anaphora (I am convinced that ultimately a `soft' definition of head like defined by Lee et al is the way to go)
regarding proper names, the idea is that the entire proper name is the MIN, but parentheticals / appositions are not included. Using <> to indicate the MIN,
- [
and indeed, in Amir's example
- [our friend <Kim Zhang>, the famous singer who was catapulted to fame last year]
coming to Amir's questions
[
`University Library of the Vrije Universiteit Amsterdam'
then indeed we would have
[
(this is how we did [
if I remember correctly we did have a couple of discussions about allowing discontinuous MINs but it was so rare that we never did it even though technically it would be possible. At any rate the point is somewhat moot until we get a UA scorer able to handle discontinuous mentions (or discontinuous MINs)
Massimo
On 07/01/2022 14:25, Amir Zeldes wrote:
Thanks for looking into this - you are right, the MIN annotation was implemented a little last minute, and there seem to be bugs in the implementation.
That said, MIN has several purposes, and I do not think it is useless, so unless I hear a compelling reason to move away from it, I expect we will keep it in the future:
- MIN is used in fuzzy matching for both NER and coreference resolution scoring. Systems that do not recover the exact boundaries of a mention but do contain the MIN span can receive partial credit in a fuzzy evaluation, and this allows corpora like GUM to participate in a task track with fuzzy scoring
- In training systems, it is possible to use a loss function or policy gradient that penalizes missing tokens in MIN spans more severely
- In many practical applications of NER, users may want to know the core lexical region of a mention, and this has been pointed out to me by industry users as well. Mentions such as "our friend Kim Zhang, the famous singer who was catapulted to fame last year" do not make it immediately obvious that the 'name' part is just "Kim Zhang"; but for many applications with dynamic authority tables (e.g. a list of this month's people of interest), it is a very common strategy to decide if a document is interesting by looking up precisely such names, verbatim, in a table. In many scenarios, MIN spans can be used for lookup in a knowledge base.
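As a rough illustration of the fuzzy-matching purpose listed above, the check below accepts a system mention when it contains the gold MIN and stays inside the gold mention boundaries. This is a sketch of one common reading of MIN-based partial credit, not the code of any actual scorer. The function names, the inclusive token offsets, and the tokenization of the example are assumptions made for illustration.

```python
from typing import Tuple

# Inclusive (start, end) token offsets; this representation is an assumption.
Span = Tuple[int, int]


def contains(outer: Span, inner: Span) -> bool:
    """Return True if `inner` lies entirely within `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def fuzzy_match(system: Span, gold: Span, gold_min: Span) -> bool:
    """Hypothetical partial-credit test: the system span must cover the
    gold MIN and must not extend beyond the gold mention."""
    return contains(system, gold_min) and contains(gold, system)


# "our friend Kim Zhang , the famous singer who was catapulted to fame last year"
# tokens 0..14, with MIN "Kim Zhang" at tokens 2..3 (tokenization assumed).
gold = (0, 14)
gold_min = (2, 3)

exact_name = fuzzy_match((2, 3), gold, gold_min)      # just "Kim Zhang"
with_modifier = fuzzy_match((0, 3), gold, gold_min)   # "our friend Kim Zhang"
half_name = fuzzy_match((0, 2), gold, gold_min)       # cuts off "Zhang"
too_long = fuzzy_match((2, 20), gold, gold_min)       # runs past the gold mention
```

Under this reading the first two system spans would get credit and the last two would not.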
Concretely about the bugs:
- I wonder if we truly want to eliminate discontinuous MIN, since it is easy to derive the contiguous span, but this gives more information. The idea to insist on contiguity is only an attempt to emulate ARRAU (which, surprisingly, has discontinuous mention spans! But not MIN...)
- Inclusion of named entity with named appos - I don't actually know the intended ARRAU policy here. What do you think @poesio https://github.com/poesio ? What is the correct MIN span of "the National Library of the Netherlands (Koninklijke Bibliotheek)"?
- "I am not sure why Vrije Universiteit Amsterdam is included in MIN" - I misspoke, actually the implementation currently only rules out named possessors. Arguably "University Library of the Vrije Universiteit Amsterdam" is correct, as names are meant to be captured in their entirety, and this is its name (but again @poesio https://github.com/poesio may be able to elucidate this)
— Reply to this email directly, view it on GitHub https://github.com/UniversalDependencies/UD_English-GUM/issues/44#issuecomment-1007445204, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGCOD6Z7PTPNDLS47VN3QGLUU3ZUHANCNFSM5LNZ76DQ. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.
You are receiving this because you were mentioned.Message ID: @.***>
Thanks for the quick feedback! Just a few questions:
[<the National Library of the Netherlands> ([Koninklijke Bibliotheek])]?
allowing discontinuous MINs but it was so rare that we never did it
And what about all the coordinations where the second conjunct has some non-PROPN left-children? If MIN span has to be there and coordinations are annotated as a single mention, I actually prefer "my cat and your dog" over "my cat and your dog".
MIN is to allow for partial credit when the entire NP is used as a markable
Yes, but we can use head for the same purpose because the (UD dependency) head should be always included in the MIN span.
number of cases which require adding on to the traditional notion of head
So can we find MIN span using a deterministic algorithm given a mention and its (UD) dependency tree (and all other mentions in the sentence)?
May I also ask what is the MIN span in the following mentions according to the GUM/ARRAU guidelines?
we can use head for the same purpose because the (UD dependency) head should be always included in the MIN span
The syntactic head is not always the MIN span:
[The city of <London>]
[her majesty <Queen Elizabeth>]
And depending on whether we like quantity nouns maybe:
[A number of <people>]
There are also inverted head constructions:
[A hell of a <day>]
I am also a little doubtful we can get these 100% automatically, but with some rules and lists, and given that we know about nested entities (at least in corpora with full nested mentions, such as ARRAU or GUM) we may be able to get this fairly close to gold just using rules, and maybe a classifier to spot suspicious cases.
I am also curious for Massimo's response, since I am a newcomer to MIN annotation!
The syntactic head is not always the MIN span
I know about many cases in ARRAU where the MIN span does not include the UD-syntactic head (my list of mentions is based on such research and our paper), but I was not fully sure whether (and in which cases)
- this is a systematic difference between heads and MIN,
- these are annotation errors,
- the ARRAU annotators had different trees (with different heads) in their minds.
In the last case I would expect that MIN spans in UD would be changed according to the UD dependency trees, because Massimo wrote "we have been consulting the UD guidelines trying to be as consistent as possible with those".
The reason why I expected MIN span should include the head in general is because Massimo wrote "MIN in ARRAU is definitely meant to be the head of the NP" and "there are a number of cases which require adding on to the traditional notion of head", which I (perhaps wrongly) interpreted as that the UD head is always included in MIN, but sometimes more words need to be added.
General point: the general idea for proper names was to mark proper names, not proper nouns; and to mark whatever the full proper name is, which need not consist of proper nouns only - so that 'the sun' and 'the pope' would also be considered proper names. (The original inspiration, from the time of GNOME, was Loebner's notion of proper name, but recently there has been more work on the distinction between proper name and proper noun, see e.g., the recent survey by Schluecker and Ackermann 2017.) The problem is that it's not always so obvious what counts as 'the full proper name', and we haven't found complete proposals, so we've been relying primarily on Quirk and Greenbaum's 'A University Grammar of English' (sections 4.23-4.30) and the Stanford Encyclopedia of Philosophy. Plus of course I'm not sure that all of my annotators always had the notion of proper name very clear ;-) Is there a UD-specific treatment?
Answers to specific questions inline
On 08/01/2022 18:38, Amir Zeldes wrote:
Thanks for the quick feedback! Just a few questions:
- Can you explain why "the" is included in MIN for [<the National Library of the Netherlands> ([Koninklijke Bibliotheek])]?
Because 'the' is part of the proper name? (as in 'The United States of America'?)
- What would you do with these cases:
  o We went to [Bob's Starbucks]
[<Bob's Starbucks>] if the possessive is part of the proper name, else [Bob's <Starbucks>]
  o It was donated by [the Andy Warhol Institute]
see above
  o The statue is exhibited at [Ripley's Believe It or Not!]
[<Ripley's Believe It or Not!>]
- If you would like discontinuous MIN in the future, I'm happy to implement it in GUM (or rather, it seems we forgot to implement the opposite ;)
Great!
On 09/01/2022 00:07, Martin Popel wrote:
allowing discontinuous MINs but it was so rare that we never did it
And what about all the coordinations where the second conjunct has some non-PROPN left-children? If MIN span has to be there and coordinations are annotated as a single mention, I actually prefer "my cat and your dog" over "my cat and your dog".
Sorry I am not sure I follow you Martin -
- for NP coordinations (as opposed to NPs with coordinated head nouns) in the end we decided to NOT mark a min because as far as I know the most common view is that coordinated NPs do not have a head. (NB in GNOME we had used the conjunction itself as head)
so in that example we would mark the heads of the coordinated NPs, but no head for the coordination:
[[[my] <cat>] and [[your] <dog>]]
PS am I right that in UD, John would be the head of John and Mary?
MIN is to allow for partial credit when the entire NP is used as a markable
Yes, but we can use head for the same purpose because the (UD dependency) head should be always included in the MIN span.
Indeed. and in most cases, in ARRAU we do use the head as min. and in most cases the UD head is also the ARRAU min - coordinated NPs should be the only exception I think
number of cases which require adding on to the traditional notion of head
So can we find MIN span using a deterministic algorithm given a mention and its (UD) dependency tree (and all other mentions in the sentence)?
Indeed - see e.g. the paper Nafise Moosavi, Michael Strube and myself wrote for ACL 2019
May I also ask what is the MIN span in the following mentions according to the GUM/ARRAU guidelines?
- one of the Russell Group universities
[
of [the Russell Group ]]
even though we generally go for an NP rather than a DP analysis, in the case of partitive constructions we use the determiner
- President Carter
- Mr. Simmons
- vitamin C
in all these cases, the entire proper name is also the MIN
- $25 million
[<$> 25 million] (we have lots of these cases in ARRAU_WSJ)
- most analysts
[most <analysts>] (as I said above, we generally go for the NP analysis ... )
- most of the analysts
[<most> of [the <analysts>]] .... which unfortunately means these two NPs have different mins.
- half the trust’s total of $268 million
[<half> [[the <trust’s>] <total> of [<$> 268 million]]]
- April 9, 2007
[<April 9, [<2007>] >]
- a year later
[a <year>] later
- more than $1.6 trillion
[more than <$>1.6 trillion]
- all of this
[<all> of [<this>]]
- former chairman, Howard Weichern
[former chairman, [
]]
- 3 % to 10 %
[[3 <% >] to [10 <%>]] percentages is actually one of the cases we find more problematic and have discussed several times over the years
- a 50 % stake
[a 50 % <stake>]
- the exploration side of the unit
[the exploration <side> of [the <unit>]]
- the rate banks charge each other on overnight loans
[the <rate> banks charge each other on overnight loans] not sure what is the problem here?
On 09/01/2022 17:43, Amir Zeldes wrote:
we can use head for the same purpose because the (UD dependency) head should be always included in the MIN span
The syntactic head is not always the MIN span:
- [The city of <London>]
- [her majesty <Queen Elizabeth>]
Sorry, what would be the UD head here? In ARRAU we mark
[The city of [
And depending on whether we like quantity nouns maybe:
- [A number of <people>]
We do treat some complex determiners as determiners but not all:
[between 4 and 5
There are also inverted head constructions:
- [A hell of a <day>]
good example, I would probably suggest treating "a hell of" as a determiner but I'm not sure
I am also a little doubtful we can get these 100% automatically, but with some rules and lists, and given that we know about nested entities (at least in corpora with full nested mentions, such as ARRAU or GUM) we may be able to get this fairly close to gold just using rules, and maybe a classifier to spot suspicious cases.
I am also curious for Massimo's response, since I am a newcomer to MIN annotation!
What would be the UD head in those cases?
On 09/01/2022 22:08, Martin Popel wrote:
The syntactic head is not always the MIN span
I know about many cases in ARRAU where MIN span does not include the UD-syntactic head (my list of mentions is based on such research and paper), but I was not fully sure if (in which cases)
- this is a systematic difference between heads and MIN,
- these are annotation errors,
- the ARRAU annotators had different trees (with different heads) in their minds.
In the last case I would expect MIN spans in UD would be changed according to the UD dependency trees because Massimo wrote "we have been consulting the UD guidelines trying to be as consistent as possible with those".
The reason why I expected MIN span should include the head in general is because Massimo wrote "MIN in ARRAU is definitely meant to be the head of the NP" and "there are a number of cases which require adding on to the traditional notion of head", which I (perhaps wrongly) interpreted as that the UD head is always included in MIN, but sometimes more words need to be added.
In UD I'm pretty sure "number" and "hell" would be the heads.
I think there are a number of differences between the choices above and what GUM currently does, but GUM's current MIN is probably buggy anyway, and we do not have a long tradition of this either. Some thoughts:
I actually prefer "my cat and your dog" over "my cat and your dog":
Sorry I am not sure I follow you Martin -
This (public) discussion may be better viewed at GitHub https://github.com/UniversalDependencies/UD_English-GUM/issues/44 with Markdown formatting, so that MIN span in my example is marked with bold.
so in that example we would mark the heads of the coordinated NPs, but no head for the coordination:
[[[my] <cat>] and [[your] <dog>]]
So coordination mention should have no MIN span in ARRAU? But it should always have embedded mentions of all the conjuncts, which have their MIN spans. So we could define (or automatically fill) coordination MIN span based on the union of its conjuncts' MIN spans (which will be discontinuous in many cases). OK, this makes sense.
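The union idea can be sketched in a few lines. This is only an illustration under assumptions: MIN spans are represented as sets of 1-based token ids, and the helper names are made up for this example. It also shows how a contiguous span could be derived from a discontinuous one if a scorer cannot handle gaps.

```python
from typing import Iterable, List, Set


def coordination_min(conjunct_mins: Iterable[Set[int]]) -> List[int]:
    """Union of the conjuncts' MIN token ids (possibly discontinuous)."""
    merged: Set[int] = set()
    for ids in conjunct_mins:
        merged |= ids
    return sorted(merged)


def contiguous_hull(ids: List[int]) -> List[int]:
    """Smallest contiguous token range covering all ids in a MIN span."""
    if not ids:
        return []
    return list(range(min(ids), max(ids) + 1))


# "my cat and your dog": tokens 1..5, conjunct MINs are "cat" (2) and "dog" (5).
union_min = coordination_min([{2}, {5}])
hull = contiguous_hull(union_min)
print(union_min)  # [2, 5]
print(hull)       # [2, 3, 4, 5]
```

The discontinuous version keeps more information, since the contiguous one can always be recomputed from it.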
am I right that in UD, John would be the head of John and Mary?
Yes. The first conjunct is the head of coordinations in UD.
in most cases, in ARRAU we do use the head as min and in most cases the UD head is also the ARRAU min - coordinated NPs should be the only exception I think
There are many other mismatches. We wrote a paper about it.
In ARRAU 2.1 LDC, `wsjarrau_0617_markable_level.xml`, I see `markable_280` with `span="word_1468..word_1469" min_ids="word_1468..word_1469"`, which means that in "false claims" both words of the mention are the MIN span. Why? I don't think "false claims" is a proper name.
Maybe we should ignore all the `markable_level.xml` files, as these don't include the coreference anyway. So our paper is based on the `coref_level.xml` files instead. Unfortunately, these files are not consistent in ARRAU. For example, in `wsjarrau_0617_coref_level.xml`, there is no mention with `span="word_1468..word_1469"`, but there is a mention with `id="markable_873" span="word_1466..word_1478" min_ids="word_1469" min_words="claims"` corresponding to "fraud and false claims in connection with a logistics-computer contract for the Army", which has a wrong span (the coordination is not "fraud and claims", but "charges and claims") and also a strange MIN span, since "claims" is the second conjunct.
- President Carter
the entire proper name is also the MIN
OK, but in "our president, Jimmy Carter", only "Jimmy Carter" would be the MIN, am I right?
BTW: UD treats these two differently as well, but the head is always the president: "President Carter" is a flat structure (all words depend on the first word with `deprel=flat`, or `flat:name` in this case), but in "our president, Carter", "Carter" depends on "president" with `deprel=appos`. See a paragraph about "fixed/close" vs. "loose/wide" apposition.
[<$> 25 million]
(we have lots of these cases in ARRAU_WSJ
Yes, but according to the ARRAU `coref_level.xml` files, the MIN span is "million" in "$ 25 million".
And also I see "President Carter" and "Mr. Simmons 's".
It seems these are not exceptions (rare annotation errors), but a systematic divergence between the `coref_level.xml` and `markable_level.xml` files. Why is ARRAU using two different guidelines for annotating mentions and MIN?
(Perhaps, we should not discuss it here, in the GUM GitHub issues, but I am not aware of any ARRAU repo with public issues.)
[most <analysts>]
[<most> of [the <analysts>]]
.... which unfortunately means these two NPs have different mins
They have a different head in UD (same as the MIN you marked), so that's OK from my point of view.
That said, I've found "most analysts" in `wsjarrau_1110_coref_level.xml`. I could understand the reasons for such a decision (similarity to "most/majority of"), but again, I would like to see guidelines for such MIN annotation.
[<half> [[the <trust’s>] <total>of [<$> 268 million]]]
So the MIN of the outer mention is "half". In UD, "total" is the head of this phrase and "half" depends on it with `deprel=det:predet`. So this is another mismatch between ARRAU's MIN and UD's head.
[a <year>] later
But according to `vpc_0766_coref_level.xml`, it is "a year later" (i.e. "later" is part of the mention), and according to `vpc_0766_markable_level.xml`, there is no mention at all.
I could not find UD guidelines on how to parse such a phrase.
English-GUM has two occurrences of "year later" with a different head each time.
[more than <$>1.6 trillion]
But according to `wsjarrau_0692_coref_level.xml`, it is "more than $1.6 trillion", and according to `wsjarrau_0692_markable_level.xml`, the whole mention is MIN.
[<all> of [<this>]]
But according to `coref_level.xml`, "this" is the MIN of the whole (outer) mention.
[[3 <% >] to [10 <%>]]
percentages is actually one of the cases we find more problematic and have discussed several times over the years
But according to `coref_level.xml`, the second "%" is the MIN of the whole mention.
In UD, the first conjunct is the head of coordination, so the first "%" is the head of the whole mention.
[a 50 % <stake>]
But according to `coref_level.xml`, "%" is the MIN.
[the exploration <side> of [the <unit>]]
But according to `coref_level.xml`, "exploration" is the MIN.
[the <rate> banks charge each other on overnight loans]
But according to `coref_level.xml`, "banks" is the MIN.
I am sorry for the longish list of mismatches, but I think it illustrates nicely my feeling that the MIN annotation guidelines are far from clear and that there are many disagreements even within ARRAU ('s annotators).
There are many phenomena (most vs most of, dates, currencies, percentages, coordinations, nested coordinations,...) for which it is difficult to define head/MIN.
I think it could be easier to just adopt the decisions in UD, instead of working on parallel guidelines.
If there is a need for annotating more than a single head word, we could have a deterministic algorithm which adds other conjuncts (all `deprel=conj` children) and the whole proper name (this is slightly more difficult, depending on the exact definition of "proper name", but we can start with adding all `deprel=flat` children).
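A minimal sketch of such a deterministic procedure is given below. It starts at the mention head and follows `conj` and `flat` edges transitively (subtypes such as `flat:name` are matched by their base relation). The tree encoding as a dict from token id to `(head id, deprel)` and the function name are assumptions for illustration; the dict is assumed to contain only the tokens of the mention.

```python
from typing import Dict, List, Tuple

# token id -> (head id, deprel); encoding assumed for this sketch
Tree = Dict[int, Tuple[int, str]]


def min_from_tree(head: int, tree: Tree,
                  follow: Tuple[str, ...] = ("conj", "flat")) -> List[int]:
    """Collect the head plus all tokens reachable through `follow` edges."""
    children: Dict[int, List[Tuple[int, str]]] = {}
    for token, (parent, deprel) in tree.items():
        children.setdefault(parent, []).append((token, deprel))

    selected = set()
    stack = [head]
    while stack:
        node = stack.pop()
        selected.add(node)
        for child, deprel in children.get(node, []):
            base = deprel.split(":")[0]  # flat:name -> flat
            if base in follow and child not in selected:
                stack.append(child)
    return sorted(selected)


# "President Carter": Carter attaches to President with a flat relation.
president = {1: (0, "root"), 2: (1, "flat")}
# "John and Mary Smith": Mary is conj of John, Smith is flat:name of Mary.
coordination = {1: (0, "root"), 2: (3, "cc"), 3: (1, "conj"), 4: (3, "flat:name")}

print(min_from_tree(1, president))     # [1, 2]
print(min_from_tree(1, coordination))  # [1, 3, 4]
```

The second example shows the transitive case (a `conj` child with its own `flat` child), and the conjunction "and" is left out because `cc` is not followed.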
I agree this could be clearer, and essentially I would be happiest if we could more or less predict this from the trees plus POS tags (which indicate proper names) and nested entities. Taken together, they offer a lot of information, and give us the opportunity to make MINs follow what I take to be the three main principles:
If we want to build some exceptions on top of that for semantically weak heads (a number/couple/bunch/sort/kind of...) then that's fine too, and of course we could iteratively improve the algorithm.
I've learned from @amir-zeldes that GUM attempts to reproduce ARRAU's guidelines for MIN (minimal span of coreference mentions):
So these rules can be (and maybe were) implemented fully automatically given the dependency tree.
In GUM_academic_librarians-8, I see `Entity=(organization-41-new-2,3,6,8,9,14,15,18,19,20` meaning "the National Library of the Netherlands (Koninklijke Bibliotheek), and the University Library of the Vrije Universiteit Amsterdam" (MIN span marked in bold).
`deprel=appos`, i.e. Koninklijke Bibliotheek in MIN.
`organization-6`, i.e. a different entity than `organization-41`.
In GUM_academic_librarians-16, I see `Entity=(abstract-92-new-4,8,9-sgl` meaning "a four step approach with a Working Out Loud-principle".
My main question is: what is the purpose of such MIN annotation? Why not include just the single head word for those who need a simple solution (so they don't need to extract the head from the dependency tree)? Those with special needs can access the dependency tree and decide themselves whether to include PROPN/NNP/appos/prepositions/articles/conjuncts/etc for a given purpose.
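Since the bold marking of the MIN span does not survive in plain text, the sketch below recovers the MIN words from the first Entity value quoted above. It assumes that the fourth hyphen-separated field lists 1-based token positions within the mention, and that parentheses and the comma are separate tokens. Both assumptions are inferred from this one example and are not taken from the GUM documentation; the function name is hypothetical.

```python
from typing import List


def min_positions(entity_value: str) -> List[int]:
    """Extract MIN positions from e.g. 'organization-41-new-2,3,6'.

    Assumed field order: type, entity id, information status, MIN positions,
    optionally followed by further fields.
    """
    fields = entity_value.split("-")
    return [int(position) for position in fields[3].split(",")]


mention = ("the National Library of the Netherlands ( Koninklijke Bibliotheek ) , "
           "and the University Library of the Vrije Universiteit Amsterdam").split()

positions = min_positions("organization-41-new-2,3,6,8,9,14,15,18,19,20")
min_words = [mention[p - 1] for p in positions]  # positions are 1-based
print(" ".join(min_words))
# National Library Netherlands Koninklijke Bibliotheek University Library Vrije Universiteit Amsterdam

positions_16 = min_positions("abstract-92-new-4,8,9-sgl")
print(positions_16)  # [4, 8, 9]
```

Under these assumptions the MIN of the first mention is discontinuous and includes the apposition "Koninklijke Bibliotheek".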