UniversalDependencies / UD_English-GUM

Usefulness of MIN annotation #44

Open martinpopel opened 2 years ago

martinpopel commented 2 years ago

I've learned from @amir-zeldes that GUM attempts to reproduce ARRAU's guidelines for MIN (minimal span of coreference mentions):

  • For a simple NP, take the head
  • For a coordinate NP, take the heads of all coordinated nested NPs; because MIN should be contiguous, make the MIN span start at the first head and end at the last
  • For proper NPs, take the head proper noun and all other proper nouns dominated by it, provided that they do not belong to a different entity:
    • For "the Federal Bureau of Investigations", MIN is 2-5, because "the" is not proper, but "Investigations" is tagged NNP, and "of" must be included due to contiguity
    • For "Bob 's mother Helen", MIN=4, because "Bob" is part of a different entity, and the rest is not NNP

So these rules can be (and maybe were) implemented fully automatically given the dependency tree.
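As a rough illustration of how the three rules above could be automated, here is a minimal sketch. It is a simplified reading of the rules, not GUM's actual implementation: the `Token` class, the `min_span` function and the `other_entity` parameter are hypothetical names, and "proper nouns dominated by the head" is approximated as "all NNP/NNPS tokens in the mention that are not part of a different (nested) entity".

```python
from dataclasses import dataclass

PROPER_TAGS = {"NNP", "NNPS"}

@dataclass
class Token:
    idx: int     # 1-based position within the mention
    form: str
    xpos: str    # PTB-style tag
    head: int    # idx of the governor within the mention; 0 marks the mention head
    deprel: str

def min_span(tokens, other_entity=frozenset()):
    """Return (start, end), 1-based inclusive, of a contiguous MIN span."""
    # Proper NP: proper nouns that do not belong to a different entity.
    proper = [t.idx for t in tokens
              if t.xpos in PROPER_TAGS and t.idx not in other_entity]
    if proper:
        picked = proper
    else:
        # Simple NP: the head; coordinate NP: the head plus its conj children.
        head = next(t for t in tokens if t.head == 0)
        conjuncts = [t.idx for t in tokens
                     if t.head == head.idx and t.deprel == "conj"]
        picked = [head.idx] + conjuncts
    # Contiguity: stretch from the first picked token to the last one.
    return min(picked), max(picked)

fbi = [Token(1, "the", "DT", 3, "det"), Token(2, "Federal", "NNP", 3, "amod"),
       Token(3, "Bureau", "NNP", 0, "root"), Token(4, "of", "IN", 5, "case"),
       Token(5, "Investigations", "NNP", 3, "nmod")]
helen = [Token(1, "Bob", "NNP", 3, "nmod:poss"), Token(2, "'s", "POS", 1, "case"),
         Token(3, "mother", "NN", 0, "root"), Token(4, "Helen", "NNP", 3, "appos")]
pets = [Token(1, "my", "PRP$", 2, "nmod:poss"), Token(2, "cat", "NN", 0, "root"),
        Token(3, "and", "CC", 5, "cc"), Token(4, "your", "PRP$", 5, "nmod:poss"),
        Token(5, "dog", "NN", 2, "conj")]

print(min_span(fbi))                      # (2, 5)
print(min_span(helen, other_entity={1}))  # (4, 4)
print(min_span(pets))                     # (2, 5)
```

The dependency labels in the toy examples are only illustrative; the sketch reads nothing but `xpos`, `head` and `deprel == "conj"`.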

In GUM_academic_librarians-8, I see Entity=(organization-41-new-2,3,6,8,9,14,15,18,19,20 meaning "the **National Library** of the **Netherlands** (**Koninklijke Bibliotheek**), and the **University Library** of the **Vrije Universiteit Amsterdam**" (MIN span marked in bold).

In GUM_academic_librarians-16, I see Entity=(abstract-92-new-4,8,9-sgl meaning "a four step approach with a Working Out Loud-principle".
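To make the two annotations above easier to read, here is a small sketch that splits such an opening Entity bracket and maps the MIN indices back to words. The field layout (type, id, information status, MIN indices, then anything else) is inferred only from the two examples quoted here, and the tokenization of the first mention (parentheses and comma as separate tokens) is an assumption; `parse_entity_open` and `min_words` are hypothetical helper names.

```python
def parse_entity_open(value):
    """Split an opening Entity bracket such as '(abstract-92-new-4,8,9-sgl'."""
    fields = value.lstrip("(").rstrip(")").split("-")
    etype, eid, infstat, min_ids = fields[:4]
    return {
        "etype": etype,
        "eid": eid,
        "infstat": infstat,
        # MIN indices are 1-based positions inside the mention
        "min": [int(i) for i in min_ids.split(",")],
        "rest": fields[4:],
    }

def min_words(words, min_ids):
    """Return the mention's words whose 1-based position is listed in min_ids."""
    return [w for i, w in enumerate(words, 1) if i in min_ids]

mention = ("the National Library of the Netherlands ( Koninklijke Bibliotheek ) , "
           "and the University Library of the Vrije Universiteit Amsterdam").split()
entity = parse_entity_open("(organization-41-new-2,3,6,8,9,14,15,18,19,20")
print(min_words(mention, entity["min"]))
```

Under these assumptions the MIN tokens come out as "National Library", "Netherlands", "Koninklijke Bibliotheek", "University Library" and "Vrije Universiteit Amsterdam".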

My main question is: what is the purpose of such MIN annotation? Why not include just the single head word for those who need a simple solution (so they don't need to extract the head from the dependency tree)? Those with special needs can access the dependency tree and decide themselves whether to include PROPN/NNP/appos/prepositions/articles/conjuncts/etc. for a given purpose.

amir-zeldes commented 2 years ago

Thanks for looking into this - you are right, the MIN annotation was implemented a little last minute, and there seem to be bugs in the implementation.

That said, MIN has several purposes, and I do not think it is useless, so unless I hear a compelling reason to move away from it, I expect we will keep it in the future:

  1. MIN is used in fuzzy matching for both NER and coreference resolution scoring. Systems that do not recover the exact boundaries of a mention but do contain the MIN span can receive partial credit in a fuzzy evaluation, and this allows corpora like GUM to participate in a task track with fuzzy scoring
  2. In training systems, it is possible to use a loss function or policy gradient that penalizes missing tokens in MIN spans more severely
  3. In many practical applications of NER, users may want to know the core lexical region of a mention, and this has been pointed out to me by industry users as well. Mentions such as "our friend Kim Zhang, the famous singer who was catapulted to fame last year" do not make it immediately obvious that the 'name' part is just "Kim Zhang"; but for many applications with dynamic authority tables (e.g. a list of this month's people of interest), it is a very common strategy to decide if a document is interesting by looking up precisely such names, verbatim, in a table. In many scenarios, MIN spans can be used for lookup in a knowledge base.
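The fuzzy-matching idea in point 1 can be sketched as follows. This is one common reading of MIN-based matching (a predicted span counts if it stays inside the gold mention and covers every MIN token); it is not the code of any particular scorer, and `fuzzy_match` is a hypothetical name.

```python
def fuzzy_match(pred, gold, gold_min):
    """pred and gold are (start, end) inclusive spans; gold_min is a set of token indices."""
    p_start, p_end = pred
    g_start, g_end = gold
    inside_gold = g_start <= p_start and p_end <= g_end
    covers_min = all(p_start <= i <= p_end for i in gold_min)
    return inside_gold and covers_min

# "our friend Kim Zhang , the famous singer who was catapulted to fame last year"
gold = (1, 15)       # the full mention, 15 tokens
gold_min = {3, 4}    # "Kim Zhang"

print(fuzzy_match((1, 4), gold, gold_min))   # True: shorter than gold, but covers MIN
print(fuzzy_match((1, 3), gold, gold_min))   # False: "Zhang" is missing
print(fuzzy_match((3, 20), gold, gold_min))  # False: runs past the gold mention
```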

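Concretely about the bugs:

  • I wonder if we truly want to eliminate discontinuous MIN, since it is easy to derive the contiguous span, but this gives more information. The idea to insist on contiguity is only an attempt to emulate ARRAU (which, surprisingly, has discontinuous mention spans! But not MIN...)
  • Inclusion of named entity with named appos - I don't actually know the intended ARRAU policy here. What do you think @poesio? What is the correct MIN span of "the National Library of the Netherlands (Koninklijke Bibliotheek)"?
  • "I am not sure why Vrije Universiteit Amsterdam is included in MIN" - I misspoke, actually the implementation currently only rules out named possessors. Arguably "University Library of the Vrije Universiteit Amsterdam" is correct, as names are meant to be captured in their entirety, and this is its name (but again @poesio may be able to elucidate this)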

martinpopel commented 2 years ago

the implementation currently only rules out named possessors

So we would have "Bob's Library", but "Library of Bob"?

That said, MIN has several purposes

I acknowledge these are valid purposes. I just think that we could achieve the same (or better) by including the head and all its flat and conj children (transitively, so conj->flat grandchild would be included as well).
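The proposal above (head plus all flat and conj children, transitively) could look roughly like this. It is a sketch only: the `children` mapping and the function name `head_flat_conj` are hypothetical, and relation subtypes such as flat:name are treated like their base relation.

```python
def head_flat_conj(head_id, children):
    """children maps a token id to a list of (child_id, deprel) pairs."""
    keep, stack = set(), [head_id]
    while stack:
        node = stack.pop()
        keep.add(node)
        for child, deprel in children.get(node, []):
            # Follow only flat and conj edges (including subtypes like flat:name).
            if deprel.split(":")[0] in {"flat", "conj"}:
                stack.append(child)
    return sorted(keep)

# Toy tree for "President Carter and Mr. Simmons" (tokens 1-5)
children = {
    1: [(2, "flat"), (4, "conj")],
    4: [(3, "cc"), (5, "flat")],
}
print(head_flat_conj(1, children))  # [1, 2, 4, 5]
```

Note that the result can be discontinuous (token 3, "and", is skipped), so a contiguity requirement would need an extra step.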

amir-zeldes commented 2 years ago

I acknowledge these are valid purposes.

Yes, I must say I am like you and I also didn't immediately understand these issues, but some users feel very strongly about their usefulness!

I just think that we could achieve the same (or better) by including the head and all its flat and conj children (transitively, so conj->flat grandchild would be included as well).

Unfortunately (and for some applications fortunately) UD analyzes names with regular syntax compositionally, so for the "lookup in table of authorities" scenario (and arguably corresponding prioritization in fuzzy matching), this would not be sufficient. But for non-named entities, the implementation is indeed close to what you are saying.

poesio commented 2 years ago

Hi everybody

I am not sure about the context of this discussion, but being asked, let me make some general points and then get to Amir's questions

coming to Amir's questions

  1. easier questions first: don't know about correct, but we would mark your example as follows:

[<the National Library of the Netherlands> ([Koninklijke Bibliotheek])]

  2. the other example depends on what we take the name to be. If the official name is

'University Library of the Vrije Universiteit Amsterdam'

    then indeed we would have

[]

    (this is how we did [] or [<the New York State Financial Authority>], for instance)

  3. if I remember correctly, we did have a couple of discussions about allowing discontinuous MINs, but it was so rare that we never did it, even though technically it would be possible. At any rate, the point is somewhat moot until we get a UA scorer able to handle discontinuous mentions (or discontinuous MINs)

            Massimo

On 07/01/2022 14:25, Amir Zeldes wrote:

Concretely about the bugs:

  • I wonder if we truly want to eliminate discontinuous MIN, since it is easy to derive the contiguous span, but this gives more information. The idea to insist on contiguity is only an attempt to emulate ARRAU (which, surprisingly, has discontinuous mention spans! But not MIN...)
  • Inclusion of named entity with named appos - I don't actually know the intended ARRAU policy here. What do you think @poesio? What is the correct MIN span of "the National Library of the Netherlands (Koninklijke Bibliotheek)"?
  • "I am not sure why Vrije Universiteit Amsterdam is included in MIN" - I misspoke, actually the implementation currently only rules out named possessors. Arguably "University Library of the Vrije Universiteit Amsterdam" is correct, as names are meant to be captured in their entirety, and this is its name (but again @poesio may be able to elucidate this)


amir-zeldes commented 2 years ago

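Thanks for the quick feedback! Just a few questions:

  • Can you explain why "the" is included in MIN for [<the National Library of the Netherlands> ([Koninklijke Bibliotheek])]?
  • What would you do with these cases:
    • We went to [Bob's Starbucks]
    • It was donated by [the Andy Warhol Institute]
    • The statue is exhibited at [Ripley's Believe It or Not!]
  • If you would like discontinuous MIN in the future, I'm happy to implement it in GUM (or rather, it seems we forgot to implement the opposite ;)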

martinpopel commented 2 years ago

allowing discontinuous MINs but it was so rare that we never did it

And what about all the coordinations where the second conjunct has some non-PROPN left-children? If MIN span has to be there and coordinations are annotated as a single mention, I actually prefer "my **cat** and your **dog**" over "my **cat and your dog**".

MIN is to allow for partial credit when the entire NP is used as a markable

Yes, but we can use head for the same purpose because the (UD dependency) head should be always included in the MIN span.

number of cases which require adding on to the traditional notion of head

So can we find MIN span using a deterministic algorithm given a mention and its (UD) dependency tree (and all other mentions in the sentence)?

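May I also ask what is the MIN span in the following mentions according to the GUM/ARRAU guidelines?

  • one of the Russell Group universities
  • President Carter
  • Mr. Simmons
  • vitamin C
  • $25 million
  • most analysts
  • most of the analysts
  • half the trust’s total of $268 million
  • April 9, 2007
  • a year later
  • more than $1.6 trillion
  • all of this
  • former chairman, Howard Weichern
  • 3 % to 10 %
  • a 50 % stake
  • the exploration side of the unit
  • the rate banks charge each other on overnight loans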

amir-zeldes commented 2 years ago

we can use head for the same purpose because the (UD dependency) head should be always included in the MIN span

The syntactic head is not always the MIN span:

And depending on whether we like quantity nouns maybe:

There are also inverted head constructions:

I am also a little doubtful we can get these 100% automatically, but with some rules and lists, and given that we know about nested entities (at least in corpora with full nested mentions, such as ARRAU or GUM) we may be able to get this fairly close to gold just using rules, and maybe a classifier to spot suspicious cases.

I am also curious for Massimo's response, since I am a newcomer to MIN annotation!

martinpopel commented 2 years ago

The syntactic head is not always the MIN span

I know about many cases in ARRAU where MIN span does not include the UD-syntactic head (my list of mentions is based on such research and our paper), but I was not fully sure if (in which cases)

  • this is a systematic difference between heads and MIN,
  • these are annotation errors,
  • the ARRAU annotators had different trees (with different heads) in their minds.

In the last case I would expect that MIN spans in UD would be changed according to the UD dependency trees because Massimo wrote "we have been consulting the UD guidelines trying to be as consistent as possible with those".

The reason why I expected MIN span should include the head in general is because Massimo wrote "MIN in ARRAU is definitely meant to be the head of the NP" and "there are a number of cases which require adding on to the traditional notion of head", which I (perhaps wrongly) interpreted as that the UD head is always included in MIN, but sometimes more words need to be added.

poesio commented 2 years ago

General point: the general idea for proper names was to mark proper names, not proper nouns; and to mark whatever the full proper name is, which need not consist of proper nouns only - so that 'the sun' and 'the pope' would also be considered proper names. (The original inspiration, from the time of GNOME, was Loebner's notion of proper name, but recently there has been more work on the distinction between proper name and proper noun, see e.g., the recent survey by Schluecker and Ackermann 2017.) The problem is that it's not always so obvious what counts as 'the full proper name', and we haven't found complete proposals, so we've been relying primarily on Quirk and Greenbaum's 'A University Grammar of English' (sections 4.23-4.30) and the Stanford Encyclopedia of Philosophy. Plus of course I'm not sure that all of my annotators always had the notion of proper name very clear ;-) Is there a UD-specific treatment?

Answers to specific questions inline

On 08/01/2022 18:38, Amir Zeldes wrote:

Thanks for the quick feedback! Just a few questions:

  • Can you explain why "the" is included in MIN for [<the National Library of the Netherlands> ([Koninklijke Bibliotheek])]?

Because 'the' is part of the proper name? (as in 'The United States of America'?)

  • What would you do with these cases:
    • We went to [Bob's Starbucks]

[<Bob's Starbucks>] if the possessive is part of the proper name, else [Bob's <Starbucks>]

    • It was donated by [the Andy Warhol Institute]

see above

    • The statue is exhibited at [Ripley's Believe It or Not!]

[<Ripley's Believe It or Not!>]

  • If you would like discontinuous MIN in the future, I'm happy to implement it in GUM (or rather, it seems we forgot to implement the opposite ;)

Great!


poesio commented 2 years ago

On 09/01/2022 00:07, Martin Popel wrote:

allowing discontinuous MINs but it was so rare that we never did it

And what about all the coordinations where the second conjunct has some non-PROPN left-children? If MIN span has to be there and coordinations are annotated as a single mention, I actually prefer "my cat and your dog" over "my cat and your dog".

Sorry I am not sure I follow you Martin  -

  - for NP coordinations (as opposed to NPs with coordinated head nouns), in the end we decided NOT to mark a MIN, because as far as I know the most common view is that coordinated NPs do not have a head. (NB: in GNOME we had used the conjunction itself as head)

    so in that example we would mark the heads of the coordinated NPs, but no head for the coordination:

[[[my] <cat>] and [[your] <dog>]]

PS: am I right that in UD, John would be the head of John and Mary?

MIN is to allow for partial credit when the entire NP is used as a markable

Yes, but we can use head for the same purpose because the (UD dependency) head should be always included in the MIN span.

Indeed, and in most cases in ARRAU we do use the head as MIN. And in most cases the UD head is also the ARRAU MIN - coordinated NPs should be the only exception, I think.

number of cases which require adding on to the traditional notion of head

So can we find MIN span using a deterministic algorithm given a mention and its (UD) dependency tree (and all other mentions in the sentence)?

Indeed - see e.g. the paper Nafise Moosavi, Michael Strube and myself wrote for ACL 2019

May I also ask what is the MIN span in the following mentions according to the GUM/ARRAU guidelines?

  • one of the Russell Group universities

[ of [the Russell Group ]]

even though we generally go for an NP rather than a DP analysis, in the case of partitive constructions we use the determiner

  • President Carter
  • Mr. Simmons
  • vitamin C

in all these cases, the entire proper name is also the MIN

  • $25 million

[<$> 25 million] (we have lots of these cases in ARRAU_WSJ)

  • most analysts

[most <analysts>] (as I said above, we generally go for the NP analysis ... )

  • most of the analysts

[<most> of [the <analysts>]]  .... which unfortunately means these two NPs have different mins.

  • half the trust’s total of $268 million

[<half> [[the <trust’s>] <total> of [<$> 268 million]]]

  • April 9, 2007

[<April 9, [<2007>] >]

  • a year later

[a <year>] later

  • more than $1.6 trillion

[more than <$>1.6 trillion]

  • all of this

[<all> of [<this>]]

  • former chairman, Howard Weichern

[former chairman, []]

  • 3 % to 10 %

[[3 <%>] to [10 <%>]] - percentages are actually one of the cases we find more problematic and have discussed several times over the years

  • a 50 % stake

[a 50 % <stake>]

  • the exploration side of the unit

[the exploration <side> of [the <unit>]]

  • the rate banks charge each other on overnight loans

[the <rate> banks charge each other on overnight loans] not sure what is the problem here?


poesio commented 2 years ago

On 09/01/2022 17:43, Amir Zeldes wrote:

we can use head for the same purpose because the (UD dependency) head should be always included in the MIN span

The syntactic head is not always the MIN span:

  • [The city of ]
  • [her majesty ]

Sorry, what would be the UD head here? In ARRAU we mark

[The city of []] [her majesty ]

And depending on whether we like quantity nouns maybe:

  • [A number of ]

We do treat some complex determiners as determiners but not all:

[between 4 and 5 ] [a number of ]

There are also inverted head constructions:

  • [A hell of a ]

good example, I would probably suggest treating "a hell of" as a determiner but I'm not sure



poesio commented 2 years ago

What would be the UD head in those cases?

On 09/01/2022 22:08, Martin Popel wrote:

The syntactic head is not always the MIN span

I know about many cases in ARRAU where MIN span does not include the UD-syntactic head (my list of mentions is based on such research and paper), but I was not fully sure if (in which cases)

  • this is a systematic difference between heads and MIN,
  • these are annotation errors,
  • the ARRAU annotators had different trees (with different heads) in their minds.

In the last case I would expect that MIN spans in UD would be changed according to the UD dependency trees because Massimo wrote "we have been consulting the UD guidelines trying to be as consistent as possible with those".

The reason why I expected MIN span should include the head in general is because Massimo wrote "MIN in ARRAU is definitely meant to be the head of the NP" and "there are a number of cases which require adding on to the traditional notion of head", which I (perhaps wrongly) interpreted as that the UD head is always included in MIN, but sometimes more words need to be added.


amir-zeldes commented 2 years ago

In UD I'm pretty sure "number" and "hell" would be the heads.

I think there are a number of differences between the choices above and what GUM currently does, but GUM's current MIN is probably buggy anyway, and we do not have a long tradition of this either. Some thoughts:

martinpopel commented 2 years ago

I actually prefer "my **cat** and your **dog**" over "my **cat and your dog**":

Sorry I am not sure I follow you Martin  -

This (public) discussion may be better viewed at GitHub https://github.com/UniversalDependencies/UD_English-GUM/issues/44 with Markdown formatting, so that MIN span in my example is marked with bold.

so in that example we would mark the heads of the coordinated NPs, but no head for the coordination: [[[my] <cat>] and [[your] <dog>]]

So coordination mention should have no MIN span in ARRAU? But it should always have embedded mentions of all the conjuncts, which have their MIN spans. So we could define (or automatically fill) coordination MIN span based on the union of its conjuncts' MIN spans (which will be discontinuous in many cases). OK, this makes sense.
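The idea of filling a coordination's MIN automatically from its conjuncts can be sketched in a few lines. This only illustrates the union proposal above; `coordination_min` and `is_contiguous` are hypothetical names, and token indices are 1-based positions in the coordination mention.

```python
def coordination_min(conjunct_mins):
    """Union of the conjuncts' MIN token indices, sorted."""
    return sorted(set().union(*conjunct_mins))

def is_contiguous(ids):
    """True if the sorted indices form one unbroken span."""
    return all(b - a == 1 for a, b in zip(ids, ids[1:]))

# "my cat and your dog": the embedded mentions have MIN "cat" (2) and "dog" (5)
union = coordination_min([{2}, {5}])
print(union)                 # [2, 5]
print(is_contiguous(union))  # False
```

A scorer or format that requires contiguous MIN spans would have to widen such a union to the range from its first to its last token.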

am I right that in UD, John would be the head of John and Mary?

Yes. The first conjunct is the head of coordinations in UD.

in most cases, in ARRAU we do use the head as min and in most cases the UD head is also the ARRAU min - coordinated NPs should be the only exception I think

There are many other mismatches. We wrote a paper about it.

In ARRAU 2.1 LDC, wsjarrau_0617_markable_level.xml, I see markable_280 with span="word_1468..word_1469" min_ids="word_1468..word_1469", which means in "false claims" both words of the mention are the MIN span. Why? I don't think "false claims" is a proper name.

Maybe we should ignore all the markable_level.xml files, as these don't include the coreference anyway. Our paper is therefore based on the coref_level.xml files instead. Unfortunately, these files are not consistent in ARRAU. For example, in wsjarrau_0617_coref_level.xml, there is no mention with span="word_1468..word_1469", but there is a mention with id="markable_873" span="word_1466..word_1478" min_ids="word_1469" min_words="claims" corresponding to "fraud and false claims in connection with a logistics-computer contract for the Army". This mention has a wrong span (the coordination is not "fraud and claims", but "charges and claims") and also a strange MIN span, given that "claims" is the second conjunct.

  • President Carter

the entire proper name is also the MIN

OK, but in "our president, Jimmy Carter", only "Jimmy Carter" would be the MIN, am I right? BTW: UD treats these two differently as well, but the head is always "president": "President Carter" is a flat structure (all words depend on the first word with deprel=flat, or flat:name in this case), but in "our president, Carter", "Carter" depends on "president" with deprel=appos. See the paragraph about "fixed/close" vs. "loose/wide" apposition.

[<$> 25 million] (we have lots of these cases in ARRAU_WSJ

Yes, but according to the ARRAU coref_level.xml files, the MIN span is "million" in "$ 25 million". And also I see "President Carter" and "Mr. Simmons 's". It seems these are not exceptions (rare annotation errors), but a systematic divergence between the coref_level.xml and markable_level.xml files. Why is ARRAU using two different guidelines for annotating mentions and MIN? (Perhaps, we should not discuss it here, in the GUM GitHub issues, but I am not aware of any ARRAU repo with public issues.)

[most <analysts>] [<most> of [the <analysts>]]   .... which unfortunately means these two NPs have different mins

They have a different head in UD (same as the MIN you marked), so that's OK from my point of view. That said, I've found "most analysts" in wsjarrau_1110_coref_level.xml. I could understand the reasons for such a decision (similarity to "most/majority of"), but again, I would like to see guidelines for such MIN annotation.

[<half> [[the <trust’s>] <total>of [<$> 268 million]]]

So the MIN of the outer mention is "half". In UD, "total" is the head of this phrase and "half" depends on it with deprel=det:predet. So this is another mismatch between ARRAU's MIN and UD's head.

[a <year>] later

But according to vpc_0766_coref_level.xml, it is "a year later" (i.e. "later" is part of the mention) and according to vpc_0766_markable_level.xml, there is no mention at all. I could not find UD guidelines on how to parse such a phrase. English-GUM has two occurrences of "year later" with a different head each time.

[more than <$>1.6 trillion]

But according to wsjarrau_0692_coref_level.xml, it is "more than $1.6 trillion" and according to wsjarrau_0692_markable_level.xml, the whole mention is MIN.

[<all> of [<this>]]

But according to coref_level.xml, "this" is the MIN of the whole (outer) mention.

[[3 <% >] to [10 <%>]]    percentages is actually one of the cases we find more problematic and have discussed several times over the years

But according to coref_level.xml, the second "%" is the MIN of the whole mention. In UD, the first conjunct is the head of coordination, so the first "%" is the head of the whole mention.

[a 50 % <stake>]

But according to coref_level.xml, "%" is the MIN.

[the exploration <side> of [the <unit>]]

But according to coref_level.xml, "exploration" is the MIN.

[the <rate> banks charge each other on overnight loans]

But according to coref_level.xml, "banks" is the MIN.

My conclusions

I am sorry for the longish list of mismatches, but I think it nicely illustrates my feeling that the MIN annotation guidelines are far from clear and that there are many disagreements even within ARRAU (among its annotators). There are many phenomena (most vs most of, dates, currencies, percentages, coordinations, nested coordinations,...) for which it is difficult to define head/MIN. I think it could be easier to just adopt the decisions in UD, instead of working on parallel guidelines. If there is a need for annotating more than a single head word, we could have a deterministic algorithm which adds other conjuncts (all deprel=conj children) and the whole proper name (this is slightly more difficult, depending on the exact definition of "proper name", but we can start with adding all deprel=flat children).

amir-zeldes commented 2 years ago

I agree this could be clearer, and essentially I would be happiest if we could more or less predict this from the trees plus POS tags (which indicate proper names) and nested entities. Taken together, they offer a lot of information, and give us the opportunity to make MINs follow what I take to be the three main principles:

If we want to build some exceptions on top of that for semantically weak heads (a number/couple/bunch/sort/kind of...) then that's fine too, and of course we could iteratively improve the algorithm.