clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

Schema: additional named entities annotations #41

Closed matyaskopp closed 3 years ago

matyaskopp commented 3 years ago

We annotate named entities with the Czech CNEC 2.0 taxonomy, which contains 46 nested classes: http://ufal.mff.cuni.cz/~strakova/cnec2.0/ne-type-hierarchy.pdf

For ParlaMint purposes, we flatten the hierarchy and merge its categories into the CoNLL-2003 categories PER, LOC, ORG and MISC. But we want to also keep our taxonomy.

Example of annotation (additional info is in XML comments):

https://github.com/clarin-eric/ParlaMint/blob/609ff4be010e2bbd27f69c936f21b36784a2f7e4/ParlaMint-CZ/ParlaMint-CZ_2014-09-10_ps2013-016-01-000-000.ana.xml#L1278-L1290

Taxonomy starts here:

https://github.com/matyaskopp/ParlaMint/blob/9b4948532863562bd5da52421de7f1b2b613ac61/ParlaMint-CZ/ParlaMint-CZ.ana.xml#L566

An idea for a schema modification

The example above distinguishes a forename from a surname, which can be very useful for follow-up annotation (e.g. linking named entities). It would be a pity to lose this kind of information.

TomazErjavec commented 3 years ago

But we want to also keep our taxonomy.

I agree, it would be a shame not to.

But, as you will also use the "standard" NER categories, first note that you should also have the standard NER taxonomy in the root teiHeader: https://github.com/clarin-eric/ParlaMint/blob/ba0b5015c6755ff67d9477938a63069083d0fe54/ParlaMint-SI/ParlaMint-SI.ana.xml#L267

Then, we use name/@type to mark the type of name (even though in our set-up it would make more sense to use @ana but we are following TEI here), so it would be more logical to use @subtype for your fine-grained NER categories. If this sounds acceptable, I will add it to the schema. (in which case no need to use the "ne:" prefix).

In your NER taxonomy you have IDs like "NER.cnec2.0.ms", while the gloss (i.e. catDesc/term) is often multiword and in the plural. In any user-facing app, a user looking at a named entity will see either a very cryptic label, "NER.cnec2.0.ms", or a long multiword one, "radio and tv stations". So maybe you should have short labels in catDesc/term, with the rest of the catDesc giving the long(ish) explanation. Your current labels, I would also argue, are wrong, as the label on a named entity identifies one radio or tv station, not their totality. Note also that the typology itself is not consistent: most labels are in the plural, but some are in the singular, e.g. "email address" or "periodical". But all this is just a suggestion, and your typology is probably set in stone anyway.

In conclusion:

matyaskopp commented 3 years ago

But, as you will also use the "standard" NER categories, first note that you should also have the standard NER taxonomy in the root teiHeader:

I am sorry, I was too hasty with deploying. It will be fixed in the next commit.

Then, we use name/@type to mark the type of name (even though in our set-up it would make more sense to use @ana but we are following TEI here), so it would be more logical to use @subtype for your fine-grained NER categories. If this sounds acceptable, I will add it to the schema. (in which case no need to use the "ne:" prefix).

No. It does not make sense to me. CNEC is a different taxonomy. So @subtype is not the best solution...

What I have done: I took the outermost name and looked at the first letter of its category:

But consider this sentence: Letiště Václava Havla Praha (you can try it here: http://lindat.mff.cuni.cz/services/nametag/)


There are only two CoNLL-2003 named entities here, ORG and LOC, and both categories appear in CNEC 2.0 as nested types. The type p for "Václava Havla", however, is a CNEC 2.0 super-type that does not show up among the CoNLL-2003 categories at all, because CoNLL does not allow nested types.

So I am strongly in favour of keeping these two taxonomies distinct.
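The flattening described above (take the outermost name, look at the first letter of its category) could be sketched roughly as follows. This is only an illustration: the letter-to-category table, the `Entity` class and the annotation of the example sentence are assumptions made for the sketch, not the actual ParlaMint-CZ mapping or real NameTag output.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# ASSUMPTION: illustrative first-letter mapping, not the one used in ParlaMint-CZ.
FIRST_LETTER_TO_CONLL = {"p": "PER", "g": "LOC", "i": "ORG"}


@dataclass
class Entity:
    text: str
    cnec_type: str  # e.g. "P", "pf", "ps", "gu"
    children: List["Entity"] = field(default_factory=list)


def to_conll(entity: Entity) -> str:
    """Map a CNEC type to a flat category using only its first letter."""
    first = entity.cnec_type[:1].lower()  # container "P" is treated like "p"
    return FIRST_LETTER_TO_CONLL.get(first, "MISC")


def flatten(outermost: List[Entity]) -> List[Tuple[str, str]]:
    """Label outermost entities only; nested entities get no flat label."""
    return [(e.text, to_conll(e)) for e in outermost]


# Hypothetical annotation of "Letiště Václava Havla Praha" (not real NameTag output).
havel = Entity("Václava Havla", "P",
               [Entity("Václava", "pf"), Entity("Havla", "ps")])
sentence = [Entity("Letiště Václava Havla", "if", [havel]), Entity("Praha", "gu")]

print(flatten(sentence))
```

In this sketch the nested personal name never receives a flat label, although it is still present in the CNEC tree. That is the information a single flat @type cannot carry.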

In your NER taxonomy you have IDs like "NER.cnec2.0.ms", while the gloss (i.e. catDesc/term) is often multiword and in the plural. In any user-facing app, a user looking at a named entity will see either a very cryptic label, "NER.cnec2.0.ms", or a long multiword one, "radio and tv stations". So maybe you should have short labels in catDesc/term, with the rest of the catDesc giving the long(ish) explanation. Your current labels, I would also argue, are wrong, as the label on a named entity identifies one radio or tv station, not their totality. Note also that the typology itself is not consistent: most labels are in the plural, but some are in the singular, e.g. "email address" or "periodical". But all this is just a suggestion, and your typology is probably set in stone anyway.

I used the prefix NER.cnec2.0 to make sure the IDs are unique. I think we can change the terms in the taxonomy, but I have to discuss it with the taxonomy authors.

TomazErjavec commented 3 years ago

CNEC is a different taxonomy. So @subtype is not the best solution...

OK, I have added name/@ana and name/name (in fact, I just made it recursive), also made name/@type optional, and, for good measure, added name/@subtype as well. Please test.
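As a rough illustration of what the recursive name element allows, the sketch below builds an outer name carrying the flat @type plus a fine-grained @ana, with inner names that carry only @ana. The helper `build_person_name`, the "ne:"-prefixed values and the sample person name are assumptions for the example. The TEI namespace and the token-level markup inside names are left out for brevity.

```python
import xml.etree.ElementTree as ET


def build_person_name(forename: str, surname: str) -> ET.Element:
    """Build a nested <name>: flat category on the outer element, CNEC types in @ana."""
    # ASSUMPTION: "ne:P", "ne:pf" and "ne:ps" are illustrative @ana values.
    outer = ET.Element("name", {"type": "PER", "ana": "ne:P"})
    first = ET.SubElement(outer, "name", {"ana": "ne:pf"})  # forename, no @type
    first.text = forename
    first.tail = " "
    last = ET.SubElement(outer, "name", {"ana": "ne:ps"})   # surname, no @type
    last.text = surname
    return outer


person = build_person_name("Václav", "Havel")
print(ET.tostring(person, encoding="unicode"))
```

Because @type is optional, the inner names need no flat category at all, so the forename/surname distinction survives without forcing nested entities into the flat taxonomy.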

But consider this sentence: Letiště Václava Havla Praha.

Yes, I can see the problem, thanks for the explanation!

TomazErjavec commented 3 years ago

Can this be closed now, with reference in #46 to here?

matyaskopp commented 3 years ago

Yes, all sub-issues are reported elsewhere.