UniversalDependencies / UD_Irish-IDT

Irish data

Review of "flat" required for Irish UDT v2.7 #9

Closed tlynn747 closed 3 years ago

tlynn747 commented 4 years ago

We have "over labelled" the Irish data with the flat label. Full review needed before next release.

In general, we applied the flat label to proper noun strings in an attempt to capture named entities. But after reviewing the UD guidelines on best practice and talking to other treebank developers, we're going to pare this back to using flat only with proper noun strings that don't have internal structure. We'll tag named entities in the MISC column instead, as in the English GUM treebank.

e.g. organisation names and titles will need to receive full syntactic analysis:

Acadamh an Bhaile Meánach (Ballymena Academy): (Acadamh = head, an = det, Bhaile Meánach = nmod)
Roinn Fiontar, Trádála agus Fostaíochta: (Roinn = head, Fiontar = nmod, Trádála + Fostaíochta = conjs, agus = cc)
an Pháirtí Náisiúnta (National Party): (an = det, Pháirtí = head, Náisiúnta = amod)
na Náisiún Aontaithe (the UN): (na = det, Náisiún = head, Aontaithe = amod)
An tAire Stáit (Minister for State): (An = det, tAire = head, Stáit = nmod)
Easpag na Gaillimh (Bishop of Galway): (Easpag = head, na = det, Gaillimh = nmod)

But organisations that don't have meaningful internal structure or identifiable head can remain as flat:

Fianna Fáil
Sinn Féin
Fine Gael

Suggesting to also keep multiword months, days, towns/cities as flat:

mí Mheán Fómhair (September): (mí = head, Mheán = nmod, Fómhair = flat)
Dé Luain (Monday)
Baile Meánach (Ballymena)

dan-zeman commented 4 years ago

https://github.com/UniversalDependencies/docs/issues/608

tlynn747 commented 3 years ago

Steps towards reviewing flat:

Add NamedEntity=Yes to all currently labelled flat and flat:name tokens first. (DONE)

Revise flat only (flat:name is uncontroversial).

Will start by over-generalising the reversal of flats to internal structure annotations, by catching a pattern of structures that typically represent flats: (using XPOS Noun to capture both PROPN and NOUN)

Fianna Fáil: Noun Noun
Co. Maigh Eo: Noun Noun Noun
Raidió na Gaeltachta: Noun DET Noun
Cláraitheoir na Cúirte Céadchéime: Noun DET Noun+
Comhdháil Náisiúnta na Gaeilge: Noun ADJ DET Noun
Éirí Amach, Afraic Theas: Noun ADV (though these should have been ADJ anyway)

Over-generalising these structures means that some valid flats will be affected. Review needed to find these.

Some not caught by the script are:

Tighe den Oireachtas (PP attachment which requires case analysis)
Choimisiún Uí Bhrolcháin
Roinn X agus Y, Roinn A, B agus Z (requires coordination analysis)

The remaining "flat" instances will help identify the potential valid flats, e.g. abbreviations (Ltd) and date strings (19 Márta 1958).
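A minimal sketch of how the pattern check described in this comment might look. It is not the actual script used for the treebank: the names `CANDIDATE_PATTERNS` and `matches_candidate_pattern` are hypothetical, and the function simply takes whatever tag strings the caller supplies (the comment mixes the XPOS value Noun with DET, ADJ and ADV, so which column each tag is read from is left to the caller).

```python
import re

# Tag sequences listed in the comment, written as regexes over a
# space-joined tag string. "Noun DET Noun+" is read here as one or more
# Noun tags after the determiner (an interpretation, not confirmed).
CANDIDATE_PATTERNS = [
    r"Noun Noun",              # Fianna Fáil
    r"Noun Noun Noun",         # Co. Maigh Eo
    r"Noun DET Noun( Noun)*",  # Raidió na Gaeltachta; Cláraitheoir na Cúirte Céadchéime
    r"Noun ADJ DET Noun",      # Comhdháil Náisiúnta na Gaeilge
    r"Noun ADV",               # Éirí Amach, Afraic Theas
]
_COMPILED = [re.compile(p) for p in CANDIDATE_PATTERNS]


def matches_candidate_pattern(tags):
    """Return True if the whole tag sequence matches one of the patterns."""
    joined = " ".join(tags)
    return any(p.fullmatch(joined) for p in _COMPILED)


if __name__ == "__main__":
    print(matches_candidate_pattern(["Noun", "DET", "Noun", "Noun"]))
    # "Prep" is a made-up tag standing in for a PP case like Tighe den Oireachtas.
    print(matches_candidate_pattern(["Noun", "Prep", "Noun"]))
```

Because the patterns over-generalise, any span the function accepts would still need the manual review mentioned above before its flat label is replaced.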

tlynn747 commented 3 years ago

Over-use of flat fully reviewed.

Have decided to remove flat label and fully annotate internal structure for Proper Noun strings with internal structure, such as the following:

Days (Dé Luain)
Months (Meán Fómhair)
Organisations (Raidió na Gaeltachta, Fianna Fáil, Chomhairle Cathrach Bhaile Átha Cliath, Roinn Gnóthaí Pobail, Tuaithe agus Gaeltachta)
Placenames (Baile na dTor, Co. Maigh Eo, Afraic Theas)

flat is now only used in the following contexts:

Titles (Naomh Gréagóir, Uasal Maskey, Dr. Colm)
Date strings (19 Márta 1958)

ALL of the above have been labelled as named entities: NamedEntity=Yes in MISC column
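For reference, the NamedEntity=Yes tagging step mentioned earlier in this thread (applied to tokens labelled flat and flat:name) could be sketched roughly as below. This is an illustration only, not the script actually used: `add_named_entity` is a hypothetical name, the code assumes standard 10-column tab-separated CoNLL-U token lines, and the sample rows are made up for the example rather than copied from the treebank.

```python
def add_named_entity(conllu_text, deprels=("flat", "flat:name")):
    """Append NamedEntity=Yes to MISC for tokens whose DEPREL is in deprels."""
    out = []
    for line in conllu_text.split("\n"):
        cols = line.split("\t")
        # Comments, blank lines and anything that is not a 10-column
        # token line are passed through unchanged.
        if line.startswith("#") or len(cols) != 10:
            out.append(line)
            continue
        if cols[7] in deprels:  # column 8 (index 7) is DEPREL
            misc = [] if cols[9] == "_" else cols[9].split("|")
            if "NamedEntity=Yes" not in misc:
                misc.append("NamedEntity=Yes")
            cols[9] = "|".join(misc)  # column 10 (index 9) is MISC
        out.append("\t".join(cols))
    return "\n".join(out)


# Illustrative rows only (annotation values are assumptions for the demo).
SAMPLE = "\n".join([
    "# text = Sinn Féin",
    "1\tSinn\tSinn\tPROPN\tNoun\t_\t0\troot\t_\t_",
    "2\tFéin\tFéin\tPROPN\tNoun\t_\t1\tflat\t_\tSpaceAfter=No",
])

if __name__ == "__main__":
    print(add_named_entity(SAMPLE))
```

Running the function a second time on its own output leaves the file unchanged, so it can be re-applied safely after later edits to the dependency labels. Note that it only touches the flat dependents; tagging the head of each named entity would need a separate pass.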