UniversalDependencies / UD_English-GUMReddit


Summary

Universal Dependencies syntax annotations from the Reddit portion of the GUM corpus (https://gucorpling.org/gum/)

Introduction

This repository only contains annotations, without the underlying textual data from Reddit.

In order to obtain the underlying text, you will need to use the script get_text.py. For more information on the underlying Reddit text see this page. For Universal Dependencies annotations of other genres from GUM, see https://github.com/UniversalDependencies/UD_English-GUM

GUM, the Georgetown University Multilayer corpus, is an open source collection of richly annotated texts from multiple text types. The corpus is collected and expanded by students as part of the curriculum in the course LING-4427 "Computational Corpus Linguistics" at Georgetown University. The selection of text types is meant to represent different communicative purposes, while coming from sources that are readily and openly available (usually Creative Commons licenses), so that new texts can be annotated and published with ease.

The dependencies in the corpus up to GUM version 5 were originally annotated using Stanford Typed Dependencies (de Marneffe & Manning 2013) and converted automatically to UD using DepEdit (https://gucorpling.org/depedit/). The rule-based conversion took into account gold annotations found in other layers of the GUM corpus (e.g. entity annotations), and the output has since been corrected manually in native UD. The original conversion script can be found in the GUM build bot code from version 5, available from the (non-UD) GUM repository. Documents from version 6 of GUM onwards were annotated directly in UD, and subsequent manual error correction to all GUM data has also been done directly using the UD guidelines. Enhanced dependencies were added semi-automatically from version 7.1 of the corpus. For more details see the corpus website.

Additional annotations in MISC

The MISC column contains morphological segmentation, Construction Grammar, entity, coreference, information status, Wikification and discourse annotations from the full GUM corpus, encoded using the annotations MSeg, Cxn, Entity, SplitAnte, Bridge and Discourse.

MSeg

Morphological segmentation in GUM is annotated semi-automatically in the MISC field MSeg attribute using the UniMorph lexical resource (Kirov et al. 2018), specifically using scripts based on the lexicon data here. Analyses are concatenative, using hyphens as separators, and are guaranteed to sum up to the string of each token with only hyphens added. Existing hyphens in a word form are retained and assumed to be meaningful. Analyses cover inflection, derivation and compounding.

Note that stems are retained in their orthographic forms (explanation does not become explain+ation), and 'etymological affixation' in loanwords is not necessarily analyzed (e.g. "ex" is not split off since the corresponding affixation process is no longer interpretable in English). For more information and updates to the segmentation guidelines see the GUM wiki.
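As an illustration of the format, here is a minimal Python sketch that reads an MSeg value back into its segments. The token "unfolding" and its analysis are hypothetical illustrations chosen for clarity, not drawn from the corpus, and the helper name is our own:

```python
# Minimal sketch: split an MSeg analysis from a MISC string back into morphs.
# The token "unfolding" below is a hypothetical illustration, not corpus data.
def mseg_segments(misc: str) -> list[str]:
    """Return the hyphen-separated segments of the MSeg attribute, if any."""
    for field in misc.split("|"):
        if field.startswith("MSeg="):
            return field[len("MSeg="):].split("-")
    return []

segments = mseg_segments("MSeg=un-fold-ing")
# Concatenating the segments restores the token form, since analyses
# only ever add hyphens to the original string.
assert "".join(segments) == "unfolding"
```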

Cxn

GUM uses the MISC field Cxn annotation to distinguish some complex constructions in a Construction Grammar (CxG) framework developed by collaborators from Dagstuhl Seminar 23191 for the integration of CxG analyses into UD trees. Construction labels are always attached to the highest token belonging to the necessary or defining elements of the construction, and carry hierarchical designations, such as the prefix Cxn=Conditional for all conditional constructions, or the more specific Cxn=UnspecifiedEpistemic-Reduced for reduced conditionals (the type seen in "if possible"). Currently covered constructions are listed in the GUM wiki.

Entity

The Entity annotation uses the CoNLL 2012 shared task bracketing format, which identifies potentially coreferring entities using round opening and closing brackets as well as a unique ID per entity, repeated across mentions. In the following example, actor Jared Padalecki appears in a single-token mention labeled (1-person-giv:act-cf2*-1-coref-Jared_Padalecki), indicating the entity type (person) combined with the unique ID of all mentions of Padalecki in the text (1-person). Because Padalecki is a named entity with a corresponding Wikipedia page, the Wikification identifier corresponding to his Wikipedia page is given after the last hyphen (Jared_Padalecki). We can also see an information status annotation (giv:act, indicating an aforementioned or 'given' entity, mentioned last no farther back than the previous sentence; see Dipper et al. 2007), a Centering Theory annotation (cf2*, indicating that he is the second most salient entity in the sentence moving forward, and that he was mentioned in the previous sentence, indicated by the *), as well as minimum token ID information indicating the head tokens for fuzzy matching (in this case 1, the first and only token in this span) and the coreference type coref, indicating lexical subsequent mention.

The labels for each part of the hyphen-separated annotation are given at the top of each document in a comment # global.Entity = GRP-etype-infstat-centering-minspan-link-identity, indicating that these annotations consist of the entity group ID (i.e. the coreference group), entity type, information status, Centering Theory annotation, minimal span of tokens for head matching, coreference link type, and named entity identity (if available).

Multi-token mentions receive opening brackets on the line in which they open, such as (97-person-giv:inact-cf4-1,3-coref-Jensen_Ackles, and a closing annotation 97) at the token on which they end. Multiple annotations are possible for one token, corresponding to nested entities, e.g. (175-time-giv:inact-cf5-1-coref)189)188) below corresponds to the single token and last token of the time entities "2015" and "April 2015" respectively, as well as the last token of the larger "the second campaign in the Always Keep Fighting series in April 2015".

# global.Entity = GRP-etype-infstat-centering-minspan-link-identity
...
1   For for ADP IN  _   4   case    4:case  Discourse=joint-sequence_m:104->98:2:lex-indph-954-955
2   the the DET DT  Definite=Def|PronType=Art   4   det 4:det   Bridge=173<188|Entity=(188-event-acc:inf-cf6-3,6,8-sgl
3   second  second  ADJ JJ  Degree=Pos|NumType=Ord  4   amod    4:amod  _
4   campaign    campaign    NOUN    NN  Number=Sing 16  obl 16:obl:for  _
5   in  in  ADP IN  _   10  case    10:case _
6   the the DET DT  Definite=Def|PronType=Art   10  det 10:det  Entity=(173-abstract-giv:inact-cf3-2,4,5-coref
7   Always  Always  ADV NNP Number=Sing 8   advmod  8:advmod    XML=<hi rend:::"italic">
8   Keep    Keep    PROPN   NNP Number=Sing 10  compound    10:compound _
9   Fighting    Fighting    PROPN   NNP Number=Sing 8   xcomp   8:xcomp XML=</hi>
10  series  series  NOUN    NN  Number=Sing 4   nmod    4:nmod:in   Entity=173)
11  in  in  ADP IN  _   12  case    12:case _
12  April   April   PROPN   NNP Number=Sing 4   nmod    4:nmod:in   Entity=(189-time-new-cf10-1-sgl|XML=<date when:::"2015-04">
13  2015    2015    NUM CD  NumForm=Digit|NumType=Card  12  nmod:tmod   12:nmod:tmod    Entity=(175-time-giv:inact-cf5-1-coref)189)188)|SpaceAfter=No|XML=</date>
14  ,   ,   PUNCT   ,   _   4   punct   4:punct _
15  Padalecki   Padalecki   PROPN   NNP Number=Sing 16  nsubj   16:nsubj    Entity=(1-person-giv:act-cf2*-1-coref-Jared_Padalecki)
16  partnered   partner VERB    VBD Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin|Voice=Act 0   root    0:root  _
17  with    with    ADP IN  _   18  case    18:case _
18  co-star co-star NOUN    NN  Number=Sing 16  obl 16:obl:with Entity=(97-person-giv:inact-cf4-1,3-coref-Jensen_Ackles
19  Jensen  Jensen  PROPN   NNP Number=Sing 18  appos   18:appos    XML=<ref target:::"https://en.wikipedia.org/wiki/Jensen_Ackles">
20  Ackles  Ackles  PROPN   NNP Number=Sing 19  flat    19:flat Entity=97)|XML=</ref>
21  to  to  PART    TO  _   22  mark    22:mark Discourse=purpose-goal:105->104:0:syn-nfn-963
22  release release VERB    VB  VerbForm=Inf|Voice=Act  16  advcl   16:advcl:to _
23  a   a   DET DT  Definite=Ind|PronType=Art   24  det 24:det  Entity=(190-object-new-cf7-2-coref
24  shirt   shirt   NOUN    NN  Number=Sing 22  obj 22:obj  Entity=190)
25  featuring   feature VERB    VBG VerbForm=Ger|Voice=Act  24  acl 24:acl  Discourse=elaboration-attribute:106->105:0:syn-mdf-966+syn-nmn-967
26  both    both    DET DT  PronType=Tot    25  obj 25:obj  Entity=(191-object-new-cf9-1-sgl
27  of  of  ADP IN  _   29  case    29:case _
28  their   their   PRON    PRP$    Case=Gen|Number=Plur|Person=3|Poss=Yes|PronType=Prs 29  nmod:poss   29:nmod:poss    Entity=(192-person-acc:aggr-cf1-1-coref)|SplitAnte=1<192,97<192
29  faces   face    NOUN    NNS Number=Plur 26  nmod    26:nmod:of  Entity=191)|SpaceAfter=No
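The bracketing scheme above can be decoded mechanically. The following Python sketch recovers (group, start, end) mention spans from Entity values like those shown; the function name and the simple (token_id, entity_value) input format are our own for illustration, not part of any GUM tooling:

```python
import re

# Sketch: recover (group_id, start_token, end_token) mention spans from
# Entity= values in the MISC column. Opening brackets look like "(GRP-...",
# explicit closers like "GRP)", and a bare ")" closes the innermost open mention.
def mention_spans(rows):
    token_re = re.compile(r'\((\d+)[^()]*|(\d+)\)|\)')
    stack, spans = [], []
    for tok_id, entity in rows:
        for m in token_re.finditer(entity):
            if m.group(1):               # "(GRP-..." opens a mention
                stack.append((m.group(1), tok_id))
            elif m.group(2):             # "GRP)" closes that group's mention
                for i in range(len(stack) - 1, -1, -1):
                    if stack[i][0] == m.group(2):
                        grp, start = stack.pop(i)
                        spans.append((grp, start, tok_id))
                        break
            else:                        # bare ")" closes the innermost open mention
                grp, start = stack.pop()
                spans.append((grp, start, tok_id))
    return spans

# Tokens 12-13 from the example: "April" opens 189, "2015" opens and closes
# 175, then closes 189. (188 was opened at token 2, outside this snippet,
# so its closer finds no match here and is skipped.)
rows = [(12, "(189-time-new-cf10-1-sgl"),
        (13, "(175-time-giv:inact-cf5-1-coref)189)188)")]
```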

In addition, a list of the globally most salient entities in each document can be found in the metadata at the beginning of the document, for example:

# meta::salientEntities = 1, 5, 6, 7, 8, 12, 98, 173, 180, 181, 182, 183, 184

Here the value 1 stands for Padalecki, as in the annotations above.

Possible values for the other annotations mentioned above are:

For equivalent Wikidata identifiers for each Wikipedia article title, see this file.

Split antecedent and bridging

The annotations SplitAnte and Bridge mark non-strict identity anaphora (see the Universal Anaphora project for more details). For example, at token 28 in the example, the pronoun "their" refers back to two non-adjacent entities, requiring a split antecedent annotation. The value SplitAnte=1<192,97<192 indicates that 192-person (the pronoun "their") refers back to two previous Entity annotations, with pointers separated by a comma: 1 (1-person-...Jared_Padalecki) and 97 (97-person-...Jensen_Ackles).

Bridging anaphora is annotated when an entity has not been mentioned before, but is resolvable in context by way of a different entity: for example, token 2 has the annotation Bridge=173<188, which indicates that although 188-event ("the second campaign...") has not been mentioned before, its identity is mediated by the previous mention of another entity, 173-abstract (the project "Always Keep Fighting", mentioned earlier in the document, to which the campaign event belongs). In other words, readers can infer that "the second campaign" is part of the already introduced larger project, which also had a first campaign. This inference also leads to the information status label acc:inf, accessible-inferable.
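Both annotations share the same pointer notation and can be decoded identically; a minimal Python sketch (the function name is illustrative):

```python
# Sketch: decode the pointer notation shared by SplitAnte= and Bridge=,
# which lists "antecedent<anaphor" pairs separated by commas.
def decode_pointers(value: str) -> list[tuple[str, str]]:
    return [tuple(pair.split("<")) for pair in value.split(",")]

# "their" (192) has the split antecedents 1 (Padalecki) and 97 (Ackles):
assert decode_pointers("1<192,97<192") == [("1", "192"), ("97", "192")]
# Bridging uses the same shape: 188 is mediated by 173:
assert decode_pointers("173<188") == [("173", "188")]
```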

Enhanced RST discourse trees and signals

Discourse annotations are given in eRST dependencies following the conversion from RST constituent trees suggested by Li et al. (2014); for the original RST constituent parses of GUM see the source repo. At the beginning of each Elementary Discourse Unit (EDU), an annotation Discourse gives the discourse function of the unit beginning with that token, followed by a colon, the ID of the current unit, and an arrow pointing to the ID of the parent unit in the discourse parse. For instance, Discourse=purpose-goal:105->104:0:syn-nfn-963 at token 21 in the example above means that this token begins discourse unit 105, which functions as a purpose-goal to unit 104, which begins at token 1 in this sentence ("Padalecki partnered with co-star Jensen Ackles --purpose-goal-> to release a shirt...").

The third number :0 indicates that the attachment has a depth of 0, with no intervening span in the original RST constituent tree (this information allows deterministic reconstruction of the RST constituent discourse tree from the conllu file). The final part of the Discourse annotation indicates categorized signals which correspond to the discourse relation in question, as defined by eRST: in this case, syn-nfn-963 indicates a syntactic signal (syn) of the non-finite clause subtype (nfn), since the purpose relation is signaled by the use of an infinitive, a typical strategy in English. The index 963 refers to the position of the signal, in this case token number 963 in the document (excluding empty nodes), the infinitive "to" (token 21 in the sentence). Multiple signals are separated by +. See below for the inventory of signal types.

Additionally, note that multiple discourse relations can sometimes occur on the same line, since eRST allows multiple concurrent and tree-breaking relations to be identified. In such cases the multiple relation entries are separated by ; and ordered such that the primary relation (which indicates RST nuclearity and is guaranteed to be projective in the discourse tree) is serialized first, followed by any non-projective secondary relations. The unique ROOT node of the discourse tree has no arrow notation, e.g. Discourse=ROOT:2:0 means that this token begins unit 2, which is the Central Discourse Unit (or discourse root) of the current document. Although it is easiest to recover RST constituent trees from the source repo, it is also possible to generate them automatically from the dependencies with depth information, using the scripts in the rst2dep repo.
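Under the format described above, a Discourse value can be unpacked programmatically. This Python sketch (names are our own) handles the arrow notation, the depth field, + separated signals, and ; separated multiple relations:

```python
# Sketch: unpack a Discourse= value of the form relation:unit->parent:depth:signals,
# with multiple relations separated by ";" and multiple signals joined by "+".
def parse_discourse(value: str):
    relations = []
    for entry in value.split(";"):
        parts = entry.split(":")
        relname, unit = parts[0], parts[1]
        parent = None
        if "->" in unit:                     # ROOT units carry no arrow
            unit, parent = unit.split("->")
        relations.append({
            "relation": relname,
            "unit": int(unit),
            "parent": int(parent) if parent is not None else None,
            "depth": int(parts[2]) if len(parts) > 2 else None,
            "signals": parts[3].split("+") if len(parts) > 3 else [],
        })
    return relations

rel, = parse_discourse("purpose-goal:105->104:0:syn-nfn-963")
assert rel["parent"] == 104 and rel["signals"] == ["syn-nfn-963"]
# The document root has no arrow and hence no parent:
assert parse_discourse("ROOT:2:0")[0]["parent"] is None
```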

Discourse relations in GUM are defined based on the effect that W (a writer/speaker) has on R (a reader/hearer) by modifying a Nucleus discourse unit (N) with another discourse unit (a Satellite, S, or another N). Discourse relation units can precede their nuclei (satellite-nucleus, or SN relation), follow them (NS), or be coordinated with each other (NN or multinuclear relations). Relations are classified hierarchically into 15 major classes and include:

Relation signals fall into nine major classes, most with several subtypes each, and include:

XML

Markup from the original XML annotations using TEI tags is available in the XML MISC annotation, which indicates which XML tags, if any, were opened or closed before or after the current token, and in what order. In tokens 7-9 in the example above, the XML annotations indicate that the words "Always Keep Fighting" were originally italicized using the tag pair <hi rend="italic">...</hi>, which opens at token 7 and closes after token 9. To avoid confusion with the = sign used in MISC annotations, = signs inside XML values are escaped and represented as :::.

7   Always  Always  ADV NNP Number=Sing 8   advmod  8:advmod    XML=<hi rend:::"italic">
8   Keep    Keep    PROPN   NNP Number=Sing 10  compound    10:compound _
9   Fighting    Fighting    PROPN   NNP Number=Sing 8   xcomp   8:xcomp XML=</hi>
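Since the escaping is a simple substitution, the original markup can be restored trivially; a minimal Python sketch (the function name is illustrative):

```python
# Sketch: restore original TEI markup from an XML= MISC value, where "="
# inside the annotation is escaped as ":::".
def unescape_xml(value: str) -> str:
    return value.replace(":::", "=")

assert unescape_xml('<hi rend:::"italic">') == '<hi rend="italic">'
```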

XML block tags spanning whole sentences (i.e. not beginning or ending mid-sentence), such as paragraphs (<p>) or headings (<head>), are instead represented in a # newpar_block comment under the standard UD # newpar comment. A block may also feature nested tags, for example:

# newpar
# newpar_block = list type:::"unordered" (10 s) | item (4 s)

This comment indicates the opening of a <list type="unordered"> block element, which spans 10 sentences ((10 s)). However, the list begins with a nested block, a list item (i.e. a bullet point), which spans 4 sentences, as indicated after the pipe separator. For documentation of XML elements in GUM, please see the GUM wiki.
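A # newpar_block comment of this shape can be parsed into (element, sentence count) pairs; a Python sketch under the format assumptions above (the function name is our own):

```python
import re

# Sketch: parse a "# newpar_block" comment into (element, sentence_count)
# pairs; nested blocks are separated by "|" and spans appear as "(N s)".
def parse_newpar_block(comment: str):
    value = comment.split("=", 1)[1] if "=" in comment else comment
    blocks = []
    for part in value.split("|"):
        m = re.match(r'\s*(.+?)\s*\((\d+) s\)\s*$', part)
        if m:
            # Un-escape ":::" back to "=" inside element attributes
            blocks.append((m.group(1).replace(":::", "="), int(m.group(2))))
    return blocks

blocks = parse_newpar_block('# newpar_block = list type:::"unordered" (10 s) | item (4 s)')
assert blocks == [('list type="unordered"', 10), ('item', 4)]
```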

More information and additional annotation layers can also be found in the GUM source repo.

Metadata

Document metadata is given at the beginning of each new document in key-value pair comments beginning with the prefix meta::, as in:

# newdoc id = GUM_bio_padalecki
# global.Entity = GRP-etype-infstat-centering-minspan-link-identity
# meta::author = Wikipedia, The Free Encyclopedia
# meta::dateCollected = 2019-09-10
# meta::dateCreated = 2004-08-14
# meta::dateModified = 2019-09-11
# meta::genre = bio
# meta::salientEntities = 1, 5, 6, 7, 8, 12, 98, 173, 180, 181, 182, 183, 184
# meta::sourceURL = https://en.wikipedia.org/wiki/Jared_Padalecki
# meta::speakerCount = 0
# meta::summary = Jared Padalecki is an award winning American actor who gained prominence in the series Gilmore Girls, best known for playing the role of Sam Winchester in the TV series Supernatural, and for his active role in campaigns to support people struggling with depression, addiction, suicide and self-harm.
# meta::title = Jared Padalecki
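Reading these comments is straightforward; a minimal Python sketch (the helper name is our own) that collects the meta:: pairs from a document header:

```python
# Sketch: collect meta:: key-value comments from the head of a document.
# Input is a list of comment lines in the format shown above.
def read_metadata(lines):
    meta = {}
    for line in lines:
        if line.startswith("# meta::"):
            key, _, value = line[len("# meta::"):].partition(" = ")
            meta[key] = value
    return meta

header = [
    "# newdoc id = GUM_bio_padalecki",
    "# meta::genre = bio",
    "# meta::speakerCount = 0",
]
assert read_metadata(header) == {"genre": "bio", "speakerCount": "0"}
```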

Document summaries are included in the metadata summary annotation and follow strict guidelines described here. For the test set, a second human-written summary, called summary2, is also available.

Additionally, sentences carry some sentence-level annotations in CoNLL-U comments, such as sentence types in s_type (declarative, imperative, wh-question, fragment, etc.), as well as sentence transition types based on Centering Theory and sentence prominence levels based on graph proximity to the discourse parse root. For example, the following fragment sentence (frag) establishes a new backwards-looking Center (establishment) and is a level-2 sentence (s_prominence = 2), i.e. its discourse nesting level is one deeper than that of the sentence containing the level-1 Central Discourse Unit of the entire text.

# s_prominence = 2
# s_type = frag
# transition = establishment
# text = Jared Padalecki
1   Jared   Jared   PROPN   NNP Number=Sing 0   root    0:root  MSeg=Jared
2   Padalecki   Padalecki   PROPN   NNP Number=Sing 1   flat    1:flat  _

Documents and splits

The training, development and test sets contain complete, contiguous documents, balanced for genre. The test and dev sets contain similar amounts of data, usually around 1,800 tokens per genre in each, and the rest is assigned to training. For the exact file lists in each split see:

https://github.com/UniversalDependencies/UD_English-GUM/tree/master/not-to-release/file-lists

Acknowledgments

GUM annotation team (so far - thanks for participating!)

Adrienne Isaac, Akitaka Yamada, Alex Giorgioni, Alexandra Berends, Alexandra Slome, Amani Aloufi, Amber Hall, Amelia Becker, Andrea Price, Andrew O'Brien, Ángeles Ortega Luque, Aniya Harris, Anna Prince, Anna Runova, Anne Butler, Arianna Janoff, Aryaman Arora, Ayan Mandal, Aysenur Sagdic, Bertille Baron, Bradford Salen, Brandon Tullock, Brent Laing, Caitlyn Pineault, Calvin Engstrom, Candice Penelton, Carlotta Hübener, Caroline Gish, Charlie Dees, Chenyue Guo, Chloe Evered, Cindy Luo, Colleen Diamond, Connor O'Dwyer, Cristina Lopez, Cynthia Li, Dan DeGenaro, Dan Simonson, Derek Reagan, Devika Tiwari, Didem Ikizoglu, Edwin Ko, Eliza Rice, Emile Zahr, Emily Pace, Emma Manning, Emma Rafkin, Ethan Beaman, Felipe De Jesus, Han Bu, Hana Altalhi, Hang Jiang, Hannah Wingett, Hanwool Choe, Hassan Munshi, Helen Dominic, Ho Fai Cheng, Hortensia Gutierrez, Jakob Prange, James Maguire, Janine Karo, Jehan al-Mahmoud, Jemm Excelle Dela Cruz, Jess Godes, Jessica Cusi, Jessica Kotfila, Jingni Wu, Joaquin Gris Roca, John Chi, Jongbong Lee, Juliet May, Jungyoon Koh, Katarina Starcevic, Katelyn Carroll, Katelyn MacDougald, Katherine Vadella, Khalid Alharbi, Kristen Cook, Lara Bryfonski, Lauren Levine, Leah Northington, Lindley Winchester, Linxi Zhang, Lucia Donatelli, Luke Gessler, Mackenzie Gong, Margaret Anne Rowe, Margaret Borowczyk, Maria Laura Zalazar, Maria Stoianova, Mariko Uno, Mary Henderson, Maya Barzilai, Md. Jahurul Islam, Michael Kranzlein, Michaela Harrington, Mingyeong Choi, Minnie Annan, Mitchell Abrams, Mohammad Ali Yektaie, Naomee-Minh Nguyen, Negar Siyari, Nicholas Mararac, Nicholas Workman, Nicole Steinberg, Nitin Venkateswaran, Parker DiPaolo, Phoebe Fisher, Rachel Kerr, Rachel Thorson, Rebecca Childress, Rebecca Farkas, Riley Breslin Amalfitano, Rima Elabdali, Robert Maloney, Ruizhong Li, Ryan Mannion, Ryan Murphy, Sakol Suethanapornkul, Sarah Bellavance, Sarah Carlson, Sasha Slone, Saurav Goswami, Sean Macavaney, Sean Simpson, Seyma Toker, Shane Quinn, Shannon Mooney, Shelby Lake, Shira Wein, Sichang Tu, Siddharth Singh, Siona Ely, Siyao Peng, Siyu Liang, Stephanie Kramer, Sylvia Sierra, Talal Alharbi, Tatsuya Aoyama, Tess Feyen, Timothy Ingrassia, Trevor Adriaanse, Ulie Xu, Wai Ching Leung, Wenxi Yang, Wesley Scivetti, Xiaopei Wu, Xiulin Yang, Yang Liu, Yi-Ju Lin, Yifu Mu, Yilun Zhu, Yingzhu Chen, Yiran Xu, Young-A Son, Yu-Tzu Chang, Yuhang Hu, Yunjung Ku, Yushi Zhao, Zhijie Song, Zhuosi Luo, Zhuxin Wang, Amir Zeldes

... and other annotators who wish to remain anonymous!

References

To cite the Reddit subset of GUM in particular, please use this citation:

@InProceedings{BehzadZeldes2020,
  author    = {Shabnam Behzad and Amir Zeldes},
  title     = {A Cross-Genre Ensemble Approach to Robust {R}eddit Part of Speech Tagging},
  booktitle = {Proceedings of the 12th Web as Corpus Workshop (WAC-XII)},
  pages     = {50--56},
  year      = {2020},
}

As a scholarly citation for the GUM corpus as a whole, please use this article (note that this paper predates the inclusion of Reddit data in GUM):

@Article{Zeldes2017,
  author    = {Amir Zeldes},
  title     = {The {GUM} Corpus: Creating Multilayer Resources in the Classroom},
  journal   = {Language Resources and Evaluation},
  year      = {2017},
  volume    = {51},
  number    = {3},
  pages     = {581--612},
  doi       = {http://dx.doi.org/10.1007/s10579-016-9343-x}
}

Changelog

=== Machine-readable metadata (DO NOT REMOVE!) ================================
Data available since: UD v2.6
License: CC BY 4.0
Includes text: no
Genre: blog social
Lemmas: manual native
UPOS: converted from manual
XPOS: manual native
Features: converted from manual
Relations: manual native
Contributors: Peng, Siyao;Zeldes, Amir
Contributing: elsewhere
Contact: amir.zeldes@georgetown.edu
===============================================================================