Universal Dependencies syntax annotations from the Reddit portion of the GUM corpus (https://gucorpling.org/gum/)
This repository only contains annotations, without the underlying textual data from Reddit
In order to obtain the underlying text, you will need to use the script get_text.py. For more information on the underlying Reddit text see this page. For Universal Dependencies annotations of other genres from GUM, see https://github.com/UniversalDependencies/UD_English-GUM
GUM, the Georgetown University Multilayer corpus, is an open source collection of richly annotated texts from multiple text types. The corpus is collected and expanded by students as part of the curriculum in the course LING-4427 "Computational Corpus Linguistics" at Georgetown University. The selection of text types is meant to represent different communicative purposes, while coming from sources that are readily and openly available (usually Creative Commons licenses), so that new texts can be annotated and published with ease.
The dependencies in the corpus up to GUM version 5 were originally annotated using Stanford Typed Dependencies (de Marneffe & Manning 2013) and converted automatically to UD using DepEdit (https://gucorpling.org/depedit/). The rule-based conversion took into account gold annotations found in other annotation layers of the GUM corpus (e.g. entity annotations), and has since been corrected manually in native UD. The original conversion script can be found in the GUM build bot code from version 5, available from the (non-UD) GUM repository. Documents from version 6 of GUM onwards were annotated directly in UD, and subsequent manual error correction to all GUM data has also been done directly using the UD guidelines. Enhanced dependencies were added semi-automatically from version 7.1 of the corpus. For more details see the corpus website.
The MISC column contains morphological segmentation, Construction Grammar, entity, coreference, information status, Wikification and discourse annotations from the full GUM corpus, encoded using the annotations MSeg, Cxn, Entity, SplitAnte, Bridge and Discourse.
Morphological segmentation in GUM is annotated semi-automatically in the MISC attribute MSeg using the Unimorph lexical resource (Kirov et al. 2018), specifically using scripts based on the lexicon data here. Analyses are concatenative, using hyphens as separators, and are guaranteed to sum up to the string of each token with only hyphens added. Existing hyphens in a word form are retained and assumed to be meaningful. Analyses cover inflection, derivation and compounding. For example:
Note that stems are retained in their orthographic forms (explanation does not become explain+ation), and 'etymological affixation' in loanwords is not necessarily analyzed (e.g. "ex" is not split off since the corresponding affixation process is no longer interpretable in English). For more information and updates to the segmentation guidelines see the GUM wiki.
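The guarantee that an MSeg value equals the token string with only hyphens added can be checked mechanically. The following is a minimal Python sketch, not part of the GUM tooling: the function name mseg_is_consistent is hypothetical, and the segmentations passed to it in the usage note are illustrative strings rather than analyses taken from the corpus.

```python
def mseg_is_consistent(form: str, mseg: str) -> bool:
    """Return True if `mseg` is `form` with zero or more hyphens inserted.

    Hyphens already present in `form` are matched like any other character,
    so they are retained rather than treated as added separators.
    """
    i = 0  # position of the next unmatched character in `form`
    for ch in mseg:
        if i < len(form) and ch == form[i]:
            i += 1          # character belongs to the original token
        elif ch == "-":
            continue        # an added segment separator
        else:
            return False    # segmentation changed or added a non-hyphen character
    return i == len(form)   # every character of the token must be accounted for
```

For instance, mseg_is_consistent("Jared", "Jared") holds for the unsegmented value MSeg=Jared shown in this document, while a hypothetical analysis that rewrites the stem, such as "explain-ation" for the form "explanation", is rejected.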
GUM uses the MISC annotation Cxn to distinguish some complex constructions in a Construction Grammar (CxG) framework developed by collaborators from Dagstuhl Seminar 23191 for the integration of CxG analyses into UD trees. Construction labels are always attached to the highest token belonging to the necessary or defining elements of the construction, and carry hierarchical designations, such as a prefix Cxn=Conditional for all conditional constructions, but a more specific Cxn=UnspecifiedEpistemic-Reduced for reduced conditionals (the type seen in "if possible"). Currently covered constructions are listed in the GUM wiki.
The Entity annotation uses the CoNLL 2012 shared task bracketing format, which identifies potentially coreferring entities using round opening and closing brackets as well as a unique ID per entity, repeated across mentions. In the following example, actor Jared Padalecki appears in a single token mention, labeled (1-person-giv:act-cf2*-1-coref-Jared_Padalecki), indicating the entity type (person) combined with the unique ID of all mentions of Padalecki in the text (1-person). Because Padalecki is a named entity with a corresponding Wikipedia page, the Wikification identifier corresponding to his Wikipedia page is given after the last hyphen (1-person-Jared_Padalecki). We can also see an information status annotation (giv:act, indicating an aforementioned or 'given' entity, actively mentioned last no farther than the previous sentences; see Dipper et al. 2007), a Centering Theory annotation (cf2*, indicating he is the second most central salient entity in the sentence moving forward, and that he was mentioned in the previous sentence, indicated by the *), as well as minimum token ID information indicating the head tokens for fuzzy matching (in this case 1, the first and only token in this span) and the coreference type coref, indicating lexical subsequent mention. The labels for each part of the hyphen-separated annotation are given at the top of each document in a comment # global.Entity = GRP-etype-infstat-centering-minspan-link-identity, indicating that these annotations consist of the entity group ID (i.e. the coreference group), entity type, information status, Centering Theory annotation, minimal span of tokens for head matching, the coreference link type, and named entity identity (if available).
Multi-token mentions receive opening brackets on the line in which they open, such as (97-person-giv:inact-cf4-1,3-coref-Jensen_Ackles, and a closing annotation 97) at the token on which they end. Multiple annotations are possible for one token, corresponding to nested entities, e.g. (175-time-giv:inact-cf5-1-coref)189)188) below corresponds to the single token and last token of the time entities "2015" and "April 2015" respectively, as well as the last token of the larger "the second campaign in the Always Keep Fighting series in April 2015".
# global.Entity = GRP-etype-infstat-centering-minspan-link-identity
...
1 For for ADP IN _ 4 case 4:case Discourse=joint-sequence_m:104->98:2:lex-indph-954-955
2 the the DET DT Definite=Def|PronType=Art 4 det 4:det Bridge=173<188|Entity=(188-event-acc:inf-cf6-3,6,8-sgl
3 second second ADJ JJ Degree=Pos|NumType=Ord 4 amod 4:amod _
4 campaign campaign NOUN NN Number=Sing 16 obl 16:obl:for _
5 in in ADP IN _ 10 case 10:case _
6 the the DET DT Definite=Def|PronType=Art 10 det 10:det Entity=(173-abstract-giv:inact-cf3-2,4,5-coref
7 Always Always ADV NNP Number=Sing 8 advmod 8:advmod XML=<hi rend:::"italic">
8 Keep Keep PROPN NNP Number=Sing 10 compound 10:compound _
9 Fighting Fighting PROPN NNP Number=Sing 8 xcomp 8:xcomp XML=</hi>
10 series series NOUN NN Number=Sing 4 nmod 4:nmod:in Entity=173)
11 in in ADP IN _ 12 case 12:case _
12 April April PROPN NNP Number=Sing 4 nmod 4:nmod:in Entity=(189-time-new-cf10-1-sgl|XML=<date when:::"2015-04">
13 2015 2015 NUM CD NumForm=Digit|NumType=Card 12 nmod:tmod 12:nmod:tmod Entity=(175-time-giv:inact-cf5-1-coref)189)188)|SpaceAfter=No|XML=</date>
14 , , PUNCT , _ 4 punct 4:punct _
15 Padalecki Padalecki PROPN NNP Number=Sing 16 nsubj 16:nsubj Entity=(1-person-giv:act-cf2*-1-coref-Jared_Padalecki)
16 partnered partner VERB VBD Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin|Voice=Act 0 root 0:root _
17 with with ADP IN _ 18 case 18:case _
18 co-star co-star NOUN NN Number=Sing 16 obl 16:obl:with Entity=(97-person-giv:inact-cf4-1,3-coref-Jensen_Ackles
19 Jensen Jensen PROPN NNP Number=Sing 18 appos 18:appos XML=<ref target:::"https://en.wikipedia.org/wiki/Jensen_Ackles">
20 Ackles Ackles PROPN NNP Number=Sing 19 flat 19:flat Entity=97)|XML=</ref>
21 to to PART TO _ 22 mark 22:mark Discourse=purpose-goal:105->104:0:syn-nfn-963
22 release release VERB VB VerbForm=Inf|Voice=Act 16 advcl 16:advcl:to _
23 a a DET DT Definite=Ind|PronType=Art 24 det 24:det Entity=(190-object-new-cf7-2-coref
24 shirt shirt NOUN NN Number=Sing 22 obj 22:obj Entity=190)
25 featuring feature VERB VBG VerbForm=Ger|Voice=Act 24 acl 24:acl Discourse=elaboration-attribute:106->105:0:syn-mdf-966+syn-nmn-967
26 both both DET DT PronType=Tot 25 obj 25:obj Entity=(191-object-new-cf9-1-sgl
27 of of ADP IN _ 29 case 29:case _
28 their their PRON PRP$ Case=Gen|Number=Plur|Person=3|Poss=Yes|PronType=Prs 29 nmod:poss 29:nmod:poss Entity=(192-person-acc:aggr-cf1-1-coref)|SplitAnte=1<192,97<192
29 faces face NOUN NNS Number=Plur 26 nmod 26:nmod:of Entity=191)|SpaceAfter=No
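The bracketing in the example above can be unpacked mechanically. The following is a minimal Python sketch, not part of the GUM tooling: the function name parse_entity and the event labels "open", "single" and "close" are hypothetical, and the sketch assumes that only the final identity field may itself contain hyphens, so that the first six fields can be split off safely.

```python
import re

# Field names as declared in the "# global.Entity" comment.
FIELDS = "GRP-etype-infstat-centering-minspan-link-identity".split("-")

# One piece of an Entity value: either "(fields" with an optional ")"
# (a mention opening here, possibly a single-token mention), or "ID)"
# (a mention that was opened on an earlier token and closes here).
PIECE = re.compile(r"\((?P<open>[^()]+)(?P<single>\))?|(?P<close>[^()]+)\)")

def parse_entity(value):
    """Split an Entity value into a list of (event, fields) tuples."""
    events = []
    for m in PIECE.finditer(value):
        if m.group("open") is not None:
            # At most 7 fields; the identity field is optional and comes last.
            parts = m.group("open").split("-", len(FIELDS) - 1)
            mention = dict(zip(FIELDS, parts))
            kind = "single" if m.group("single") else "open"
            events.append((kind, mention))
        else:
            events.append(("close", {"GRP": m.group("close")}))
    return events
```

Applied to the value at token 13, (175-time-giv:inact-cf5-1-coref)189)188), this yields one single-token mention for group 175 followed by closing events for groups 189 and 188, in that order.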
In addition, a list of the globally most salient entities in each document can be found in the metadata at the beginning of the document, for example:
# meta::salientEntities = 1, 5, 6, 7, 8, 12, 98, 173, 180, 181, 182, 183, 184
Where the value 1 stands for Padalecki, as in the annotations above.
Possible values for the other annotations mentioned above are:
# transition annotations
For equivalent Wikidata identifiers for each Wikipedia article title, see this file.
The annotations SplitAnte and Bridge mark non-strict identity anaphora (see the Universal Anaphora project for more details). For example, at token 28 in the example, the pronoun "their" refers back to two non-adjacent entities, requiring a split antecedent annotation. The value SplitAnte=1<192,97<192 indicates that 192-person (the pronoun "their") refers back to two previous Entity annotations, with pointers separated by a comma: 1 (1-person-...Jared_Padalecki) and 97 (97-person-...Jensen_Ackles).
Bridging anaphora is annotated when an entity has not been mentioned before, but is resolvable in context by way of a different entity: for example, token 2 has the annotation Bridge=173<188, which indicates that although 188-event ("the second campaign...") has not been mentioned before, its identity is mediated by the previous mention of another entity, 173-abstract (the project "Always Keep Fighting", mentioned earlier in the document, to which the campaign event belongs). In other words, readers can infer that "the second campaign" is part of the already introduced larger project, which also had a first campaign. This inference also leads to the information status label acc:inf, accessible-inferable.
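Both SplitAnte and Bridge values share the same pointer syntax, with the antecedent ID to the left of < and the anaphor ID to the right. A minimal Python sketch of reading them follows; the function name parse_links is hypothetical and the sketch assumes well-formed values as in the examples above.

```python
def parse_links(value):
    """Parse 'A<B,C<B' into a list of (antecedent_id, anaphor_id) pairs."""
    pairs = []
    for item in value.split(","):      # multiple pointers are comma-separated
        antecedent, anaphor = item.split("<")
        pairs.append((antecedent, anaphor))
    return pairs
```

With the values from the example sentence, parse_links("1<192,97<192") gives two pairs that both point to entity 192, and parse_links("173<188") gives the single bridging link from 173 to 188.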
Discourse annotations are given in eRST dependencies following the conversion from RST constituent trees as suggested by Li et al. (2014) - for the original RST constituent parses of GUM see the source repo. At the beginning of each Elementary Discourse Unit (EDU), an annotation Discourse gives the discourse function of the unit beginning with that token, followed by a colon, the ID of the current unit, and an arrow pointing to the ID of the parent unit in the discourse parse. For instance, Discourse=purpose-goal:105->104:0:syn-inf-963 at token 21 in the example above means that this token begins discourse unit 105, which functions as a purpose-goal to unit 104, which begins at token 1 in this sentence ("Padalecki partnered with co-star Jensen Ackles --purpose-goal-> to release a shirt..."). The third number :0 indicates that the attachment has a depth of 0, without an intervening span in the original RST constituent tree (this information allows deterministic reconstruction of the RST constituent discourse tree from the conllu file). The final part of the Discourse annotation indicates categorized signals which correspond to the discourse relation in question, as defined by eRST - in this case, syn-inf-963 indicates a syntactic signal (syn) of the subtype "infinitival_clause" (inf), since the purpose relation is signaled by the use of an infinitive, a typical strategy in English. The index 963 refers to the position of the signal, in this case token number 963 in the document (excluding empty nodes), the infinitive 'to' (token 21 in the sentence). Multiple signals are separated by +. See below for the inventory of signal types.
Additionally, note that multiple discourse relations can sometimes occur on the same line, since eRST allows multiple concurrent and tree-breaking relations to be identified. In such cases the multiple relation entries will be separated by ; and ordered such that the primary relation (which indicates RST nuclearity and is guaranteed to be projective in the discourse tree) will be serialized first, and non-projective secondary relations are guaranteed to be serialized subsequently. The unique ROOT node of the discourse tree has no arrow notation, e.g. Discourse=ROOT:2:0 means that this token begins unit 2, which is the Central Discourse Unit (or discourse root) of the current document. Although it is easiest to recover RST constituent trees from the source repo, it is also possible to generate them automatically from the dependencies with depth information, using the scripts in the rst2dep repo.
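The serialization described above (relation, unit and parent IDs, depth, then +-separated signals, with ;-separated entries for concurrent relations) can be read with a few string splits. The following is a minimal Python sketch rather than the rst2dep implementation: the function name parse_discourse and the dictionary keys are hypothetical, and it assumes that relation names contain no colons, that secondary relations use the same layout as the primary one, and that each signal is written as type-subtype followed by zero or more integer token positions.

```python
def parse_discourse(value):
    """Parse a Discourse value into a list of relation dictionaries."""
    relations = []
    for entry in value.split(";"):            # primary relation comes first
        parts = entry.split(":", 3)           # relation, units, depth, [signals]
        relation, units, depth = parts[0], parts[1], int(parts[2])
        unit, _, parent = units.partition("->")
        signals = []
        if len(parts) == 4 and parts[3] not in ("", "_"):
            for sig in parts[3].split("+"):   # multiple signals joined by "+"
                sig_type, subtype, *positions = sig.split("-")
                signals.append((sig_type, subtype, [int(p) for p in positions]))
        relations.append({
            "relation": relation,
            "unit": int(unit),
            "parent": int(parent) if parent else None,  # ROOT has no parent
            "depth": depth,
            "signals": signals,
        })
    return relations
```

For the ROOT example, parse_discourse("ROOT:2:0") returns a single entry for unit 2 whose parent is None and whose signal list is empty.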
Discourse relations in GUM are defined based on the effect that W (a writer/speaker) has on R (a reader/hearer) by modifying a Nucleus discourse unit (N) with another discourse unit (a Satellite, S, or another N). Discourse relation units can precede their nuclei (satellite-nucleus, or SN relation), follow them (NS), or be coordinated with each other (NN or multinuclear relations). Relations are classified hierarchically into 15 major classes and include:
Relation signals fall into nine major classes, most with several subtypes each, and include:
Markup from the original XML annotations using TEI tags is available in the XML MISC annotation, which indicates which XML tags, if any, were opened or closed before or after the current token, and in what order. In tokens 7-9 in the example above, the XML annotations indicate the words "Always Keep Fighting" were originally italicized using the tag pair <hi rend="italic">...</hi>, which opens at token 7 and closes after token 9. To avoid confusion with the = sign in MISC annotations, XML = signs are escaped and represented as :::.
7 Always Always ADV NNP Number=Sing 8 advmod 8:advmod XML=<hi rend:::"italic">
8 Keep Keep PROPN NNP Number=Sing 10 compound 10:compound _
9 Fighting Fighting PROPN NNP Number=Sing 8 xcomp 8:xcomp XML=</hi>
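Because = signs inside XML values are escaped, the MISC column can be split on | and on the first = of each item without ambiguity. A minimal Python sketch of this follows; the function name parse_misc is hypothetical, and the sketch assumes that XML values contain no | characters.

```python
def parse_misc(misc):
    """Split a MISC column into a dict, restoring '=' inside XML values."""
    if misc == "_":                       # CoNLL-U placeholder for "empty"
        return {}
    fields = {}
    for item in misc.split("|"):
        key, _, val = item.partition("=")  # split on the first "=" only
        if key == "XML":
            val = val.replace(":::", "=")  # undo the escaping of XML "=" signs
        fields[key] = val
    return fields
```

For token 7 above, parse_misc('XML=<hi rend:::"italic">') recovers the original opening tag with its attribute, while other keys such as Entity or SpaceAfter are returned unchanged.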
XML block tags spanning whole sentences (i.e. not beginning or ending mid sentence), such as paragraphs (<p>) or headings (<head>) are instead represented using the standard UD # newpar_block comment under the # newpar comment, which may however feature nested tags, for example:
# newpar
# newpar_block = list type:::"unordered" (10 s) | item (4 s)
This comment indicates the opening of a <list type="unordered"> block element, which spans 10 sentences, written as (10 s). However, the list begins with a nested block, a list item (i.e. a bullet point), which spans 4 sentences, as indicated after the pipe separator. For documentation of XML elements in GUM, please see the GUM wiki.
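The # newpar_block value can be decomposed into its nested blocks and their sentence counts. The following is a minimal Python sketch under the assumptions that blocks are separated by |, that each block ends in a count of the form (N s), and that attribute values contain no | characters; the function name parse_newpar_block is hypothetical.

```python
import re

# A block is a tag (with optional escaped attributes) followed by "(N s)".
BLOCK = re.compile(r"^(?P<tag>.*?)\s*\((?P<n>\d+) s\)$")

def parse_newpar_block(value):
    """Return a list of (tag_with_attributes, sentence_count), outermost first."""
    blocks = []
    for part in value.split("|"):
        m = BLOCK.match(part.strip())
        if m is None:
            raise ValueError(f"unrecognized block: {part!r}")
        tag = m.group("tag").replace(":::", "=")  # undo "=" escaping
        blocks.append((tag, int(m.group("n"))))
    return blocks
```

For the comment shown above, this gives the unordered list spanning 10 sentences first, followed by the nested item spanning 4 sentences.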
More information and additional annotation layers can also be found in the GUM source repo.
Document metadata is given at the beginning of each new document in key-value pair comments beginning with the prefix meta::, as in:
# newdoc id = GUM_bio_padalecki
# global.Entity = GRP-etype-infstat-centering-minspan-link-identity
# meta::author = Wikipedia, The Free Encyclopedia
# meta::dateCollected = 2019-09-10
# meta::dateCreated = 2004-08-14
# meta::dateModified = 2019-09-11
# meta::genre = bio
# meta::salientEntities = 1, 5, 6, 7, 8, 12, 98, 173, 180, 181, 182, 183, 184
# meta::sourceURL = https://en.wikipedia.org/wiki/Jared_Padalecki
# meta::speakerCount = 0
# meta::summary = Jared Padalecki is an award winning American actor who gained prominence in the series Gilmore Girls, best known for playing the role of Sam Winchester in the TV series Supernatural, and for his active role in campaigns to support people struggling with depression, addiction, suicide and self-harm.
# meta::title = Jared Padalecki
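Comments of this shape are straightforward to collect into a dictionary. A minimal Python sketch follows, assuming one key-value pair per comment line as in the example above; the function names read_meta and salient_entities are hypothetical.

```python
PREFIX = "# meta::"

def read_meta(lines):
    """Collect '# meta::key = value' comment lines into a dict."""
    meta = {}
    for line in lines:
        if line.startswith(PREFIX):
            # Split on the first "=" only, so values such as URLs stay intact.
            key, _, val = line[len(PREFIX):].partition("=")
            meta[key.strip()] = val.strip()
    return meta

def salient_entities(meta):
    """Return the salientEntities value as a list of entity group IDs."""
    raw = meta.get("salientEntities", "")
    return [item.strip() for item in raw.split(",") if item.strip()]
```

The IDs returned by salient_entities are kept as strings so that they can be compared directly with the group IDs found in Entity annotations.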
Document summaries are included in the metadata summary annotation and follow strict guidelines described here. For the test set, a second human written summary is available called summary2.
Additionally, sentences carry some sentence-level annotations in CoNLL-U comment annotations, such as sentence types in s_type (declarative, imperative, wh-question, fragment, etc.), as well as sentence transition types based on Centering Theory and sentence prominence levels based on graph proximity to the discourse parse root. For example, the fragment sentence below (frag) establishes a new backwards looking Center (establishment) and is a level-2 sentence (s_prominence = 2), i.e. its discourse nesting level is one further than a sentence containing the level-1 Central Discourse Unit of the entire text.
# s_prominence = 2
# s_type = frag
# transition = establishment
# text = Jared Padalecki
1 Jared Jared PROPN NNP Number=Sing 0 root 0:root MSeg=Jared
2 Padalecki Padalecki PROPN NNP Number=Sing 1 flat 1:flat _
The training, development and test sets contain complete, contiguous documents, balanced for genre. Test and dev contain similar amounts of data, usually around 1,800 tokens per genre in each split, and the rest is assigned to training. For the exact file lists in each split see:
https://github.com/UniversalDependencies/UD_English-GUM/tree/master/not-to-release/file-lists
GUM annotation team (so far - thanks for participating!)
Adrienne Isaac, Akitaka Yamada, Alex Giorgioni, Alexandra Berends, Alexandra Slome, Amani Aloufi, Amber Hall, Amelia Becker, Andrea Price, Andrew O'Brien, Ángeles Ortega Luque, Aniya Harris, Anna Prince, Anna Runova, Anne Butler, Arianna Janoff, Aryaman Arora, Ayan Mandal, Aysenur Sagdic, Bertille Baron, Bradford Salen, Brandon Tullock, Brent Laing, Caitlyn Pineault, Calvin Engstrom, Candice Penelton, Carlotta Hübener, Caroline Gish, Charlie Dees, Chenyue Guo, Chloe Evered, Cindy Luo, Colleen Diamond, Connor O'Dwyer, Cristina Lopez, Cynthia Li, Dan DeGenaro, Dan Simonson, Derek Reagan, Devika Tiwari, Didem Ikizoglu, Edwin Ko, Eliza Rice, Emile Zahr, Emily Pace, Emma Manning, Emma Rafkin, Ethan Beaman, Felipe De Jesus, Han Bu, Hana Altalhi, Hang Jiang, Hannah Wingett, Hanwool Choe, Hassan Munshi, Helen Dominic, Ho Fai Cheng, Hortensia Gutierrez, Jakob Prange, James Maguire, Janine Karo, Jehan al-Mahmoud, Jemm Excelle Dela Cruz, Jess Godes, Jessica Cusi, Jessica Kotfila, Jingni Wu, Joaquin Gris Roca, John Chi, Jongbong Lee, Juliet May, Jungyoon Koh, Katarina Starcevic, Katelyn Carroll, Katelyn MacDougald, Katherine Vadella, Khalid Alharbi, Kristen Cook, Lara Bryfonski, Lauren Levine, Leah Northington, Lindley Winchester, Linxi Zhang, Lucia Donatelli, Luke Gessler, Mackenzie Gong, Margaret Anne Rowe, Margaret Borowczyk, Maria Laura Zalazar, Maria Stoianova, Mariko Uno, Mary Henderson, Maya Barzilai, Md. Jahurul Islam, Michael Kranzlein, Michaela Harrington, Mingyeong Choi, Minnie Annan, Mitchell Abrams, Mohammad Ali Yektaie, Naomee-Minh Nguyen, Negar Siyari, Nicholas Mararac, Nicholas Workman, Nicole Steinberg, Nitin Venkateswaran, Parker DiPaolo, Phoebe Fisher, Rachel Kerr, Rachel Thorson, Rebecca Childress, Rebecca Farkas, Riley Breslin Amalfitano, Rima Elabdali, Robert Maloney, Ruizhong Li, Ryan Mannion, Ryan Murphy, Sakol Suethanapornkul, Sarah Bellavance, Sarah Carlson, Sasha Slone, Saurav Goswami, Sean Macavaney, Sean Simpson, Seyma Toker, Shane Quinn, Shannon Mooney, Shelby Lake, Shira Wein, Sichang Tu, Siddharth Singh, Siona Ely, Siyao Peng, Siyu Liang, Stephanie Kramer, Sylvia Sierra, Talal Alharbi, Tatsuya Aoyama, Tess Feyen, Timothy Ingrassia, Trevor Adriaanse, Ulie Xu, Wai Ching Leung, Wenxi Yang, Wesley Scivetti, Xiaopei Wu, Xiulin Yang, Yang Liu, Yi-Ju Lin, Yifu Mu, Yilun Zhu, Yingzhu Chen, Yiran Xu, Young-A Son, Yu-Tzu Chang, Yuhang Hu, Yunjung Ku, Yushi Zhao, Zhijie Song, Zhuosi Luo, Zhuxin Wang, Amir Zeldes
... and other annotators who wish to remain anonymous!
To cite the Reddit subset of GUM in particular, please use this citation:
@InProceedings{BehzadZeldes2020,
author = {Shabnam Behzad and Amir Zeldes},
title = {A Cross-Genre Ensemble Approach to Robust {R}eddit Part of Speech Tagging},
booktitle = {Proceedings of the 12th Web as Corpus Workshop (WAC-XII)},
pages = {50--56},
year = {2020},
}
As a scholarly citation for the GUM corpus as a whole, please use this article (note that this paper predates the inclusion of Reddit data in GUM):
@Article{Zeldes2017,
author = {Amir Zeldes},
title = {The {GUM} Corpus: Creating Multilayer Resources in the Classroom},
journal = {Language Resources and Evaluation},
year = {2017},
volume = {51},
number = {3},
pages = {581--612},
doi = {http://dx.doi.org/10.1007/s10579-016-9343-x}
}
2024-02-15
2023-10-31
2023-02-02
2022-10-21
2022-04-29
2022-01-31
2022-01-09
2021-12-14
2021-11-01
2021-09-23
HYPH xpos tag
_m suffix to multinuclear discourse dependencies (distinguishes multinuclear and satellite restatements)
2021-05-01
2021-03-10
2021-01-20
2020-10-31
2020-05-15 v2.6
=== Machine-readable metadata (DO NOT REMOVE!) ================================
Data available since: UD v2.6
License: CC BY 4.0
Includes text: no
Genre: blog social
Lemmas: manual native
UPOS: converted from manual
XPOS: manual native
Features: converted from manual
Relations: manual native
Contributors: Peng, Siyao;Zeldes, Amir
Contributing: elsewhere
Contact: amir.zeldes@georgetown.edu
===============================================================================