UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Should lemmas use normalized spelling? #513

Closed nschneid closed 3 years ago

nschneid commented 6 years ago

The current guidelines for Lemmas say very little about what the canonical form of a word should be.

In the UD_English corpus, there are some clear typos, e.g.:

The first is an example of a typo resulting in a non-word; the second is the spelling of a different word (with a different POS). The syntactic annotations are correct, but it would be helpful to be able to find these as examples of take and to, respectively. And for downstream semantic annotation (e.g. word senses) there needs to be a way to map to the correct dictionary form.

In the absence of a more elaborate policy for misspellings (#330), should we at least correct the spelling in the lemma?

dan-zeman commented 6 years ago

I think the lemma should follow the correct spelling. And ideally, the morphological features should indicate that the actual surface form is incorrect (Typo=Yes).

sylvainkahane commented 6 years ago

I think we really need additional columns in the CoNLL-U format, at least one for glosses. Corpora of "exotic" languages come with glosses, and we need them if we want to work with such corpora. It would also be useful for corpora of social media or of language learners to indicate corrections. Of course we can put everything in the same column as feature values, but it is not very convenient. Even for the presentation of non-English examples in the guidelines, it is very inconvenient not to have an additional line for the gloss.

martinpopel commented 6 years ago

An additional column would be a huge change to CoNLL-U (though possible in UD v3) and would break many existing tools. All data not using this column would become harder to read and write (unless we allow it to be omitted). Also, it is convenient to have MISC as the last column. The FEATS column is reserved for morphosyntactic features (ideally with a fixed list of possible values for each feature), so it is not suitable for glosses. I recommend using the MISC column and the semi-official attributes Gloss (usually an English translation), Translit (transliteration of FORM) and LTranslit (transliteration of LEMMA). This is used e.g. in UD_Arabic and several other treebanks:

1   ميراث   مِيرَاث NOUN    N------S1I  Case=Nom|Definite=Ind|Number=Sing   6   nsubj   _   Vform=مِيرَاثٌ|Gloss=inheritance,heritage|Root=w_r__t|Translit=mīrāṯun|LTranslit=mīrāṯ
2   ب   بِ  ADP P---------  AdpType=Prep    3   case    _   SpaceAfter=No|Vform=بِ|Gloss=by,with|Root=bi|Translit=bi|LTranslit=bi

dan-zeman commented 6 years ago

@sylvainkahane : This is a different thread, so please start a new issue if you want to extend the discussion. However, for glosses, there is an optional MISC attribute, already used in some treebanks, so please use it. See http://universaldependencies.org/format.html#other-miscellaneous-attributes
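
A minimal sketch of how the optional MISC attributes mentioned above (Gloss, Translit, LTranslit) could be read back out of a CoNLL-U token line. The function name `parse_misc` is hypothetical, and the sample value is the MISC column of token 2 in the UD_Arabic example, with the Vform attribute left out for brevity.

```python
def parse_misc(misc: str) -> dict:
    """Split a CoNLL-U MISC value such as 'Gloss=by,with|Translit=bi' into a dict."""
    if misc == "_":  # "_" marks an empty column in CoNLL-U
        return {}
    attributes = {}
    for item in misc.split("|"):
        # Split on the first "=" only, so values may themselves contain "="
        key, _, value = item.partition("=")
        attributes[key] = value
    return attributes


# MISC column of token 2 in the UD_Arabic example (Vform omitted for brevity)
misc = "SpaceAfter=No|Gloss=by,with|Root=bi|Translit=bi|LTranslit=bi"
attrs = parse_misc(misc)
print(attrs["Gloss"], attrs["Translit"])
```

Because the attributes live in MISC, tools that ignore the column keep working, which is the compatibility argument made against adding a new column.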

nschneid commented 6 years ago

@dan-zeman: Can we start using Typo=Yes now or is that currently invalid? Typo appears in the v2 feature list but its documentation says v1.

dan-zeman commented 6 years ago

It is in the "language-specific" (or here, treebank-specific?) domain, both in v1 and in v2. (In the v2 feature list, it only occurs in the automatically generated alphabetical listing in the lower part of the page, but it is not listed in the upper part; this should be fixed somehow.)

So you can start using it but you will have to list it as a treebank-specific feature, otherwise the validator will report your data as invalid.

nschneid commented 6 years ago

@dan-zeman you mean by adding it to https://github.com/UniversalDependencies/docs/tree/pages-source/_en/feat?

@sebschu, do you want me to go ahead and do this? Tomorrow I'm planning to go through the typos that we've encountered in lexical semantic annotation, so I could use that opportunity to fix the lemma and set Typo=Yes.

dan-zeman commented 6 years ago

@nschneid : No, I did not mean the English-specific documentation. (Of course you can add it there but we no longer assume that every language will have a complete mirror of the documentation of all tags / features / deprels, and this one already has a doc page.) What I meant is the data file for validator, here:

https://github.com/UniversalDependencies/tools/blob/master/data/feat_val.en

Note that these files are currently separate for every treebank, so e.g. if UD_English-GUM uses Typo=Yes, then the feature must be also added to feat_val.en_gum.
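
A rough sketch of why an unlisted feature gets reported as invalid, assuming (hypothetically) that a file like feat_val.en holds one permitted Feature=Value pair per line. The helper names and the file format are illustrative assumptions, not the actual validator code.

```python
def load_allowed(lines):
    """Collect permitted Feature=Value pairs, skipping blank lines (assumed format)."""
    return {line.strip() for line in lines if line.strip()}


def unknown_features(feats: str, allowed: set) -> list:
    """Return the Feature=Value pairs in a FEATS string that are not permitted."""
    if feats == "_":  # empty FEATS column
        return []
    return [fv for fv in feats.split("|") if fv not in allowed]


# Hypothetical contents of a treebank-specific list before Typo=Yes is added
allowed = load_allowed(["Number=Sing", "Tense=Past", ""])
print(unknown_features("Tense=Past|Typo=Yes", allowed))  # Typo=Yes is flagged

# After the treebank maintainer lists the feature, the same token passes
allowed.add("Typo=Yes")
print(unknown_features("Tense=Past|Typo=Yes", allowed))
```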

amir-zeldes commented 6 years ago

That's an interesting idea for UD_English-GUM, and in fact we already have an annotation layer for errors (not all of which are typos though), and we have coincidentally been following this guideline already (lemma has the 'corrected' lemma form). We used the TEI <sic> annotation, which is very much the same as saying 'there's some kind of error here', so just like typo, for example:

https://corpling.uis.georgetown.edu/annis/#_q=c2ljIF89XyBsZW1tYSBfPV8gdG9rICYgIzIgIT0gIzM&_c=R1VN&cl=5&cr=5&s=0&l=10

I should point out one shortcoming of this approach which we ran into, which is that the corrected token form is not available anywhere. For example in the first result in the link above:

FORM   podcasters  who  where  doing  literary  stuff
LEMMA  podcaster   who  be     do     literary  stuff
TYPO                    TRUE

We can tell that this sentence has an instance of 'be', but not that the inflected past auxiliary 'were' is in this sentence. Another issue is that sometimes we are not sure which word is the 'typo' - we usually annotated multiple tokens inside for these cases:

what events is today important?

So here it's unclear if it should say 'event' or 'are'...

The example sentence "what events is today important?" also brings up the issue of questionable word ordering (>> what events is important today?). Where should this be indicated? We will want a different ticket for this, too.

nschneid commented 6 years ago

@amir-zeldes Right, the general issue of errors is more complex: see #330.

nschneid commented 6 years ago

For the more limited cases I was raising here, I propose we alter the guidelines where it says

The LEMMA field should contain the canonical or base form of the word, such as the form typically found in dictionaries.

to add

At present, treebanks have considerable leeway in interpreting what "canonical or base form" means. In general, a canonical form should collapse inflectional and minor orthographic/spelling variation (such as casing, accents/diacritics, and typos). In the lemma field, some treebanks may choose to aggressively normalize spelling variation that may reflect dialect or authorial style.

In addition to normalizing spelling in lemmas, treebanks are encouraged to adopt the optional morphological feature Typo=Yes for clear accidental misspellings of a word (e.g. ltake for take or too for to). Treebank maintainers should take care not to use Typo=Yes for words that may reflect actual linguistic variation, e.g., dialect, style, or nonnative grammar.

(There is currently no UD-wide policy for lemmas of apparently erroneous extra words, missing words, or incorrectly segmented words.)
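
To illustrate what the proposal buys in practice, here is a small sketch that searches by normalized lemma and filters on Typo=Yes. The token dictionaries, the UPOS tags, and the other feature values are hypothetical; only the misspellings ltake and too come from the proposed wording.

```python
def has_feature(feats: str, name: str, value: str) -> bool:
    """True if the FEATS string (e.g. 'Tense=Past|Typo=Yes') contains name=value."""
    return feats != "_" and f"{name}={value}" in feats.split("|")


def forms_for_lemma(tokens, lemma):
    """Return the surface forms of all tokens annotated with the given lemma."""
    return [tok["form"] for tok in tokens if tok["lemma"] == lemma]


# Hypothetical annotated tokens with corrected lemmas
tokens = [
    {"form": "ltake", "lemma": "take", "upos": "VERB", "feats": "Typo=Yes|VerbForm=Inf"},
    {"form": "too", "lemma": "to", "upos": "PART", "feats": "Typo=Yes"},
    {"form": "take", "lemma": "take", "upos": "VERB", "feats": "VerbForm=Inf"},
]

# A lemma query now finds the misspelled instance alongside the regular one
print(forms_for_lemma(tokens, "take"))

# The feature separates accidental misspellings from ordinary tokens
typos = [tok["form"] for tok in tokens if has_feature(tok["feats"], "Typo", "Yes")]
print(typos)
```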

amir-zeldes commented 6 years ago

Agreed - the incorrect segmentation issue is tricky, esp. for 'goeswith' cases. One example we had was 'before hand' spelled separately - we ended up lemmatizing separately, as 'before' and 'hand', but putting a tag around the whole thing and using 'goeswith' for the dependency...

nschneid commented 6 years ago

Another issue relevant here is abbreviations. For uncommon abbreviations/shortened forms (like w for with, btwn for between, thru for through), I'm inclined to say we should use the canonical spelling in the lemma and apply the feature Abbr=Yes. For common abbreviations like vs. for versus and etc. for et cetera, perhaps we should keep the surface form in the lemma.

nschneid commented 6 years ago

Actual sentence in the corpus: "Thi$ $ervice will co$t." OrthographicWordplay=Yes? :D

sebschu commented 6 years ago

Another issue relevant here is abbreviations. For uncommon abbreviations/shortened forms (like w for with, btwn for between, thru for through), I'm inclined to say we should use the canonical spelling in the lemma and apply the feature Abbr=Yes. For common abbreviations like vs. for versus and etc. for et cetera, perhaps we should keep the surface form in the lemma.

@nschneid I think it will be quite hard to distinguish what constitutes a common/uncommon abbreviation (for example, I think w for with is actually quite common...). So I'd suggest we normalize all abbreviations/short forms of single words but keep abbreviations for multiple words (e.g., UN or CPU) intact. I think that would allow for more consistent annotations.

nschneid commented 5 years ago

Looking at EWT lemmas that are not in a dictionary, I see it's not just typos and abbreviations; "sooo", "soooo", and "sooooo" are attested, for example. I wonder if we need ExpressiveSpelling=Yes.

amir-zeldes commented 5 years ago

I'd vote to lemmatize as 'so', since users searching the corpus can't predict how many o's will appear and it seems like an arbitrary inflation of the lemma type count.
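
One way to see why a single lemma is attractive: a simple heuristic can map every lengthened variant to the same candidate. The sketch below is only an illustration, assuming a hypothetical small lexicon; it shrinks each run of three or more identical characters to one or two characters and keeps the candidate found in the lexicon.

```python
import re
from itertools import product

_RUN = re.compile(r"(.)\1{2,}")  # a character repeated three or more times


def lengthening_candidates(form: str) -> set:
    """Return candidate spellings with each long run shrunk to one or two characters."""
    lowered = form.lower()
    runs = list(_RUN.finditer(lowered))
    candidates = set()
    for choice in product((1, 2), repeat=len(runs)):
        out, pos = [], 0
        for match, n in zip(runs, choice):
            out.append(lowered[pos:match.start()])
            out.append(match.group(1) * n)
            pos = match.end()
        out.append(lowered[pos:])
        candidates.add("".join(out))
    return candidates


def normalize(form: str, lexicon: set):
    """Pick a candidate that is in the lexicon, or None if none matches."""
    matches = sorted(lengthening_candidates(form) & lexicon)
    return matches[0] if matches else None


lexicon = {"so", "nice", "cool"}  # hypothetical dictionary
for form in ["sooo", "soooo", "sooooo", "niiiiice", "cooool"]:
    print(form, "->", normalize(form, lexicon))
```

A heuristic like this can only propose candidates; ambiguous cases would still need an annotator's decision.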

rueter commented 5 years ago

@nschneid I'd also vote for "so" as the lemma. I am also glad you have brought up the idea of ExpressiveSpelling=Yes, although I am not sure it will cover other conceivable patterns. For example, a shivering or stuttering speaker may produce word-initial reduplication: "It's c-cold in here." Would this latter instance then be DescriptiveSpelling=Yes? In (1), sooo, we are indicating the expressive language of the speaker, who is most likely attempting to convey something by saying it. In (2), c-cold, we are providing a description that draws attention not to assessment or evaluation but to the speech faculties of the actor.

nschneid commented 5 years ago

@rueter good point. But I wonder if the line between expressive and descriptive would be tough to draw in some cases. E.g. in quoted speech, would "sooo" be descriptive rather than expressive?

Should the expressive use of capitalization be tagged as expressive/descriptive, even if it doesn't show up in the lowercased lemma? E.g. when a word is in all-caps for emphasis or to indicate it was said loudly.

Other cases to think about:

Some attested English examples at https://github.com/UniversalDependencies/UD_English-EWT/issues/68

Speakers of tonal languages: can tones be modified in expressive ways?

nschneid commented 3 years ago

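This has come up again for EWT: we want a way to provide a normalized spelling for a form that is not really a typo or abbreviation but conveys something more. How about we pilot the feature for English with the following definition:

* `ExpressiveSpelling=Yes` signals that the form of a word uses a spelling that is nonstandard _in ways other than capitalization_, where the speaker presumably intended to convey some additional detail of pronunciation or meaning (and the word is not an abbreviation or typo). For example, expressive lengthening ("niiiiice"), dialectal or colloquial pronunciation ("Hahvahd"), censored characters ("sh*t"), symbolic characters ("CA$H"), etc. This feature should be paired with `CorrectForm` in the MISC field providing the standard spelling, and the lemma should use a standard spelling as well. Spellings that are nonstandard merely in capitalization choices are not currently flagged with any feature or `CorrectForm`, though lemmas should not use idiosyncratic capitalization.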

amir-zeldes commented 3 years ago

I've co-authored a paper on non-standard language in treebanks of User-Generated Content, which originally appeared here:

https://www.aclweb.org/anthology/2020.lrec-1.645.pdf

And an expansion of which is currently under review. In it we reviewed some existing TB practices to code these things, and we recommend following the existing attribute Style. From the paper:

Style=X, employed by some treebanks to describe various aspects of linguistic style such as [Coll: colloquial, Expr: expressive, Vrnc: vernacular, Slng: slang]

So for expressing lengthening I think this could be Style=Expr.

nschneid commented 3 years ago

Based on https://universaldependencies.org/u/feat/Style.html, I took the Style feature to be about word choice and morphology, rather than orthographic choices. Have treebanks been using it for those as well? Is it not worth distinguishing when a form reflects an ordinary word spelled/pronounced differently ("niiiiice") versus an expressive morphological inflection?

amir-zeldes commented 3 years ago

I definitely agree that "expressive" can mean different things, and maybe we need different values to distinguish a paradigmatic morphological category traditionally called "expressive" from expressive spellings. But I think the feature name could accommodate both, if they correspond to a stylistic choice (which I understood this to mean - otherwise if it is a regular morphological category "expressivity" it should not be in a feature called style IMO)

Adding @msang @dseddah @tlynn747 (sorry I don't know everybody else's GH handles, feel free to add!)

nschneid commented 3 years ago

From the example

(The diminutive signals affection rather than size. The neutral equivalent would be čokolád.)

it seems to be explaining the choice of inflection (diminutive) which on pure semantic grounds might be unexpected, but makes sense as a stylistic signal.

But perhaps Style=Expr should be interpreted broadly. I am just sensitive to the fact that other features (Typo, Abbr, SpaceAfter, ...) are specific to orthographic rather than grammatical choices.

nschneid commented 3 years ago

@dan-zeman Care to weigh in regarding the feature name? I figure you have an opinion on how Style=Expr was intended and whether it could also apply to expressive spelling choices.

dan-zeman commented 3 years ago

I don't see any problem with extending Style=Expr to spelling choices. (And then it would be niiiiice to mention such examples in the documentation :-).)

As for how it was intended, well, it was originally intended to preserve annotation from the Prague Dependency Treebank, which itself turned out to be quite inconsistent. So this should not and does not define anything. I think the speaker/writer has various choices that are appropriate for a particular style/register, and the choices may be lexical, morphological, syntactic, but also phonological and orthographical.

Stormur commented 3 years ago

This has come up again for EWT: we want a way to provide a normalized spelling for a form that is not really a typo or abbreviation but conveys something more. How about we pilot the feature for English with the following definition:

* `ExpressiveSpelling=Yes` signals that the form of a word uses a spelling that is nonstandard _in ways other than capitalization_, where the speaker presumably intended to convey some additional detail of pronunciation or meaning (and the word is not an abbreviation or typo). For example, expressive lengthening ("niiiiice"), dialectal or colloquial pronunciation ("Hahvahd"), censored characters ("sh*t"), symbolic characters ("CA$H"), etc. This feature should be paired with `CorrectForm` in the MISC field providing the standard spelling, and the lemma should use a standard spelling as well. Spellings that are nonstandard merely in capitalization choices are not currently flagged with any feature or `CorrectForm`, though lemmas should not use idiosyncratic capitalization.

Maybe some of the more regular types of "expressive" spellings, the ones which border on inflection/derivation, might be covered by the Emphatic feature I proposed in #741, which we are, as it were, experimentally introducing in the new Latin treebank.

Otherwise Style seems to be a quite perfect fit!

By the way, as @amir-zeldes mentioned, altered word forms of the diminutive/augmentative/endearment/pejorative type probably deserve more specific values in all those languages which distinguish at least some of them systematically (like Italian: casa 'house/home' -> dim. casetta, casina; end. casuccia, casotta; pej. casaccia, casaccia; aug. casona; etc.). I mean, I think the expressivity of niiiiiiiiiiice (emphatic and spontaneous) is different from that of homey (diminutive/endearment and systematic), also from a formal perspective. The former might correspond to a generic Style=Expr or Emphatic=Yes, the latter to something like Variant=Dim (I found Derivation, too).

nschneid commented 3 years ago

I don't think all of the expressive spellings I have in mind are emphatic, and I'd rather not try to split hairs between emphatic spellings and other types of expressive spellings. So I support Style=Expr for spelling variants.
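
A sketch of what such an annotation might look like in practice, assuming Style=Expr takes over the role of the proposed ExpressiveSpelling=Yes and is still paired with CorrectForm in MISC as in the quoted definition. The helper names and the Degree=Pos value are hypothetical; features are kept sorted by name, as CoNLL-U requires.

```python
def add_feature(feats: str, name: str, value: str) -> str:
    """Add name=value to a CoNLL-U FEATS string, keeping features sorted by name."""
    pairs = {} if feats == "_" else dict(fv.split("=", 1) for fv in feats.split("|"))
    pairs[name] = value
    ordered = sorted(pairs.items(), key=lambda kv: kv[0].lower())
    return "|".join(f"{k}={v}" for k, v in ordered)


def add_misc(misc: str, name: str, value: str) -> str:
    """Append name=value to a CoNLL-U MISC string."""
    return f"{name}={value}" if misc == "_" else f"{misc}|{name}={value}"


# Hypothetical token: expressive lengthening of "nice"
token = {"form": "niiiiice", "lemma": "nice", "feats": "Degree=Pos", "misc": "_"}
token["feats"] = add_feature(token["feats"], "Style", "Expr")
token["misc"] = add_misc(token["misc"], "CorrectForm", "nice")
print(token)
```

Keeping the standard spelling in both LEMMA and CorrectForm addresses the earlier concern that the corrected surface form is otherwise not recoverable.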

An Emphatic feature sounds like a good idea for morphological markers that primarily signal emphasis, but that probably doesn't apply to English.

dseddah commented 3 years ago

Talking about Emphatic morphological features, how about -ish in English? Is there anything special about it in the various English treebanks?

a) you’d easily pay £7 for a large glass of wine (so think that’s about $9 ish?)
b) Isn't it funny how hybe color is yellow(ish, like lime yellow) and now bts' next single is yellow as well?
c) hellooo it was okayish but feelings can hit you anytime it seems

(I'd love to see the UD analysis for (b))

nschneid commented 3 years ago

"ish" is a hedge or similarity-marker, which doesn't feel like emphasis to me.

We do have the occasional affix that is tokenized as a separate word. They're not always annotated consistently, e.g. UniversalDependencies/UD_English-EWT#152.

(Fun fact: for a course I once wrote a squib about "ish" and its colloquial usages. It can act like an affix or like a separate discourse particle.)

nschneid commented 3 years ago

I've updated the docs:

Hopefully this clears everything up!