Inconsistent lemmatization of English punctuation

rhdunn commented 1 year ago

Looking across the English treebanks, I've found some inconsistencies in how punctuation tokens are lemmatized:

ellipsis:
- ... (\u002E\u002E\u002E) is lemmatized as is in EWT and PUD, but as … (\u2026) in GUM and GENTLE.
- … (\u2026) is lemmatized as is in GUM, GENTLE, and PUD. In EWT it is lemmatized as . even when it is mid-sentence punctuation, which is an error.
- At the end of a sentence EWT lemmatizes ellipses as . whereas the other treebanks keep them as is.
double quotes:
- ", “, and ” are lemmatized as " (\u0022) in EWT and PUD, but as '' (\u0027\u0027) in GENTLE, GUM, and GUMReddit. This looks like a lemmatization error in the GUM/GENTLE treebanks.
hyphens and dashes:
- - (\u002D) is lemmatized consistently as - (\u002D) across the treebanks.
- – (\u2013) EN DASH is lemmatized consistently as - (\u002D) across the treebanks.
- — (\u2014) EM DASH is lemmatized as - (\u002D) in EWT, but as is in GENTLE, GUM, and PUD.
- -- (\u002D\u002D) and --- (\u002D\u002D\u002D) are lemmatized as is in EWT, but as - (\u002D) in GENTLE, GUM, and GUMReddit.
- EWT has forms with 4 or more - (\u002D) where it keeps the lemma as is. Some of these are mid-sentence punctuation (so would be candidates for a single - (\u002D)), whereas others are the sole punctuation in their sentences, which would indicate their use as a section break rather than as a dash separating clauses.

It would be good to have a single, consistent lemmatization across the treebanks for these.
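
One way to surface discrepancies like these is to tabulate, per treebank, the lemma assigned to each punctuation form. The following is a minimal sketch and not the script used for the survey above: it assumes standard 10-column CoNLL-U input and treats tokens with UPOS `PUNCT` as punctuation. The helper name `punct_lemma_table` is hypothetical and the sample sentence is made up.

```python
from collections import Counter, defaultdict


def punct_lemma_table(conllu_text):
    """Map each PUNCT form to a Counter of the lemmas it receives."""
    table = defaultdict(Counter)
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword token ranges (1-2) and empty nodes (1.1)
        form, lemma, upos = cols[1], cols[2], cols[3]
        if upos == "PUNCT":
            table[form][lemma] += 1
    return table


# Illustrative sample only; not taken from any treebank.
SAMPLE = "\n".join([
    "# text = Wait \u2014 what \u2026",
    "1\tWait\twait\tVERB\t_\t_\t0\troot\t_\t_",
    "2\t\u2014\t-\tPUNCT\t_\t_\t1\tpunct\t_\t_",
    "3\twhat\twhat\tPRON\t_\t_\t1\tobj\t_\t_",
    "4\t\u2026\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_",
])

for form, lemmas in punct_lemma_table(SAMPLE).items():
    print(repr(form), dict(lemmas))
```

Running this over each treebank's files and comparing the resulting tables would show which forms receive different lemmas in different treebanks.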

amir-zeldes commented 1 year ago

Agreed, thanks - I would collapse em dash to hyphen and can implement that in the GU datasets to match en dash. The ellipsis behavior in GUM seems correct to me.

For double and single quotes, GU corpora intentionally use a straight single quote for any type of single quote, and two straight single quotes for any type of double quotes. This is maybe sort of LaTeX inspired. Some practical reasons for this early on were old TreeTagger models that did that and the use of some XML tools that delimit annotations as double-quoted attributes. Never using double straight quotes allows us to use these tools without worrying about escaping, and to get only one kind of quotes in json files for convenience. If no one objects strongly I would like to keep this practice of only the plain straight quote in lemmas.

rhdunn commented 1 year ago

I'd prefer double quotes (straight or curly) to have the " (\u0022) lemma, as the CoNLL-U format is designed around the linguistic characteristics of the tokens, not their representation in another format. It also aligns with how they are lemmatized in EWT and PUD. -- Having a different lemmatization for GUM/GENTLE would produce errors in output from trained models.

martinpopel commented 1 year ago

I am old enough to remember TreeTagger and the PennTB decision to lemmatize quotes using the LaTeX escape codes `` and ''. I also remember I hated this decision even then - if there is a tool that cannot accept quotes on the input, we should write a wrapper/API for that tool that will do the de/escaping transparently, but we should not mess up the data. BTW: I think TreeTagger used to lemmatize any number (written in digits) as @card@, but I hope we don't want to do this in UD just to stay backward compatible with TreeTagger.

amir-zeldes commented 1 year ago

No, this was definitely not a call to conform with TreeTagger, just a historical explanation.

As of right now, we do have tools in the pipeline that don't do proper XML escaping in annotations, so I hesitate to change the double quote lemma, though we can consider it as a longer-term to-do. Keep in mind though that if we lemmatize smart quotes to \u0022, then we are already choosing some meta-representation for the class of double quotes, so nothing really prevents us from saying that the double quote lemma should be '' (two single quotes), in keeping with the PTB tradition (and it is also the xpos tag, a decision I think we are definitely stuck with).

martinpopel commented 1 year ago

we do have tools in the pipeline that don't do proper XML escaping in annotations

I'm not sure how this is related to XML escaping (CoNLL-U is not XML; XML requires escaping also other 4 characters, which are not escaped in GUM lemmas; the XML way of escaping quotes is &quot;, not `` and '', but it is needed only in XML attributes, not in XML text). Either way you can write a wrapper for these tools, which converts the input into the format they require, including changing the lemma of quotes.

if we lemmatize smart quotes to \u0022

This is another question (relevant for this GitHub issue) and I am not so sure here. I would be OK with form=lemma for all punctuation. I can imagine disambiguating the opening and closing quotation mark in the lemma, i.e. using the curly/typographic quotes (also called "smart quotes" because of the feature in word processors) as the lemma for straight quotes. That said, I can also imagine the opposite approach (straight quotes as lemma), motivated by users who search for any quotes (and don't know how to write a regex matching any quotes).
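
The search concern in the last sentence can be illustrated with a small pattern. This is a sketch only: it covers the straight quote and the two curly double quotes discussed in this issue, plus the PTB-style `` and '' sequences. Other quotation marks (e.g. guillemets) are deliberately left out, and the names `ANY_DOUBLE_QUOTE` and `is_double_quote` are hypothetical.

```python
import re

# Straight ("), left curly (U+201C), right curly (U+201D), or PTB-style `` / ''.
ANY_DOUBLE_QUOTE = re.compile("[\"\u201C\u201D]|``|''")


def is_double_quote(token):
    """Return True if the whole token is one of the double-quote spellings."""
    return ANY_DOUBLE_QUOTE.fullmatch(token) is not None
```

With form=lemma, a user would need a pattern like this on the lemma column to find all double quotes. With a single agreed lemma, an exact string match would be enough.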

nschneid commented 1 year ago

From a purely UD standpoint it seems arbitrary to lemmatize quotes with an "escaped" spelling that wouldn't appear in most surface text, as escaping is not an issue in the .conllu format.

If there was a very strong tradition of doing this across the English NLP ecosystem (i.e. if all the modern lemmatizers mapped " to '') then there would be an argument to follow that in UD, but I don't believe that's the case. PTB is rather outdated as an encoding standard for non-alphanumeric characters.

I'm not intimately familiar with the GUM pipeline, but would a short-term solution be to postprocess the .conllu file just when publishing it to the official UD repo?
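
Such a publish-time step could look roughly like the sketch below. It is not part of the GUM build. It assumes 10-column CoNLL-U lines, rewrites only the LEMMA column of `PUNCT` tokens whose lemma is '' (two straight single quotes), and leaves FORM and XPOS untouched. The function name `normalize_quote_lemmas` is hypothetical.

```python
def normalize_quote_lemmas(conllu_text):
    """Replace the PTB-style '' lemma of PUNCT tokens with a straight double quote."""
    out = []
    for line in conllu_text.split("\n"):
        cols = line.split("\t")
        # Token lines have exactly 10 tab-separated columns; comments and
        # blank lines do not, so they pass through unchanged.
        if len(cols) == 10 and cols[3] == "PUNCT" and cols[2] == "''":
            cols[2] = '"'
            line = "\t".join(cols)
        out.append(line)
    return "\n".join(out)
```

Because only one column of matching token lines changes, the output differs from the input in the affected lemmas and nowhere else.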

amir-zeldes commented 1 year ago

I agree, if I were doing this from scratch I would also choose double straight quotes as the lemma for all types of double quotes. And as I said above, I am also willing to change this in corpora we maintain at some point in the future, I just can't do it immediately due to the tools we work with right now. We could certainly convert things at a very late step before pushing to UD, but for the moment I would like to avoid that, since it would create a discrepancy between the general GUM repo and the UD version.

XML requires escaping also other 4 characters, which are not escaped in GUM lemmas

Yes, but some of the tools in our pipeline which process the lemmas have an interchange format of the type <elem key="value"> with double straight quotes, so concretely only that glyph is a concern at the moment.

Either way you can write a wrapper for these tools

I would rather fix the tool chain before implementing half solutions - the GUM build bot is pretty complex at this point, so it's not just about generating the UD conllu.

TL;DR - I'll try to get this done for the next UD release, but can't simply push a fix right now. Moving this specific sub-issue to amir-zeldes/gum#176

martinpopel commented 1 year ago

format of the type <elem key="value"> with double straight quotes, so concretely only that glyph is a concern at the moment.

Note that <elem key="one < two"> and <elem key="one & two"> are not valid XML snippets (that said, many tools accept or even require invalid XML).
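
This can be checked with a strict XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` and `xml.sax.saxutils.escape`. The elements are written as self-closing so that the attribute value is the only thing being tested, and the helper names `is_well_formed` and `attr_escape` are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(snippet):
    """Return True if the snippet parses as a standalone XML document."""
    try:
        ET.fromstring(snippet)
        return True
    except ET.ParseError:
        return False


def attr_escape(value):
    """Escape &, <, > and the double quote for use inside key="..."."""
    return escape(value, {'"': "&quot;"})


for raw in ["one < two", "one & two", 'say "hi"']:
    unescaped = '<elem key="%s"/>' % raw
    escaped = '<elem key="%s"/>' % attr_escape(raw)
    # The unescaped variant is rejected; the escaped one is accepted.
    print(is_well_formed(unescaped), is_well_formed(escaped), escaped)
```

A wrapper around a tool that expects this interchange format could apply the same kind of escaping on the way in and reverse it on the way out, leaving the lemmas in the data unchanged.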

I would rather fix the tool chain

Yes, if you can fix the tool itself, it is even better than writing a wrapper for it (I thought these were 3rd-party tools).

I also remember I hated this decision even then

My memory is bad (but I have an excuse - it is almost 20 years since I stopped using TreeTagger): I was perhaps not very happy with the TreeTagger decision to use the LaTeX escapes in quotation mark lemmas, but what I hated was the decision of some taggers to accept only LRB and RRB as forms of left/right round bracket even if they didn't use the PennTB format anymore. Lemma could be considered an arbitrary id of a lexeme, after all.

amir-zeldes commented 1 year ago

to accept only LRB and RRB as forms of left/right round bracket

Well, essentially it's just a form of escaping, and it's necessary if you're using brackets to express a PTB tree in the native bracketing format. But I too prefer conllu ;)

Note that <elem key="one < two"> and <elem key="one & two"> are not valid XML snippets

True, though oddly ">" is allowed. Ampersand is needed due to entity replacement text, so that is sort of clear. But the format I'm talking about isn't actually XML - it's the CWB vertical format, which is a type of SGML with very specific restrictions (SGML is needed for various kinds of GUM markup due to nesting conflicts, where a proper XML alternative would be much messier).

rhdunn commented 1 year ago

> is allowed in XML as it doesn't mark the start of a token, or signify the end of an attribute. < could possibly be permitted in XML attributes as it is clear that it is not starting an element, but in a text body of an element it starts a new element or other instruction. IIRC, HTML allows < in quoted attributes for that reason.

UniversalDependencies / docs

Inconsistent lemmatization of English punctuation #997