UniversalDependencies / UD_English-EWT

English data
Creative Commons Attribution Share Alike 4.0 International

Is "bear" the right lemma for "born"? #399

Closed (lapalme closed this issue 1 year ago)

lapalme commented 1 year ago

In the training corpus, there are 23 occurrences of "born", all of which have been tagged with the lemma "bear" and the features "Tense=Past|VerbForm=Part". I think the past participle of "bear" should be "borne".
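For anyone who wants to reproduce a count like this, here is a minimal sketch in plain Python. It assumes the standard 10-column, tab-separated CoNLL-U layout (FORM in column 2, LEMMA in column 3, UPOS in column 4, FEATS in column 6). The function name `count_annotations` and the toy sentence are made up for illustration, and the file name in the usage comment is an assumption about the repository layout.

```python
from collections import Counter


def count_annotations(conllu_lines, form):
    """Tally (lemma, upos, feats) triples for tokens whose FORM matches `form`."""
    counts = Counter()
    for line in conllu_lines:
        line = line.rstrip("\n")
        # Skip sentence separators and comment lines such as "# text = ...".
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        if cols[1].lower() == form:
            counts[(cols[2], cols[3], cols[5])] += 1
    return counts


# Toy data invented for this example; it is not taken from the corpus.
SAMPLE = [
    "# text = She was born",
    "1\tShe\tshe\tPRON\tPRP\t_\t3\tnsubj:pass\t_\t_",
    "2\twas\tbe\tAUX\tVBD\t_\t3\taux:pass\t_\t_",
    "3\tborn\tbear\tVERB\tVBN\tTense=Past|VerbForm=Part\t0\troot\t_\t_",
    "",
]

print(count_annotations(SAMPLE, "born"))

# Hypothetical usage on a real file (path is an assumption):
# with open("en_ewt-ud-train.conllu", encoding="utf-8") as f:
#     print(count_annotations(f, "born"))
```

Grouping by the full (lemma, UPOS, FEATS) triple rather than by lemma alone makes it easy to see whether all occurrences really share one analysis.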

This is perhaps related to the expression "to bear a child" as in "She has borne a child".

What is the rationale for not lemmatizing with the verb (or adjective) "born"?

Thanks

nschneid commented 1 year ago

There's a spelling distinction between "born" and "borne" depending on the sense, but etymologically it's the same word derived from the verb "bear". https://www.etymonline.com/word/borne?ref=etymonline_crossreference

I see that some dictionaries tag "born" in the birth sense as an adjective. I wouldn't be opposed to changing to ADJ given the modern usage where "born" is restricted to an adjective-like distribution. This query suggests that "born" in the perfect construction has become quite rare, for instance.

amir-zeldes commented 1 year ago

In terms of facts I certainly wouldn't strongly argue that 'born' is a verb form, but in terms of English corpora in practice, it seems to be consistently tagged as VBN and part of a passive VP for "be born" in PTB and OntoNotes:

So even though it's a little silly, maybe it's best to just leave it alone for consistency?

lapalme commented 1 year ago

So we are stuck with this "bug" (or feature!) for eternity! Given the number of NLP systems that treat these corpora as "gold" for learning and evaluation, it looks like existing annotations have become the new norm in grammar...

amir-zeldes commented 1 year ago

I don't know about eternity... I do see your point here, but corpora will inevitably include some guidelines which are in place for reasons of consistency, and in items which are not completely grammaticalized (people can certainly still "bear children"), there is no obvious way to decide which side to put them on.

The same can be said, for example, for "X is based on Y", which is overwhelmingly more frequent than "Y bases on X", and still we treat it as a passive participle. Again, I'm not saying that "born" is not very often more like an adjective, but I'm not sure there is much point in going after this one lexeme when there are lots of others that sit somewhere between opaque adjective and transparent participle.

nschneid commented 1 year ago

Yeah, English UD doesn't really have a definitive theory of lemmas, just some conventions about capitalization etc. and a preference to defer to established tagging decisions where possible.

One could imagine a more deliberate policy like following WordNet for lemmas, but I guess I'm willing to accept that no matter what we do some lemmas will be debatable or surprising to users. Inconsistencies between UD English corpora, however, are worth looking at because they make the corpora harder to use collectively.

amir-zeldes commented 1 year ago

> Inconsistencies between UD English corpora, however, are worth looking at because they make the corpora harder to use collectively.

Yes, if we are in the minority regarding bear, I would change it in a heartbeat!
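A check for whether one corpus is in the minority could be sketched roughly as follows. This is only an illustration under stated assumptions: it expects standard 10-column CoNLL-U input, the helper names `lemma_table` and `dominant_analysis` are hypothetical, and the two toy corpora are invented. They do not reflect how any real UD English treebank annotates "born".

```python
from collections import Counter


def lemma_table(corpora, form):
    """Map each corpus name to a Counter of (lemma, upos) pairs for `form`.

    `corpora` is a dict: name -> iterable of CoNLL-U lines.
    """
    table = {}
    for name, lines in corpora.items():
        counts = Counter()
        for line in lines:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue  # sentence break or comment
            cols = line.split("\t")
            if len(cols) != 10 or "-" in cols[0] or "." in cols[0]:
                continue  # malformed line, multiword range, or empty node
            if cols[1].lower() == form:
                counts[(cols[2], cols[3])] += 1
        table[name] = counts
    return table


def dominant_analysis(table):
    """Return the most frequent (lemma, upos) per corpus, skipping corpora with no hits."""
    return {
        name: counts.most_common(1)[0][0]
        for name, counts in table.items()
        if counts
    }


# Invented toy corpora, purely to show the shape of the comparison.
TOY_CORPORA = {
    "corpus_a": [
        "1\tborn\tbear\tVERB\tVBN\tTense=Past|VerbForm=Part\t0\troot\t_\t_",
        "",
        "1\tBorn\tbear\tVERB\tVBN\tTense=Past|VerbForm=Part\t0\troot\t_\t_",
    ],
    "corpus_b": [
        "1\tborn\tborn\tADJ\tJJ\tDegree=Pos\t0\troot\t_\t_",
    ],
}

for name, analysis in dominant_analysis(lemma_table(TOY_CORPORA, "born")).items():
    print(name, analysis)
```

Corpora whose dominant analysis differs from the rest would then be the candidates for harmonization.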