UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Minimal guidelines for lemmas #172

Closed by spyysalo 9 years ago

spyysalo commented 9 years ago

The documentation still lacks a guideline for lemmas (see e.g. http://universaldependencies.github.io/docs/u/overview/morphology.html, http://universaldependencies.github.io/docs/format.html) and the data has inconsistencies even in basic use of the LEMMA field.

Although this is late (sorry!), I was wondering if we could introduce some minimal guideline for lemmas and at least attempt to improve consistency a bit for v1.1.

To open the discussion, perhaps something like this could work:

1. LEMMA should contain the canonical, uninflected form of the word, such as the form typically found in dictionaries.

2. LEMMA should always be filled, but if not available, an underscore ("_") can be used to indicate its absence.

3. The LEMMA field should not be used to encode features or other properties of the word (use FEATS and MISC instead).

The 2nd point documents a convention already used by many treebanks in v1 (de en es fr sv). The 3rd addresses an exceptional usage limited to UD Czech in v1.

Of v1 corpora, only UD Czech usage diverges from this minimal guideline, so input from @hajic and @dan-zeman would be particularly appreciated. Do you think something like this could work for UD Czech also?

(See also previous discussion for v1: #115)

dan-zeman commented 9 years ago

Thanks for noticing this. In fact, the Czech lemmas in PDT contain other information that is much more feature-like and that I already removed (converted to FEATS values) in v1.0. I think only two/three sort-of "features" remain:

TomazErjavec commented 9 years ago

On 07/05/2015 at 15:51, Sampo Pyysalo wrote:

> LEMMA should contain the canonical, uninflected form of the word, such as the form typically found in dictionaries.

Better "the canonical or base form" - you can't say "uninflected" because every wordform (even if base form) is inflected.

> LEMMA should always be filled, but if not available, an underscore ("_") can be used to indicate its absence.

No use saying "should" if you then give an out. Maybe just say "If the lemma is not available, ..."

> The LEMMA field should not be used to encode features or other properties of the word (use FEATS and MISC instead).

As you note, Czech (CS) deviates from this. I do not have good advice, except to consider that an "extended" lemma can enable you to link to the correct lexical entry, which a "bare" lemma does not (necessarily). So it would seem a shame to throw away this information where it is already available.

spyysalo commented 9 years ago

@dan-zeman : Thanks for the quick response! Just to clarify, I'm not suggesting that any of this information be removed, only that it be represented in fields other than LEMMA.

I don't fully see why the index distinguishing homonyms could not be moved (e.g.) to MISC, but in any case I think it would be a consistency improvement to move those pieces of information you agree can be moved. (Perhaps the remainder could be documented as a language-specific diff?)

spyysalo commented 9 years ago

@TomazErjavec : Thank you for the comments! I agree and suggest the following improved version:

1. LEMMA should contain the canonical or base form of the word, such as the form typically found in dictionaries.

2. If the lemma is not available, an underscore ("_") can be used to indicate its absence.

3. The LEMMA field should not be used to encode features or other similar properties of the word (use FEATS and MISC instead).
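
As an illustration of how a consumer of the data might apply the underscore convention, here is a minimal Python sketch. It assumes the 10-column, tab-separated CoNLL-U token line with LEMMA as the third column. The function name `lemma_of` and the sample lines are hypothetical, and treating a token whose FORM is itself "_" as having a real lemma is an assumption of this sketch, not something the guideline states.

```python
def lemma_of(token_line):
    """Return the LEMMA of one CoNLL-U token line, or None if it is absent.

    Assumes 10 tab-separated columns:
    ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL DEPS MISC
    """
    cols = token_line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError(f"expected 10 columns, got {len(cols)}")
    form, lemma = cols[1], cols[2]
    # Assumption of this sketch: "_" means "no lemma" unless the word form
    # itself is a literal underscore.
    if lemma == "_" and form != "_":
        return None
    return lemma


# Hypothetical sample lines, not taken from any treebank.
with_lemma = "1\tdogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_"
without_lemma = "1\tdogs\t_\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_"
print(lemma_of(with_lemma))     # dog
print(lemma_of(without_lemma))  # None
```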

dan-zeman commented 9 years ago

@spyysalo : "I don't fully see why the index distinguishing homonyms could not be moved..."

Technically it could of course. But I don't think it would make sense. For me the lemma is the identifier of the lexeme / dictionary entry. So it should be unique to just one entry.

spyysalo commented 9 years ago

> For me the lemma is the identifier of the lexeme / dictionary entry. So it should be unique to just one entry.

OK, thanks for the clarification. I appreciate your position, but (unless I'm mistaken) UD Czech is the only UD treebank to adopt this use so far, and I think it's unlikely that we would be able to standardize on this due to the demands of introducing this information to the other treebanks.

How about the option of moving the other info to FEATS or MISC, documenting the index as a language-specific diff, and adopting (something like) the definition as the minimal guideline?

TomazErjavec commented 9 years ago

On 07/05/2015 at 16:25, Dan Zeman wrote:

> @spyysalo : "I don't fully see why the index distinguishing homonyms could not be moved..."
>
> Technically it could of course. But I don't think it would make sense. For me the lemma is the identifier of the lexeme / dictionary entry. So it should be unique to just one entry.

Well, I (also) see the reasoning why the "lemma" field (maybe a misnomer) should have as its value just the base form: you can then e.g. use it to train a lemmatiser model, without some ad hoc convention telling you how to split the base form from the lemma identifier. Maybe the idea that this should be encoded in features (even "lemma='put_2'") is not too bad. Also, most languages don't use a lemma identifier anyway, so why clutter up the system?

fginter commented 9 years ago

I think from a technical usability point of view, it would be useful to either move the disambiguating information into the MISC field and have LEMMA contain the baseform only, or alternatively agree on a divider, so the base form can be retrieved for all languages in the same manner from the lemma field.

dan-zeman commented 9 years ago

@TomazErjavec : If I train a lemmatizer model I will definitely not strip the identifier. I will want the model to learn to distinguish between jen-1 and jen-2. Not just because of different meanings. These are completely different words with different morphological paradigms. (In this particular case, even their part of speech differs: one is a particle, the other a noun. But in other cases there are homonyms that have the same part of speech but different paradigms.)

Adopting @spyysalo 's guideline and documenting the identifier as a language-specific diff seems a good temporary solution to me.

But (thanks to @hajic for mentioning this in another thread) there is a more general underlying issue of linking from corpora to dictionaries, subcategorization frames, ontologies etc., and we should find a more principled way of dealing with this in the next version of the UD standard. FEATS is not a good column for that and MISC is probably not too good either (in the long term), so we may later want to add a column called EXTLINKS or something...

fginter commented 9 years ago

Very good observation. We will be inserting PropBank data into UD_Finnish and this is of great relevance. I planned to use MISC, though. Adding another column to the format should IMHO be a last resort solution because it will basically break all tools written to parse the format.

dan-zeman commented 9 years ago

On the other hand: while for a lemmatizer I would want to have the disambiguated output, I can imagine an experiment trying to learn the morphological rules of the language, i.e. how each word form is derived from the base form. In that case, it would be useful to know that the lemma does not contain additional identifiers. (But then it would be even more useful to have a stem instead of a lemma...)
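
To make the concern concrete, here is a rough Python sketch of one very simple way to derive a suffix rule from a (lemma, form) pair: strip the longest shared prefix and keep what remains on each side. The function name `edit_rule` and the English sample pairs are hypothetical. This is only one naive notion of a "morphological rule", chosen to show how an identifier left in the lemma leaks into the learned rule.

```python
import os


def edit_rule(lemma, form):
    """Return (lemma_suffix, form_suffix) after removing the shared prefix."""
    shared = os.path.commonprefix([lemma, form])
    return lemma[len(shared):], form[len(shared):]


print(edit_rule("dog", "dogs"))   # ('', 's')
print(edit_rule("try", "tries"))  # ('y', 'ies')
# With a homonym index left in the lemma, the index becomes part of the rule:
print(edit_rule("jen-1", "jen"))  # ('-1', '')
```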

fginter commented 9 years ago

@dan-zeman : this is what I was aiming at when I said we should at least make sure there is a standardized delimiter so the extra information can be stripped. Stem, base form, lemma: these will probably mean different things to different people... :)
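
A minimal Python sketch of what stripping by delimiter could look like, assuming (hypothetically) that the convention is "base form, hyphen, numeric index", as in the jen-1 / jen-2 examples mentioned earlier. The function name `split_lemma` and the regular expression are illustrative only; no delimiter has been agreed on in this thread.

```python
import re

# Hypothetical convention: base form, a hyphen, then a numeric homonym index.
_INDEXED_LEMMA = re.compile(r"^(?P<base>.+)-(?P<index>\d+)$")


def split_lemma(lemma):
    """Split an extended lemma into (base_form, index); index is None if absent."""
    match = _INDEXED_LEMMA.match(lemma)
    if match is None:
        return lemma, None
    return match.group("base"), int(match.group("index"))


print(split_lemma("jen-1"))       # ('jen', 1)
print(split_lemma("put"))         # ('put', None)
print(split_lemma("well-known"))  # ('well-known', None)
```

Note that a plain hyphen is ambiguous: a genuine lemma ending in a hyphen plus digits would be split incorrectly, which is one reason an explicitly agreed delimiter (or a separate field) would be safer.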

spyysalo commented 9 years ago

@dan-zeman : thanks, great to have your support!

In partial summary of the above, I take it that there is tentative consensus from current participants here (@dan-zeman @TomazErjavec @fginter @spyysalo) to adopt the following minimal guideline and document exceptions as language-specific diffs:

1. LEMMA should contain the canonical or base form of the word, such as the form typically found in dictionaries.

2. If the lemma is not available, an underscore ("_") can be used to indicate its absence.

3. The LEMMA field should not be used to encode features or other similar properties of the word (use FEATS and MISC instead).

More input very welcome!

(I propose to incorporate the latest iteration of the guideline into the UD docs early next week if no disagreements arise.)

dan-zeman commented 9 years ago

At the end of the day I decided to move all the lemma extensions to the MISC column (https://github.com/UniversalDependencies/UD_Czech/commit/780392e8502c2a2319950b4c9a442eea5988a445).
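
For readers who want to picture this kind of conversion, here is a hedged Python sketch. It is not the code from the commit linked above. It assumes a 10-column CoNLL-U token line, a hypothetical "base-index" lemma shape, a hypothetical MISC key `LemmaId`, and "|" as the separator between MISC attributes.

```python
import re

# Hypothetical shape of an extended lemma: base form, hyphen, numeric index.
_INDEXED_LEMMA = re.compile(r"^(.+)-(\d+)$")


def move_lemma_id_to_misc(token_line, key="LemmaId"):
    """Keep only the base form in LEMMA and record the full lemma in MISC."""
    cols = token_line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError(f"expected 10 columns, got {len(cols)}")
    lemma, misc = cols[2], cols[9]
    match = _INDEXED_LEMMA.match(lemma)
    if match is None:
        return "\t".join(cols)  # nothing to move
    cols[2] = match.group(1)
    attribute = f"{key}={lemma}"
    # Assumption: MISC holds "|"-separated attributes, with "_" meaning empty.
    cols[9] = attribute if misc == "_" else f"{misc}|{attribute}"
    return "\t".join(cols)


# Hypothetical token line, not copied from UD Czech.
line = "1\tjen\tjen-1\tPART\t_\t_\t0\troot\t_\t_"
print(move_lemma_id_to_misc(line).split("\t")[2])  # jen
print(move_lemma_id_to_misc(line).split("\t")[9])  # LemmaId=jen-1
```

Storing the full original lemma under the key, rather than the index alone, keeps the conversion reversible without having to know the delimiter.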

spyysalo commented 9 years ago

@dan-zeman : great, thanks! Implemented as suggested.