Insight into when to create a class of MWT vs not?

UniversalDependencies / docs

Universal Dependencies online documentation

http://universaldependencies.org/

Apache License 2.0

Insight into when to create a class of MWT vs not? #1006

Closed AngledLuffa closed 6 months ago

AngledLuffa commented 11 months ago

Looking at the documentation for MWT

https://universaldependencies.org/format.html#words-tokens-and-empty-nodes

There is an explanation of what MWT are, but not really when to apply MWT vs not. I think it might be useful to have some explanation of when labeling something with MWT would be the appropriate choice as opposed to keeping it as a single token. For example, in English we have n't suffixes split from words, but not ab or un as a prefix, even though those also mean not. It might be that there's an element of "is this a word by itself", although that isn't always the case, such as in the possessive 's

nschneid commented 11 months ago

"n't" and "'s" are clitics, i.e. they are somewhere between bound affixes and full phonological words. It is not based on the meaning. https://universaldependencies.org/u/overview/tokenization.html notes that MWTs are suitable for clitics.

amir-zeldes commented 11 months ago

I think it's also a separate question whether something should be tokenized apart (in English: n't -> yes, un- ->no) and whether the split parts should be unified under an MWT. For the tokenization criterion, there is a lot of literature on what 'words' are and aren't, but in the context of corpus annotation it's a question of the level of description targeted by the token layer. For words, interruptability and mobility are often key criteria. In the case of UD, tokens must carry POS tags, so it makes sense for tokens to be the things that have POS tags.

I think for the 's this is forced in English due to separability ("the man I saw yesterday's dog"), while for "n't", the identity of lemma and POS with the separately spelled case forces the issue: if "not" is an ADV, it's unclear why "n't" wouldn't be. Meanwhile "un-" has no POS, cannot appear by itself, and cannot be separated from its host via another token, so English morphology views it as a clear affix and not a word. More borderline cases in English are "-free" (lead-free) or "-wise" (e.g. income-wise), which can sometimes be spelled apart. These are sometimes given a separate term (affixoid or similar) and in corpora only get tokenized/tagged when they are separated by whitespace.

Finally as mentioned in the page Nathan linked, it's my impression MWTs are often used to unite tokens where either one is a clitic despite opacity (English), when phonological fusion results in non-opaque reduced forms (French "au") or in languages where whitespace separates complex units, usually with one content item surrounded by function words, or with incorporation, and this shows up for example in languages of the Middle East and parts of Africa (Arabic لعمنا li-ammi-na, "to-our-uncle", similarly Hebrew, Coptic). Not sure if there are other prominent/common use cases.

AngledLuffa commented 11 months ago

Thanks for the clarifications. Although I think this is saying tokens & words backwards:

In the case of UD, tokens must carry POS tags, so it makes sense for tokens to be the things that have POS tags

Would it be worth adding some of this analysis to the tokenization page? I can do that as a PR, I suppose

amir-zeldes commented 11 months ago

People can't seem to agree on which is which... For me, tokenized corpora predate UD, and tokens have POS tags. This is true in PTB as well, where MWTs don't exist. So I call the small things tokens and the big things MWTs. I try to avoid the term 'word' in a technical sense. Feel free to PR but I think the page more or less already says that?

nschneid commented 11 months ago

In the guidelines, "token" means orthographic word and "word" means syntactic word (i.e. something with a deprel).

jnivre commented 11 months ago

Yep, that’s right! This is of course stipulative, but it has been the official policy of UD from the start, so I think we have to stick to it.

amir-zeldes commented 11 months ago

Yes, I've run into this before, apologies for the confusion. I definitely have no expectation of this changing, but I find it pretty disorienting, since I also work with a lot of people who use and build non-UD corpora, and they generally understand "word" to be something space-delimited and not formally defined, and tokens to be the things that corpora annotate (e.g. with POS tags) and which are used to measure their sizes, e.g. "the WSJ Corpus has 1,209,785 tokens" (=things that have POS tags).

FWIW, the UD usage of tokens is definitely contrary to what you find in other reference works, e.g. the Handbook of Corpus Linguistics chapter on "Tokenizing and part-of-speech tagging":

https://www.degruyter.com/document/doi/10.1515/9783110211429.toc/html

And in general in NLP and parsing, tokenization refers to separating the units that get tagged and combined in trees, not to whitespace tokenization of bigger words, which is sometimes a separate preprocessing step. Other terms sometimes found are "supertokens" (for things like MWT) and "subtokens", for things smaller than the units that get tagged. Some libraries, like udapi, do use the expected UD terminology of course, where nodes have 'words', but I generally find this causes confusion when collaborating with non-UD folks.

sylvainkahane commented 11 months ago

I think that we must separate terms concerning linguistic notions from terms concerning the annotation. "Token" is not a term used by linguists; it is related to the annotation. Not using this term to name the basic units of the annotation would be too strange. And deprels in a dependency-based annotation hold between tokens, whatever these units are.

But now, from the linguistic point of view, we can choose various units as tokens. Standard UD annotation is lexeme-based (including tokens for punctuation signs). But we develop treebanks based on other tokenisations. For instance, when we annotate French, we use the orthographic words as tokens at first, which means that au is a token annotated ADP. In a second step, with a Grew rule, we decompose au into two lexemes (à/ADP le/DET) and we have another treebank with another tokenisation (how can I say that if you pre-empt the term token?). On the other hand, for Beja, we started with morph-based tokenisation, because our input was an IGT (interlinear glossed text). We annotated this and, in a second step, we changed the tokenisation (again with Grew rules) and obtained a lexeme-based (or word-based) treebank.
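As a rough illustration of the decomposition step described above: the sketch below is a plain-Python stand-in, not Grew and not the actual rule used for the French treebanks. The `CONTRACTIONS` table and the `expand` function are hypothetical names made up for this example, the sample sentence and its tags are assumptions, and the output rows keep only the ID, FORM and UPOS columns of CoNLL-U.

```python
# Hypothetical contraction table: orthographic form -> list of (form, UPOS).
CONTRACTIONS = {
    "au": [("à", "ADP"), ("le", "DET")],
}


def expand(tagged):
    """Turn (form, upos) pairs into simplified CoNLL-U rows (ID, FORM, UPOS).

    A contraction becomes a range line (e.g. "3-4 au _") followed by one row
    per part; every other item becomes a single row.
    """
    rows = []
    next_id = 1
    for form, upos in tagged:
        parts = CONTRACTIONS.get(form.lower())
        if parts is None:
            rows.append((str(next_id), form, upos))
            next_id += 1
        else:
            last_id = next_id + len(parts) - 1
            # Range line for the orthographic unit; it carries no UPOS itself.
            rows.append((f"{next_id}-{last_id}", form, "_"))
            for part_form, part_upos in parts:
                rows.append((str(next_id), part_form, part_upos))
                next_id += 1
    return rows


# First pass: orthographic words as units, with au tagged ADP.
first_pass = [("je", "PRON"), ("vais", "VERB"), ("au", "ADP"), ("lit", "NOUN")]
for row in expand(first_pass):
    print("\t".join(row))
```

The second tokenisation keeps the orthographic unit as a range line, so both segmentations stay recoverable from the same file.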

amir-zeldes commented 11 months ago

@sylvainkahane Yes, this all sounds fine to me - I suspect you are using "token" the same way I was taught to use it, meaning "the smallest unit of analysis", and tokenization means "breaking data into those units".

I agree that we can talk about multiple or even conflicting tokenizations. But the UD term "token" seems to refer to the smallest level when there is no MWT, or the MWT level when there is a MWT. The term "word" is used in UD for the things that have POS tags. Since those are the smallest units of analysis in UD (not counting use of MSeg or similar within sub-tokens), many non-UD people think that "token" would mean the POS-bearing units (single number IDs, whether inside a MWT or not). As for "word", I think non-UD people typically think of orthographic units (whitespace separated), and do not consider punctuation marks, reconstructed sub-parts of MWTs or fused sub-forms to be words (e.g. the English UD "word" wo in the MWT won't). In non-UD corpora, wo is called a token, but not a word.

Stormur commented 11 months ago

I think that a big part of the confusion comes from fields where there is not so much consideration for the linguistic aspect and so the formal term of token is used for just about everything. I feel UD's usage is quite spot-on in reserving the linguistic term word for the more linguistic level of analysis (even if it includes punctuation marks and the like), and token for the more formal (orthographic) one. It is better than talking about "super-" or "subtokens", which does not remove their vagueness.

This use becomes counterintuitive only when too much meaning is attached to such an ultrageneric term which just stands for "unit", whose definition needs to be given each time. This probably depends on how the concept is taught: dogmatic ("a token is just this"), according to personal experience, vs definitorial ("a convenient term for a unit which needs to be defined"). Same considerations are valid for word.

sylvainkahane commented 11 months ago

I think it is time to change the terminology and clarify it. UD started with mostly NLP people, but we are now involving more and more linguists. Linguistics has its terminology and NLP has its own. We must avoid terms such as "multi word token" or "multi token word" (see https://universaldependencies.org/u/overview/tokenization.html) because they mix two terminologies. The term "word" is already used by linguists (and ordinary people) to name "orthographic words" or "word-forms" (that is, words from a syntactic/phonological point of view). (There is also a well-known difficulty in defining exactly what a word-form is, but that is not the problem here; let us suppose we know how to define a word-form for every language.) We cannot give a third sense to this term, it will kill everybody. The UD tokenisation is not word-based, it is lexeme-based. In the page on tokenisation, there is the French example au /o/. This is what Mel'čuk calls a megamorph. Its signifier has only one phoneme and cannot be segmented. But we can consider that it merges two lexemes (à le 'to the'). It is a multi-lexeme megamorph. This megamorph is also a word in both senses of the term (orthographic word and word-form). We can also say that au is a multi-lexeme word. This is linguistics and this is independent of UD. The term "token" must not be used here. We were doing linguistics. Now UD, which is an annotation scheme, makes its choice and introduces additional notions. One important notion concerns the vertices in the dependency graph. We cannot use the term "word" for that, it is a linguistic term already used. The best choice seems to be "token".

nschneid commented 11 months ago

How about we adopt the term "syntactic word" shortened to "sword". 🗡️ ;)

In all seriousness, it seems like this would fall under the purview of UniDive, and could be pursued there. I would love for CL & MWE folks to converge on a more careful set of terminology for different notions of wordhood (that would be more in line with the rest of linguistics). From my perspective on the UD end of things, I worry that a change to core technical terminology would introduce more confusion than it solves.

P.S. I love the term "megamorph"!

Stormur commented 11 months ago

Probably this deserves to be discussed. But going from token/word to things like "multi-lexeme megamorph" only serves to truly create mayhem in my opinion.

The UD tokenisation is not word-based, it is lexeme-based.

I do not quite understand this statement, because the whole point is exactly this: "word" and "token" are general terms which need to be defined every time. UD has its version of word which I think is very straightforward and fits e.g. with the "syntactic/phonological point of view". This statement only makes sense given some previously fixed definition of "word", which however does not appear to be that of UD, so it is moot.

One important notion concerns the vertices in the dependency graph. We cannot use the term "word" for that, it is a linguistic term already used. The best choice seems to be "token".

The most sensible and neutral term to use when considering the (mathematical) graph structure is node (or vertex, yes). But if we are considering the linguistic material, I see no problem in speaking about words: the two coincide, at different levels, since syntactic words are represented by nodes, and nodes are only syntactic words in UD's formalism. While token is at another level still.

It is just a matter of giving definitions and understanding that the same things might be named differently, though equivalently according to the adopted point of view.

How about we adopt the term "syntactic word" shortened to "sword". 🗡️ ;)

Yes!

sylvainkahane commented 11 months ago

I don't understand why we should introduce a new term such as "syntactic word" for what is generally called a "lexeme". Moreover if you take the French sentence "je vais au lit" 'I go to the bed', and you ask a linguist or anybody else (except maybe someone from UD ;) how many (syntactic) words the sentence has, they will answer 4. Nobody will say that "à" and "le" are words in this sentence. But the word "au" merges the two lexemes "à" and "le". Is there a problem I don't see with the term "lexeme"?

Stormur commented 11 months ago

Because "syntactic word" does not mean "lexeme", at least not in UD.

Sorry if I repeat known definitions, but in broad terms:

a syntactic word in UD is that specific occurrence which is represented as a node in the tree representation and which enters into specific dependency relations with other similar elements.
a lexeme is a set of form types which are considered together for various reasons (semantics, morphology, ...).

They are completely different concepts: a single element vs a set. I have no problem with the use of the term "lexeme", given this usual distinction, or any other useful definition.

Anyway, the whole discussion just confirms that there is always the need to define these concepts clearly, and then the definition can be whatever it is as long as it is reasonable. There is no Platonic, hyperuranic fixed definition of "word", "token", or "lexeme".

if you take the French sentence "je vais au lit" 'I go to the bed', and you ask a linguist or anybody else (except maybe someone from UD ;) how many (syntactic) words the sentence has, they will answer 4. Nobody will say that "à" and "le" are words in this sentence. But the word "au" merges the two lexemes "à" and "le".

I am pretty sure there will be many linguists answering five, or at the least (to the despair of all non-linguists in the room - true story) going into the painstaking details of why from one point of view we might say 4, from another 5, from yet another maybe even 3 (don't we have some arguments in favour of considering je vais a word? Surely somebody has come up with the idea at some point)... The fundamental point: it depends on the point of view.

amir-zeldes commented 11 months ago

I don't love any of the terms, but I think "syntactic word" is an improvement over word, since newcomers to the project will be better able to understand that it's not the same as what they understand by "word". I agree with @sylvainkahane that no one thinks "au" is two words, just like no one thinks "wo" is a word in English "won't". Token works well for these when teaching Intro NLP, because the students don't yet have a fixed notion of what that means, but I also agree with @Stormur that strictly speaking, token just means "an instance of the unit of analysis", so it is a contextually defined term.

sylvainkahane commented 11 months ago

Because "syntactic word" does not mean "lexeme", at least not in UD.

@Stormur Can you give me a counter-example?

The term "lexeme" is indeed traditionally defined as a set of inflected forms, called lexes by certain linguists. Exactly as a morpheme is a set of morphs. The form won't merges an occurrence of the lexeme WILL with one of the lexeme NOT; the French form au merges an occurrence of the lexeme À with one of the lexeme LE. UD trees, as well as traditional dependency trees, are lexeme-based.

dan-zeman commented 11 months ago

I don't understand why we should introduce a new term such as "syntactic word"

It is not a new term. Richard Sproat (Computational Morphology, 1992, p. 69): ... it is a fairly traditional observation in morphology that there are really two kinds of words from a structural point of view, namely phonological words and syntactic words. These two notions specify overlapping but nonidentical sets of entities, in that something which is a phonological word may not count as a single word from the point of view of the syntax (Matthews 1974, p. 32). (The full reference is Matthews P. 1974. Morphology. Cambridge University Press.)

sylvainkahane commented 11 months ago

Yes, it is well known that phonological criteria do not define the same units as syntactic criteria. For instance, clitics are words from the syntactic point of view but not from the phonological point of view. See Haspelmath (2011) for a review of criteria for the definition of words and Tallman (2020) for an interesting discussion on the graduality of the notion of wordhood (the more criteria you consider, the more restrictive the notion of word is). In some sense, the fact that some authors use the term "syntactic word" to clearly distinguish word-forms from phonological words could be an additional problem. I don't know what is called a "syntactic word" in @dan-zeman's citation and whether or not it includes units such as won't or au. But if you prefer to use "syntactic word" rather than lexeme, let's do it. It means that "syntactic word" is used in UD for the forms of a lexeme. And when two lexemes have merged in one form, the "syntactic word" will be the part of the merged form supposed to belong to the lexeme (such as wo in won't or na in gonna) or a reconstructed form of the lexeme (such as le in au). It would be nice to introduce the term "lexeme" in the UD pages to make clear what you want to call a syntactic word.

martinpopel commented 11 months ago

It would be nice to introduce the term "lexeme" in the UD pages to make clear what you want to call a syntactic word.

I think syntactic words (and the fact that in UD they are called just words for short) are quite clearly described in the Tokenization and Word Segmentation guidelines and in the relevant part of the CoNLL-U description. We all agree that both token and word can have many other meanings in other areas (most notably, in modern NLP, tokens mean subword units trained with BPE/Sentencepiece/...), but in UD the meaning has been clearly defined this way from the beginning and I don't see much sense in endless discussions about it when it would be very difficult (I would say impossible) to change (thousands of references, APIs,...). So I would just say that newcomers should be recommended to read the UD guidelines, including the two pages I linked above.

I was very surprised to learn that even the term lexeme has more meanings than I expected. In my linguistics classes, I learned the definition described e.g. at https://en.wikipedia.org/wiki/Lexeme (perhaps similarly to @Stormur). I have never heard the definition mentioned by @sylvainkahane, which seems to fit rather to particular instances of word forms or even to nodes in dependency trees. Most probably, I don't understand this definition correctly, but it seems obvious there are big differences in the two definitions of a lexeme. For example, I would not say lexeme "le". The lexeme with lemma le is a set of word forms including le, les, la, l'.

For this reason, I don't think it is a good idea to complicate the guidelines by introducing the term lexeme (especially, in this non-Wikipedia definition).

Stormur commented 11 months ago

Because "syntactic word" does not mean "lexeme", at least not in UD.

@Stormur Can you give me a counter-example?

The term "lexeme" is indeed traditionally defined as a set of inflected forms, called lexes by certain linguists. Exactly as a morpheme is a set of morphs. The form won't merges an occurrence of the lexeme WILL with one of the lexeme NOT; the French form au merges an occurrence of the lexeme À with one of the lexeme LE. UD trees, as well as traditional dependency trees, are lexeme-based.

I would not say it is appropriate to speak of counterexamples, since we are dealing with different entities by definition.

Given any sentence, like

This isn't a counterexample.

in a typical UD analysis we encounter the 5 syntactic words This, is, not, a, counterexample (represented by 4 tokens). The syntactic word This is a node in the graph representation and a row with a given formatting of its linguistic annotation in the CoNLL-U representation. It is a single occurrence of what we can identify as a lexeme labeled as THIS which usually is taken to comprise forms this and these, with their respective linguistic properties, maybe also that and those, maybe also more aberrant graphical variations like theese and whatnot, etc.

Syntactic word: a single occurrence in context
Lexeme: a set of linguistic entities, usually types, considered together as a more abstract concept

In a sense, each syntactic word/node in a syntactic tree/row in a CoNLL-U file is the instantiation of a lexeme in context. The lexeme is a superordinate entity with respect to the single words.
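To make the counts concrete, here is a rough sketch of how the example sentence might look as CoNLL-U rows and how tokens and syntactic words could be counted from them. The lemmas and UPOS tags are assumptions made for the illustration, `count_units` is a hypothetical helper, and only four columns are kept. Punctuation is filtered out by default only to match the counts given above; with the final period included, the same rows give 5 tokens and 6 words.

```python
# Simplified CoNLL-U rows: (ID, FORM, LEMMA, UPOS). Tags are assumptions.
ROWS = [
    ("1", "This", "this", "PRON"),
    ("2-3", "isn't", "_", "_"),       # multiword token spanning words 2 and 3
    ("2", "is", "be", "AUX"),
    ("3", "n't", "not", "PART"),
    ("4", "a", "a", "DET"),
    ("5", "counterexample", "counterexample", "NOUN"),
    ("6", ".", ".", "PUNCT"),
]


def count_units(rows, skip_punct=True):
    """Return (n_tokens, n_words), using both terms in the UD sense.

    Assumes each range line precedes the words it covers, as in CoNLL-U.
    """
    covered = set()   # word IDs that sit inside a multiword token
    n_tokens = 0
    n_words = 0
    for word_id, _form, _lemma, upos in rows:
        if "-" in word_id:
            # Range line: one orthographic token covering several words.
            start, end = map(int, word_id.split("-"))
            covered.update(range(start, end + 1))
            n_tokens += 1
        elif "." in word_id:
            # Empty node (decimal ID): not counted in this sketch.
            continue
        else:
            if skip_punct and upos == "PUNCT":
                continue
            n_words += 1
            if int(word_id) not in covered:
                # A word outside any range is also a token on its own.
                n_tokens += 1
    return n_tokens, n_words


print(count_units(ROWS))                    # (4, 5)
print(count_units(ROWS, skip_punct=False))  # (5, 6)
```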

I would not say that UD dependency trees are lexeme-based: morphological properties and dependency relations, and the identification of "minimal syntactic units", are rather independent from the concept of lexeme. Where this concept really comes into play (and it is admittedly a very relevant one) is lemmatisation, and secondarily the choice of part of speech.

Observing French au, we can identify that it contains a relator (the ADP) and a mark of definiteness (the article). We see the syntactic independence of the two also given more transparent forms like à la, or single occurrences of each. Then, this is independent from the fact that we consider le and la as belonging to the same lexeme or not, a fact which we synthesise by assigning the same or different lemmas (you might be surprised, for example, that some Italian treebanks assign different lemmas to all forms of personal pronouns).

I am all in favour of introducing a discussion of the notion of lexeme in the guidelines somewhere: it is a useful and pervasive concept, and sometimes I feel it is not given the right importance.

sylvainkahane commented 11 months ago

@martinpopel @Stormur You're right that a lexeme and an instance of this lexeme are different things. I don't think I introduced a different definition than the classical one. For instance, the word le and the lexeme LE are different things (le = LE_sing,masc). The fact is that each node in a UD dependency tree (except punctuations) corresponds to one and only one lexeme. It is why we can say (and I say) that such a structure is lexeme-based. I think we all agree on that now. Now, instances/elements of a lexeme are traditionally called lexes. This term is not widely used and if the UD community prefers to call them syntactic words or swords, it is not a problem, as long as we make the link with the traditional linguistic terminology and the notion of lexeme and lexe.

amir-zeldes commented 11 months ago

I may be wrong, but I think @martinpopel and possibly @Stormur 's uses of lexeme are basically what UD means by 'lemma', whereas @sylvainkahane is using it similarly to 'morpheme', which is a minimal unit of meaning in a morphological structure analysis (so 'lexeme' in this usage would be the equivalent of 'morpheme' above the morphological level, hence at the syntactic level). And then 'lexe' is like 'morph' (or 'phone' vs. 'phoneme' in phonology). Or do you mean something else?

sylvainkahane commented 11 months ago

The lemma is the citation form of a lexeme. These are different entities. A lexeme is a linguistic unit. The lemma is just its conventional name. But it is true that they are often confused. People often use the term morpheme instead of morph and would say that displacement is composed of three morphemes dis, place, and ment, while they should say three morphs to be really rigorous. In the same way, it is not very rare to see lexeme used instead of lexe.

martinpopel commented 11 months ago

I think @martinpopel and possibly @Stormur 's uses of lexeme are basically what UD means by 'lemma',

Lexeme is a set of words (word forms) that are represented by the same lemma. (And for the lovers of circular definitions: lemma is a representant of a given lexeme.) So there is a 1-1 mapping between lexemes and lemmas, but these are not the same.
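A small sketch of the definition just given, where a lexeme is approximated as the set of word forms sharing a lemma. This follows only this one definition, not the others under discussion in the thread; the `(FORM, LEMMA)` pairs and the helper name `group_by_lemma` are made up for the illustration.

```python
from collections import defaultdict

# Hypothetical (FORM, LEMMA) pairs, one per syntactic word occurrence.
occurrences = [
    ("le", "le"), ("la", "le"), ("les", "le"), ("l'", "le"), ("le", "le"),
    ("à", "à"), ("vais", "aller"), ("va", "aller"),
]


def group_by_lemma(pairs):
    """Map each lemma to the set of forms it represents in the data."""
    lexemes = defaultdict(set)
    for form, lemma in pairs:
        lexemes[lemma].add(form.lower())
    return dict(lexemes)


lexemes = group_by_lemma(occurrences)
print(sorted(lexemes["le"]))   # ["l'", 'la', 'le', 'les']
print(len(occurrences), "occurrences,", len(lexemes), "lemmas")
```

Each lemma keys exactly one set, which is the 1-1 mapping mentioned above, while the number of occurrences (8 here) is independent of the number of sets (3).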

whereas @sylvainkahane is using it similarly to 'morpheme',

I guess not. There are many UD (syntactic) words consisting of multiple morphemes (although linguists never agree on the number of morphemes and their boundaries in some words); at least prefixes and suffixes should be considered separate morphemes. There are several ongoing projects which try to annotate UD with morph(eme) segmentation or even introduce dependencies among morphemes, but such projects go beyond UD.

each node in a UD dependency tree (except punctuations) corresponds to one and only one lexeme. It is why we can say (and I say) that such a structure is lexeme-based.

Similarly, each node corresponds to one and only one UPOS, deprel and line (in the CoNLL-U format), so we can say, that UD is UPOS-based, deprel-based or line-based. It would be true, but not much helpful to the readers.

I would not be surprised if there are linguists who consider even some UD MWTs a single lexeme (i.e. one of the forms of that lexeme). So I am afraid that just saying "lexeme-based" is no magical formula which makes every newcomer understand the distinction between word and tokens in UD.

sylvainkahane commented 11 months ago

@martinpopel I will try to respond seriously to your remarks even if I suppose that you were joking.

Lexeme is a set of words (word forms) that are represented by the same lemma. (And for the lovers of circular definitions: lemma is a representant of a given lexeme.)

Your definition of a "lexeme" is wrong, because you cannot define the notion of "lemma" before defining the lexeme. It is just nonsense. You may think I'm splitting hairs, but it's important if you want to understand my next remark. It is fundamental in sciences, and especially in linguistics, to avoid circularity and to be aware of the order in which the notions are defined.

A lexeme is a paradigm of words that correspond to the same lexical meaning (https://en.wikipedia.org/wiki/Lexeme). This definition supposes that the notion of word is defined before. And it is why you suppose that it is useless to introduce the notion of lexeme if we can say that our segmentation is word-based. But it is not true that the UD segmentation is word-based. It is based on a more complex notion, that @nschneid has proposed to call a "sword" and which I think is based on the notion of lexeme. In fact, the Wikipedia definition of "lexeme" is incomplete and problematic for fusional languages. Sometimes there is one element of a lexeme that does not form a word by itself, but is fused with an element of the paradigm of another lexeme. It is the case of TO in gonna, of WILL in won't, or À in au. It will be impossible (I think) to say that gonna is two swords, if you don't have the notion of paradigms of signs and if you cannot say that in to and gonna you have two occurrences of the same unit (that we call TO). So we have: word => lexeme => sword. And the UD notion of "sword" is not the traditional notion of "syntactic word", it is a more complex notion which is based on the notion of "paradigm of signs" and of "lexeme".

Similarly, each node corresponds to one and only one UPOS, deprel and line (in the CoNLL-U format), so we can say, that UD is UPOS-based, deprel-based or line-based. It would be true, but not much helpful to the readers.

This is a very unfair remark. When I say that the UD segmentation is lexeme-based, I say that you need to know what are the lexemes in your sentence to segment it. Of course, you don't need to know what are the UPOS or the deprels.

nschneid commented 11 months ago

@sylvainkahane I think you're assuming a phonological definition of "word" if you say that sometimes two lexeme-elements are fused into one word. That's fine—it's just that UD (for better or worse) calls that concept a "token".

And the UD notion of "sword" is not the traditional notion of "syntactic word", it is a more complex notion which is based on the notion of "paradigm of signs" and of "lexeme".

Is there an example where the traditional notion of "syntactic word" would apply but UD would not treat it as a word, or vice versa?

Stormur commented 11 months ago

Similarly, each node corresponds to one and only one UPOS, deprel and line (in the CoNLL-U format), so we can say, that UD is UPOS-based, deprel-based or line-based. It would be true, but not much helpful to the readers.

This is a very unfair remark. When I say that the UD segmentation is lexeme-based, I say that you need to know what are the lexemes in your sentence to segment it. Of course, you don't need to know what are the UPOS or the deprels.

I would not say it is unfair, it is the same remark I would have made. When you segment, you deal with single "forms", as they are often called, and only after you have forms does the notion of lexeme come in.

@martinpopel @Stormur You're right that a lexeme and an instance of this lexeme are different things. I don't think I introduced a different definition than the classical one. For instance, the word le and the lexeme LE are different things (le = LE_sing,masc).

This is problematic because you want to attach morphological features to lexemes, but in this framework they are pertinent only to single words. It might be that all forms in a lexeme share some feature, but this is observed bottom-up, not top-down. What exactly does LE_sing,masc mean if the lexeme labeled as LE also contains la and les?

I feel we are not getting to understand each other here, since on the one hand you seem to agree on the same "primitive" notion of lexeme we are describing, on the other I cannot relate this to some of the reasonings you are putting forth.

The fact is that each node in a UD dependency tree (except punctuations) corresponds to one and only one lexeme. It is why we can say (and I say) that such a structure is lexeme-based. I think we all agree on that now.

Yes, each node correspond, or better, belongs only to one lexeme in that we label it with one and only one lemma. This does not make UD syntactic trees "lexeme-based", it is actually the other way round: lexemes are groupings of forms, but how we group them is independent from their identification as forms and of their morphological features or syntactic relations. This is what I wanted to convey with the example in a previous post. We could group them as we want. In fact, there are some treebanks giving distinct lemmas to all forms of articles, instead of one. Or, in Latin nos 'we' and ego 'I' take different lemmas instead of being reconducted to a single one which might be taken to represent the singular first-person pronoun. Lexemes partition the set forms, but this partition has no particular influence on the annotation in the first instance, on the contrary, I would say it comes afterward. So I still do not agree on saying that UD syntactic trees are lexeme-based, at least not given the definition of lexeme we seem to converge on. But if on the contrary lexeme has to synonimous to "syntactic word", then the statement it is trivial.

Now, instances/elements of a lexeme are traditionally called lexes. This term is not widely used and if the UD community prefers to call them syntactic words or swords, it is not a problem, as long as we make the link with the traditional linguistic terminology and the notion of lexeme and lexe.

Are you perhaps referring to what is elsewhere called a "hypolemma" (or better, "hypolexeme"), e.g. the base form of a participle inside the larger paradigm of a verb, let's say Latin locutus (with its adjectival inflection) as part of the paradigm represented by loquor (including "finite" conjugations)? What else would be the difference between lexes and words, however defined?

arademaker commented 11 months ago

On 18 Dec 2023, at 17:12, Amir Zeldes @.***> wrote:

lexeme are basically what UD means by 'lemma',

Lemma is the canonical form used to represent the Lexeme.

sylvainkahane commented 11 months ago

Can someone give me her/his definition of a word where gonna and won't are not words, but na and wo are words?

martinpopel commented 11 months ago

definition of a word where gonna and won't are not words, but na and wo are words?

This is the definition of (syntactic) words in UD according to the English-specific word segmentation guidelines, which mention that n’t (reduced form of not) will be one of the words in a multiword token (MWT) in cases like don’t, ain’t and can’t. MWTs gonna and won't are not mentioned explicitly there, but they follow the same rules (if you look into the data). I think this originates from the tokenization of PennTB and many other related projects. For example, in PennTB wsj_0413.pos I can see [ Who/WP ] [ ya/PRP gon/VB na/TO call/VB ] ?/.

That said, in PennTB, word and token seem to be taken as synonyms. See e.g. Marcus et al. (1993) saying "the Penn Treebank, a corpus consisting of over 4.5 million words", but Table 4 shows it is exactly 4,885,798 words=tokens. In the paper, words are the units which are being assigned PoS tags. See also footnote 8.
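To make the token vs. syntactic word distinction concrete, here is a minimal sketch of how a multiword token is encoded in CoNLL-U: a range line (e.g. `2-3`) carries the surface token, and the following numbered lines carry the syntactic words. The sentence, the relations and the helper name `tokens_with_words` are illustrative assumptions, not copied from any treebank.

```python
# Illustrative CoNLL-U sentence (10 tab-separated columns); an assumption,
# not taken from a released treebank.
MWT_SAMPLE = "\n".join([
    "# text = I won't call",
    "1\tI\tI\tPRON\t_\t_\t4\tnsubj\t_\t_",
    "2-3\twon't\t_\t_\t_\t_\t_\t_\t_\t_",   # MWT range line: surface token only
    "2\two\twill\tAUX\t_\t_\t4\taux\t_\t_",
    "3\tn't\tnot\tPART\t_\t_\t4\tadvmod\t_\t_",
    "4\tcall\tcall\tVERB\t_\t_\t0\troot\t_\t_",
])


def tokens_with_words(conllu):
    """Return [(surface token, [syntactic word forms])] for one sentence."""
    result = []
    covered_until = 0  # last word ID covered by the most recent MWT range
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.split("\t")
        tok_id, form = cols[0], cols[1]
        if "." in tok_id:
            continue  # empty nodes (e.g. 5.1) are not surface tokens
        if "-" in tok_id:
            # Range line: opens a multiword token spanning start..end
            start, end = map(int, tok_id.split("-"))
            covered_until = end
            result.append((form, []))
        elif int(tok_id) <= covered_until:
            # Syntactic word that belongs to the open multiword token
            result[-1][1].append(form)
        else:
            # Ordinary case: one token, one syntactic word
            result.append((form, [form]))
    return result


for token, words in tokens_with_words(MWT_SAMPLE):
    print(token, "->", words)
```

In this encoding only the range line holds the string that actually occurs in the text; the forms wo and n't exist solely at the syntactic word level.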

Stormur commented 11 months ago

definition of a word where gonna and won't are not words, but na and wo are words?

This is the definition of (syntactic) words in UD according to the English-specific word segmentation guidelines, which mention that n’t (reduced form of not) will be one of the words in a multiword token (MWT) in cases like don’t, ain’t and can’t. MWTs gonna and won't are not mentioned explicitly there, but they follow the same rules (if you look into the data). I think this originates from the tokenization of PennTB and many other related projects. For example, in PennTB wsj_0413.pos I can see [ Who/WP ] [ ya/PRP gon/VB na/TO call/VB ] ?/.

Even so, I would say that the claim is not that na or wo specifically are words, but at most that they are particular realisations of the forms to and will when fusing with going and not. I mean, we cannot isolate wo independently from the phonological unit won't. I think that is also the point of having two levels ("token" and "s. word").

sylvainkahane commented 11 months ago

The English-specific word segmentation guidelines don't contain a definition of "syntactic word". They just give a list of examples of elements that are considered as swords by English UD treebanks. There is not even the beginnings of a criterion.

@Stormur's answer is exactly what I call a lexeme-based definition. You don't need to give a definition of "word" that isolates na and wo as words. What is said is that gonna must be cut in two parts because it contains both an occurrence of GO and of TO. So you start with a rather rough and traditional definition of "word" where gonna is a word (call it a token if you want) and you say that some words fuse two units. But gonna is not a fusion of the words going and to, because the speaker does not produce a to which is then transformed into a na. The speaker never produces a to. The speaker decides to produce "GO + present participle + TO" and this is realized by gonna. There is no word to in gonna, there is just an occurrence of the lexeme TO. (TO is a set of signs, while to is a sign whose signifier is /tu/.) My description is linked to a particular framework, the Meaning-Text Theory (see Mel'čuk 1988, Dependency syntax) and you can disagree with this theoretical framework. In MTT, we consider that the speaker who wants to express a meaning first chooses some lexemes and the grammatical morphemes imposed by the lexemes, and then realizes them by the appropriate signifier according to the context.

nschneid commented 11 months ago

The English-specific word segmentation guidelines don't contain a definition of "syntactic word". They just give a list of examples of elements that are considered as swords by English UD treebanks. There is not even the beginnings of a criterion.

True, but that page is not trying to give a theoretical definition - it is just explaining how as a practical matter the English treebanks were annotated (in some cases derived from earlier treebanks). The best place to look for UD's theoretical explanation is de Marneffe et al. 2021, Section 2.2. That may not be fully satisfying, however, which would explain why one of the goals of UniDive WG2 is "harmonizing the definition of 'syntactic word' across languages".

Stormur commented 11 months ago

@Stormur's answer is exactly what I call a lexeme-based definition. You don't need to give a definition of "word" that isolates na and wo as words. What is said is that gonna must be cut in two parts because it contains both an occurrence of GO and of TO. So you start with a rather rough and traditional definition of "word" where gonna is a word (call it a token if you want) and you say that some words fuse two units. But gonna is not a fusion of the words going and to, because the speaker does not produce a to which is then transformed into a na. The speaker never produces a to. The speaker decides to produce "GO + present participle + TO" and this is realized by gonna. There is no word to in gonna, there is just an occurrence of the lexeme TO. (TO is a set of signs, while to is a sign whose signifier is /tu/.) My description is linked to a particular framework, the Meaning-Text Theory (see Mel'čuk 1988, Dependency syntax) and you can disagree with this theoretical framework. In MTT, we consider that the speaker who wants to express a meaning first chooses some lexemes and the grammatical morphemes imposed by the lexemes, and then realizes them by the appropriate signifier according to the context.

Even if I now think I am starting to better understand what you mean (thanks for the reference, too), the MTT you cite seems to me to aim at a more cognitive explanation of language, which I think is rather inconsequential to the annotation level of UD. Or, in other words, it is something which comes at a later, higher level, not something which itself drives morphosyntactic annotation.

For instance, how exactly the speaker produces forms is rather independent and only offers an a posteriori reason for the fact that we want to annotate two (syntactic) words in gonna: we see it alternating with going to, at the same time we observe other alternations like want to / wanna, and we also observe that going, want and to appear independently from each other. If we know how to annotate their single instances, I think this is all we need to treat gonna as we do, and at this level there is (yet) no intervention of the concept of lexeme; we have just forms and their combinations. By the way, I think it is a little hard to defend the claim that there is no to in gonna, or that this is no fusion, as we (morphophonologically) have a rather regular assimilation process and I imagine that it is still ungrammatical to say e.g. I am going do this... but again, these considerations are quite marginal to the annotation.

So I consider mine a form-based answer; lexemes have no role in it. Identifying going as part of a paradigm is a subsequent step.

sylvainkahane commented 11 months ago

It was a very long discussion and sometimes a little too abstract. I want first to summarize what we are discussing here. First I want to reassure the colleagues who were lost: the discussion specifically concerns fusional languages (which include most Indo-European languages) and the notion of so-called MWT. There were two points in this discussion:

1) A terminological point: UD uses the terms "token" and "word" in a particular way, which is quite different from most traditions in linguistics. @nschneid proposed to use the term "syntactic word" or "sword" instead of word, to avoid some confusion. I will only use "sword" in the following. Bad terminology can be very misleading but we can always accept it, if we have clear definitions. It seems that a large majority of UD people are now familiar with this terminology and we won't change it.

2) A theoretical point: I claim that the notion of "sword" is not well defined and cannot be defined without first introducing some other concepts, such as a more traditional notion of "word" (that I will not define now) and the notion of "lexeme", which is based on this notion of "word". And as the notion of "lexeme" is prior to the notion of "sword", it is simpler to say that the UD segmentation is lexeme-based than to say that it is sword-based.

Let me explain the second point again by answering one of @Stormur's last remarks.

By the way, I think it is a little hard to defend the fact that there is no to in gonna

It really depends on what we call to. If we consider the linguistic sign to, it has only one signifier and it is /tu/ (written to, often unaccented and pronounced [tǝ]). This first to is annotated [form=to, lemma=to] in UD. This sign is not in gonna. In the UD annotation, another element is considered that I call na, annotated [form=na, lemma=to]. Now we can consider a more abstract element, which is the set TO = {to, na}. When we say that gonna is the fusion of going and to, we mean that there is a realization of something similar to to in gonna that we have called na. We can also say, paraphrasing @Stormur, that "there is TO in gonna".

Sets like TO are called X-emes: morphemes, when they concern minimal units, and lexemes, when they concern lexical units. In UD, and more generally in syntax, we are only concerned with lexemes.

Now if we look at all the swords with [lemma=to] in UD_English-GUM (and if we exclude orthographic variations), we have in fact 4 realisations and TO = {to, na, a, ta} (gon-na, ought-a, got-ta). We see here that the choice to segment gonna as gon-na, rather than gonn-a, is certainly difficult to justify and that the notion of sword came after the notion of lexeme. What is clear is that the lexeme TO is part of gonna. Now what exactly its contribution to gonna is, is a more complex question and maybe an irrelevant one. It becomes clear when we look at au in French (pronounced /o/), where the fusion is more profound. Here, we know that au is the fusion of À and LE, but … it is not possible to decide what the contribution of À and LE to au is and to give a signifier to the swords in au. So what exactly are the swords in au? Are they linguistic signs? But a linguistic sign always has a signifier. If they are not signs, what are they? Was it really reasonable to use the term "word" for such objects?
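The kind of count described above can be sketched as follows: collect, for each lemma, the set of FORM values found on syntactic word lines of a CoNLL-U file. The sample data and the function name `forms_by_lemma` are hypothetical; the FORM values simply follow the gon-na segmentation described here and are not copied from GUM.

```python
from collections import defaultdict

# Two illustrative sentences (assumed annotation, not taken from GUM).
TO_SAMPLE = "\n".join([
    "# text = gonna call",
    "1-2\tgonna\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tgon\tgo\tVERB\t_\t_\t0\troot\t_\t_",
    "2\tna\tto\tPART\t_\t_\t3\tmark\t_\t_",
    "3\tcall\tcall\tVERB\t_\t_\t1\txcomp\t_\t_",
    "",
    "# text = want to go",
    "1\twant\twant\tVERB\t_\t_\t0\troot\t_\t_",
    "2\tto\tto\tPART\t_\t_\t3\tmark\t_\t_",
    "3\tgo\tgo\tVERB\t_\t_\t1\txcomp\t_\t_",
])


def forms_by_lemma(conllu):
    """Map each lemma to the set of lowercased forms that realise it."""
    forms = defaultdict(set)
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence breaks and comments
        cols = line.split("\t")
        # Keep only syntactic word lines: plain integer IDs.
        # This skips MWT ranges (1-2) and empty nodes (5.1).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        forms[cols[2]].add(cols[1].lower())
    return forms


print(sorted(forms_by_lemma(TO_SAMPLE)["to"]))
```

Note that the result depends entirely on what the treebank puts in the FORM column: if the sword inside gonna were annotated with the form to instead of na, the set for the lemma to would contain a single element.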

AngledLuffa commented 11 months ago

If our goal is to reduce confusing terminology and make UD accessible, I suggest not abbreviating "syntactic word"

Stormur commented 11 months ago

This sign is not in gonna. In the UD annotation, another element is considered that I call na, annotated [form=na, lemma=to].

I do not think this is the right annotation. Also according to guidelines, the form should be to (then in this case, as for ADPs, the lemma is identical).

[This opens up the further issue that in my opinion the different annotation of na, ta, a in the forms gonna, gotta, oughta should be revised as they are inconsistent. This could also be a source of misunderstandings.]

If we consider the linguistic sign to, it has only one signifier and it is /tu/ (written to, often unaccented and pronounced [tǝ]).

I would question this claim, since apparently it can also combine in different ways, in this case phonologically. The phonological level does not need to postulate a lexeme. This seems to be an aprioristic starting point.

Now we can consider a more abstract element, which is the set TO = {to, na}.

I would also question this, as a consequence of a faulty segmentation. na is not part of the lexeme of to, it is a sequence which cannot be considered separately from the whole combination gonna. I do not think it is a morph. But at the same time the element gonna contains this syntactic element. It is more about syntactic functions than the exact forms themselves, this is also what I understand of "syntactic word" ( :dagger: ).

and that the notion of sword came after the notion of lexeme. What is clear is that the lexeme TO is part of gonna

I still cannot fathom this implication, I am sorry.

From a logical point of view it is simply wrong to say that a set is included in a single element, even if I can imagine some flexibility in the expression.

it is not possible to decide what is the contribution of À and LE to au and to give signifier to the swords in au. So what are exactly the swords in au? Are they linguistic signs? But a linguistic sign has always a signifier. If they are not signs, what are they? Was it really reasonable to use the term "word" for such objects?

The point is, it is not really important; it is a rather moot issue. What we have is a form that alternates with à la, à l'..., aux etc., where those two elements are present. The term under discussion is in fact syntactic word, precisely to make this clear, not just the vaguer word.

Probably the discussion is starting to repeat itself in a loop, so I will stop here and possibly wait for new participants 🙂