UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

inconsistent analysis of etc #820

Closed wellington36 closed 2 years ago

wellington36 commented 2 years ago

Analyzing the expression etc in the Portuguese-Bosque corpus (https://github.com/UniversalDependencies/UD_Portuguese-Bosque/issues/386), we identified inconsistencies in how it is annotated in other UD corpora:
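One way to surface inconsistencies of this kind is to tally the (UPOS, deprel) pairs assigned to "etc" tokens in CoNLL-U data. The following is only a minimal sketch: the helper names `row` and `tally_etc` are hypothetical, and the two sample sentences and their annotations are invented for illustration rather than taken from any treebank.

```python
from collections import Counter


def row(idx, form, lemma, upos, head, deprel):
    """Build one 10-column CoNLL-U token line (unused columns are '_')."""
    return "\t".join([str(idx), form, lemma, upos, "_", "_", str(head), deprel, "_", "_"])


# Invented sample: two sentences annotating "etc" differently.
SAMPLE = "\n".join([
    "# text = livros, revistas, etc.",
    row(1, "livros", "livro", "NOUN", 0, "root"),
    row(2, ",", ",", "PUNCT", 3, "punct"),
    row(3, "revistas", "revista", "NOUN", 1, "conj"),
    row(4, ",", ",", "PUNCT", 5, "punct"),
    row(5, "etc.", "etc.", "ADV", 1, "conj"),
    "",
    "# text = books etc",
    row(1, "books", "book", "NOUN", 0, "root"),
    row(2, "etc", "etc", "X", 1, "conj"),
])


def tally_etc(conllu_text):
    """Count (UPOS, deprel) pairs for tokens whose form is 'etc' or 'etcetera'."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence breaks and comment lines
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges (3-4) and empty nodes (5.1)
        form = cols[1].lower().rstrip(".")
        if form in {"etc", "etcetera"}:
            counts[(cols[3], cols[7])] += 1
    return counts


for pair, n in sorted(tally_etc(SAMPLE).items()):
    print(pair, n)
```

Running the same function over the `.conllu` files of several treebanks would show at a glance which tag and relation each corpus uses.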

sylvainkahane commented 2 years ago

That's a very complicated word that does not fit the distribution of any other word. Such words are called extenders by Overstreet 2005. For spoken French we analyzed them as CCONJ, even if they are not equivalent to coordinating conjunctions. (More precisely, et and caetera are analysed as ADVs and the idiom they form as a CCONJ.) http://match.grew.fr/?corpus=SUD_French-Rhapsodie@latest&custom=618c2df1e11a8

Overstreet M. (2005). And stuff und so: Investigating Pragmatic Expressions in English and German. Journal of Pragmatics 37, 1845–1864.

nschneid commented 2 years ago

For English, as I understand it, the idea is that "etc" is a foreign word hence upos of X. But it attaches as ~cc~ conj.

A related thing that has been discussed but not resolved is the structure of "et al."—the options being

sylvainkahane commented 2 years ago

"etc" is a loan word in English, not a foreign word. X is not a good option. Moreover it occupies the position of the last conjunct in a coordination and can commute with expressions such as "and so on". So I think it must be conj.

amir-zeldes commented 2 years ago

For English, as I understand it, the idea is that "etc" is a foreign word hence upos of X. But it attaches as cc.

You mean as conj right? That's how it is in both EWT and GUM

it occupies the position of the last conjunct in a coordination and can commute with expressions such as "and so on"

Exactly, I think one of the reasons for this analysis, at least coming from GUM which has entity and coreference annotation, is that it behaves like a plural coordinate phrase and can corefer with one. So we can have:

"etc" is a loan word in English, not a foreign word

In the English corpora, the xpos tag FW is usually automatically converted to X, it's only 'foreign' because the PTB guidelines treated it this way. I agree it's not ideal, but I'm not sure if it's worth making the correspondence with xpos more piecemeal by changing this specific word's upos tag (though it doesn't matter too much to me personally)
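The trade-off described here can be sketched in a few lines: a table-driven xpos-to-upos conversion sends every FW token to X, and changing the tag for "etc." alone requires a word-specific override on top of the table. The mapping below is a small illustrative subset, and `upos_from_xpos` and its `overrides` parameter are hypothetical names, not the converter actually used for EWT or GUM.

```python
# Illustrative subset of a PTB xpos -> UPOS table (not the full mapping).
PTB_TO_UPOS = {
    "NN": "NOUN",
    "NNS": "NOUN",
    "NNP": "PROPN",
    "JJ": "ADJ",
    "RB": "ADV",
    "CC": "CCONJ",
    "FW": "X",  # PTB "foreign word" becomes X automatically
}


def upos_from_xpos(xpos, form, overrides=None):
    """Map xpos to UPOS, letting word-specific overrides win over the table."""
    overrides = overrides or {}
    key = form.lower().rstrip(".")
    if key in overrides:
        return overrides[key]
    # Assumption of this sketch: unknown xpos tags fall back to X.
    return PTB_TO_UPOS.get(xpos, "X")


print(upos_from_xpos("FW", "etc."))                   # table only
print(upos_from_xpos("FW", "etc.", {"etc": "NOUN"}))  # with a per-word override
```

Each entry added to `overrides` makes the xpos/upos correspondence a little less uniform, which is the "piecemeal" cost mentioned above.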

arademaker commented 2 years ago

Why not follow the German HDT and split it into et/cc cetera/noun? That is, treat etc as an MWT.

The second case of @nschneid right?

arademaker commented 2 years ago

Moreover it occupies the position of the last conjunct in a coordination and can commute with expressions such as "and so on". So I think it must be conj.

But what upostag to use? That is why I prefer split "et cetera"

nschneid commented 2 years ago

Moreover it occupies the position of the last conjunct in a coordination and can commute with expressions such as "and so on". So I think it must be conj.

But what upostag to use? That is why I prefer split "et cetera"

Just "et cetera"? Are other abbreviations split as well? In the English tokenization we only split off clitics.

dan-zeman commented 2 years ago

This issue starts to overlap with #181 (and possibly also #112 and #516).

sylvainkahane commented 2 years ago

Two further remarks on the CCONJ analysis of "etc". Semantically, "etc" contains the meaning of "and": "A, B, etc" is always a (semantic) conjunction (as opposed to the disjunction "A or B"). Syntactically, "etc" excludes other CCONJs: "A and B", "A, B, etc", but *"A and B etc". This mutual exclusion between "etc" and other CCONJs allows us to consider that they belong to the same distributional class, even if "etc" occupies another position. Of course, "etc" does not share all the properties of CCONJs, but it is the best choice among a list of bad choices. X is a no-choice. ADV does not make sense: "etc" has nothing in common with ADVs that modify a verb or an adjective. NOUN is worse: "etc" cannot occupy nominal positions and it can close any coordination (I would like to dance, jump, etc). PUNCT is used for written symbols that have only a suprasegmental counterpart in spoken language.

arademaker commented 2 years ago

@dan-zeman is right, this issue is part of #181; should we close it here and continue there? I can't see etc tagged as ADV in Portuguese, but I may be wrong. We have 14 cases in Bosque. In #181, @manning was against splitting et cetera, but that would solve the tag problem considering the analysis from @sylvainkahane above.

amir-zeldes commented 2 years ago

Changing the tokenization for etc. would be a pretty radical break with LDC and other corpus behavior in English, so I would be strongly against it, and as @nschneid points out it is a slippery slope opening a huge number of questions regarding what to split or not to split (we also don't split acronyms, and I don't see that 'etc' is fundamentally different)

Latin "cetera" is a plural adjective meaning "remaining", so if it's not a foreign word, then I suppose it could be tagged with upos ADJ, but it's not that X offends me that much - the guidelines state that it is used with tokens that "for some reason cannot be assigned a real part-of-speech category", and I think it's OK that that guideline is fairly vague. As @sylvainkahane pointed out, it is basically a sui generis, so no other tag fits well. In any case, "etc" seems more complex than the simple integrated loan word example of "sombrero":

https://universaldependencies.org/u/pos/X.html

Happy to move this to #181 if preferred.

manning commented 2 years ago

I think I agree with almost everything @sylvainkahane writes, except that I don't come down on the side of CCONJ.

One word or two

Yes, "etc." has a history whereby it comes from two Latin words. But it just doesn't seem a good synchronic analysis to say that it should be two words. Would we next split up "another" because it comes from two English words? I think most linguists regard it as a mistake to try to preserve diachrony in a synchronic description. Evidence for it being one word synchronically includes:

Syntax

No one has argued against the current analysis and @sylvainkahane's argument here for conj: "Moreover it occupies the position of the last conjunct in a coordination and can commute with expressions such as "and so on". So I think it must be conj." This does seem to me the best way to treat it in the syntax. Treating it as cc would look very odd and not capture the idea of there being conjoined things. If you compare the two sentences "I'll bring sheets, etc." and "I'll bring sheets, towels", then I think we are best off representing both of them with a conj: sheets --conj--> etc. and sheets --conj--> towels.

Part of speech

Several of the choices are definitely wrong:

The two plausible candidates correspond to the two halves of the meaning of "etc.": CCONJ or NOUN. I think we do have to accept that "etc." is a weird special word, and anything we do is shoving it into some category or another. @sylvainkahane gives the case for CCONJ. But I think we are better off calling it a NOUN:

nschneid commented 2 years ago

Hmm. What about the argument that it can coordinate with non-nominals? "We need to mow the lawn, weed the garden, paint the mailbox, etc.". "Bees swarmed everywhere—inside the hive, above the tree, etc."

Also, unlike other nouns, it must be the last element in a coordination.

Non-Latinate paraphrases:

It seems to me that no standard POS is a great fit because "etc." has a very special distribution (last element of a coordination of any type). I could see this being an argument to call it X (or ADV, in systems where that is the garbage category).

nschneid commented 2 years ago

This seems like a derived sense meaning "other miscellaneous/non-notable things", but of course nouns get derived from other parts of speech all the time.

nschneid commented 2 years ago

Another idea is to call it cc:postconj, by analogy to cc:preconj ("both...and", "(n)either...(n)or"). On this analysis it is not a conjunct but a marker that follows the last conjunct to refine its meaning as a non-exclusive list. A downside of this is that it can occur after just one item ("The box contains books etc."). So it would be weird to say that is postcoordination "etc." (perhaps there it could be called ADV/advmod there—cf. postmodifying "plus" as in "1 year plus").

nschneid commented 2 years ago

Though I can't find any instances in GUM or EWT, "both" can also occur post-coordination: "We invited him and her both" (meaning 'We invited both him and her'). So that would be another potential justification for cc:postconj.

lauma commented 2 years ago

Latvian doesn't use etc. particularly often, but there are two common abbreviations we would like to annotate in a similar manner: 1) u.c. from un citi 'and others' 2) utt. from un tā tālāk 'and so on'

There are also a couple of rarer ones, u.t.j.p. (un tā jo projām 'and so on'), v.tml. (vai tamlīdzīgi 'or similar'), u.tml. (un tamlīdzīgi 'and similar'); thus, after much discussion, we just assigned a separate tag (yd, that is, abbreviations serving as discourse markers) to them in our local tagset.

For UD purposes we currently convert them to SYM with the relation conj, and we annotate etc. the same way when texts in our corpus use it. The SYM tag was born out of pure desperation and a lack of understanding of how to treat it in UD style, but for conj our thinking was that these small abbreviations usually end some kind of list by indicating that the written list is incomplete and includes only some of the items the writer was thinking about. That is, the Latvian thinking was that the abbreviation works as the final element of the list.

Anyway, I am very interested in the final conclusions of this discussion :)

manning commented 2 years ago

I agree, @nschneid, that the fact that you can use “etc.” with things other than nominals is an argument against calling it a noun (though we do get unlike category coordination in English and you might possibly regard the verbal cases as ellipsis of “and [do] other things”). And I certainly agree that “no standard POS is a great fit”. I think we need to choose something as a convention. I agree that choosing CCONJ is also reasonable. I still suspect NOUN might be best. While in general “etc.” is final, one other usage to consider is that it can be repeated: “We’ll need sleeping bags, tents, water bottles, etc., etc.”

nschneid commented 2 years ago

If people insist on viewing it as nominal I would think PRON would make more sense than NOUN. It is vaguely similar to "everything-else"—both in meaning, and in that it doesn't have a plural ending despite referring to multiple items.

But it also can't do things that nominals normally do, like head NPs (absent coordination), or be the antecedent for anaphora.

amir-zeldes commented 2 years ago

“and [do] other things”

I find this argument convincing for NOUN, and I guess actually that type of ellipsis would probably work in Latin as well (VP etc.), and there too it would be superficially an unlike coordination, but "cetera" would remain an adjective.

PRON would make more sense than NOUN. It is vaguely similar to "everything-else"

Mm, if we agree it's essentially a nominal I would prefer noun, I think it would be odd to say that it's a loan-pronoun just for the semantic reason that it is unspecific, and typologically loan-pronouns are quite rare. I also don't think it's considered a pronoun in Latin despite being semantically vague, and there are also some oddities about its use, such as repeatability ("etc. etc.") which don't really fit that profile.

nschneid commented 2 years ago

And I certainly agree that “no standard POS is a great fit”. I think we need to choose something as a convention.

In the spirit of putting all options on the table, we could also consider PART. It is like a function word in that it only occurs in a particular grammatical construction. PART is essentially "miscellaneous function word".

nschneid commented 2 years ago

How wedded are we to the cc:preconj relation for "both X and Y"? I ask because it always felt weird to me to call those CCONJ just because they are elements of a coordinating construction, as they are not the elements that link the conjuncts, but rather markers that refine the nature of the coordination.

FWIW, CGEL (p. 1305) calls "both" and "either" determinatives (as the POS) whether they occur in determiner position of an NP, or "function as marker of the first coordinate in correlative coordination". I.e.: CGEL does NOT consider "both" or "either" to be coordinating conjunctions when they occur within coordinate structures.

cc:preconj is perhaps too specific anyway as it applies to only a few lemmas.

If we were to decide that elaborations of a coordination relation are not CCONJ or cc(:preconj), but rather (say) ADV/advmod, this would bear on what we do for "etc."

amir-zeldes commented 2 years ago

cc:preconj relation for "both X and Y"?

I think that should be a separate issue, both because "etc." is an issue for many languages mentioned above which may or may not have similar problems with "both" and because I'd like to get to a decision on etc. I don't think this is too related, because "etc." is the last member of a coordination chain (i.e. it is one of the coordinates itself) and these premodifiers are something different (not members of the coordination itself).

The more I think about it the more I agree with @manning , I basically think it is interchangeable with "the rest" (a NOUN) or "others" (in English, due to the s-plural, also a NOUN by virtue of the NNS -> guidelines):

For me all of these work the same and argue for NOUN. English UD data has only three lemmas tagged PART: not, infinitive to and the genitive 's. I think putting "etc." on the same list would be odd, and considering how tricky this has turned out to be, I think there's nothing too wrong about NOUN (effectively making it be a way of saying "rest" or "others"). It's a simple solution that doesn't take too much explaining. If we agree it's deprel conj then a tag CCONJ is unexpected IMO, since that would mean the POS is determined by an internal dependent (etymological "et") and not the internal head ("cetera").

LarsAhrenberg commented 2 years ago

In the Swedish treebanks etc. and etcetera are currently consistently coded as ADV/conj. The choice of ADV I think is motivated by the usual argument that ADV is a category for words that don't fit elsewhere (as also @nschneid said) and it is what the dictionaries say. My proposal is that the treatment of etcetera can be language-specific and based on comparable words/phrases in the language, to the extent that they can be found.

In Swedish it can be compared to och så vidare, abbreviated commonly as osv. which mirrors the German und so weiter and usw, but also to med mera, abbreviated mm. m.m. or mm and med flera, abbreviated m.fl. or mfl. These however are introduced by an ADP (German mit, English with) and if spelled out would have a head with the dependency nmod or obl as the case may be. The function is quite similar to etc, however, as it ends or disrupts a listing of phrases. For this reason I would support a sub-dependency such as postconj.

A general argument to the English discussion: In UD function words usually count less than content words. Thus it is a bit odd that the part-of-speech for the abbreviations should be based on the first part (CCONJ or ADP) rather than what follows (ADV, NOUN or PRON).

nschneid commented 2 years ago

The more I think about it the more I agree with @manning , I basically think it is interchangeable with "the rest" (a NOUN) or "others" (in English, due to the s-plural, also a NOUN by virtue of the NNS -> guidelines):

But this is a semantic argument, because it is also paraphrasable with "and so on" or "and more", neither of which are nominal.

In fact, depending on the syntactic status of what is coordinated, these may sound better than "and the rest" or "and others":

"Trinidadian, Jamaican, and so on" sounds reasonable. I suppose you could awkwardly say "Trinidadian, Jamaican, and other nationalities" but that creates a new mention that can serve as an antecedent ("... Those other nationalities are..."). I don't think you can do that with "etc.":

nschneid commented 2 years ago

"I bought Alice an apple, Bob a banana, Caleb a carrot, etc." - to paraphrase that with nouns you'd have to say "and other people other things"? The point is that "etc." can be used even with conjuncts that are not traditional constituents!

amir-zeldes commented 2 years ago

But this is a semantic argument

It's not just a semantic argument, since the etymology is literally a coordination of a nominalized neuter plural adjective, and it's easier to explain it as a noun than, say, a verb, even though it could stand for either in coordination

"Trinidadian, Jamaican, and so on" sounds reasonable

So are you saying it should be ADV? I honestly don't feel very passionately about this word (except maybe opposing PRON and PART, since those are currently nice, small, closed classes in the English data), so I could live with that if that is the consensus. But if there are mostly contexts where a noun paraphrase works best and a few rare ones for ADV, I'd tend to go with the more common version, especially if it matches the etymology (easier to explain to people that it translates to "and the rest", rather than saying we equated it with "and so on", which is not really related).

nschneid commented 2 years ago

If we're not considering it a foreign word or tokenizing it as two words I don't see how etymology is relevant. "Etc." to English speakers is probably not quite the same as "et cetera" to Latin speakers.

I would be fine with ADV or possibly CCONJ or PART. I just don't see how "etc." fits any of the standard distributional tests for NOUN in English.

dan-zeman commented 2 years ago

This should be addressed in the universal guidelines but it should be made clear there that the UPOS tag is not necessarily the same in all languages (while the conj deprel probably can be used everywhere), especially if they have their own equivalent instead of the Latin loanword. For example, the Czech equivalent is atd., standing for a tak dále “and so further”. It is tagged ADV in the Czech corpora (http://hdl.handle.net/11346/PMLTQ-L8ZB), presumably because both tak and dále are adverbs. On the other hand, I don't think that this necessarily applies to English and I find NOUN quite acceptable among all the bad options for English etc.

amir-zeldes commented 2 years ago

Since I agree with conj, I'm also OK with NOUN for English, since it very (most?) often coordinates with nominals, and NOUN is more or less the most generic choice (similar to "and stuff"). I can change it at least in GUM, but EWT should ideally be the same.

nschneid commented 2 years ago

Etc. occurs at the end of coordinations. Are there other examples of NOUNs that occur in lexically productive combinations, but just in one position of one particular construction? (Not hapaxes in frozen expressions like kith and kin.)

dan-zeman commented 2 years ago

Are there other examples of NOUNs that occur in lexically productive combinations, but just in one position of one particular construction? (Not hapaxes in frozen expressions like kith and kin.)

Perhaps all Chinese classifiers?

amir-zeldes commented 2 years ago

Etc. occurs at the end of coordinations

I think that's natural, because it contains a word meaning "and" (which is why it gets deprel conj). I think it basically corresponds to a combination of CCONJ+HEAD, where the HEAD etymologically corresponds to an adjectival phrase ("the remaining"), and which in context can be coordinated with anything (incl. not just nouns, as in "books etc.", but also VPs, as you and others discussed above). If that's right, then it should be tagged like HEAD (same as acronyms), but if we want a single tag for this word, then we need to make a concrete choice.

Of the options NOUN, ADJ and VERB, I think NOUN is among the more generic choices, and basically corresponds to saying "and stuff", or "and the rest". I don't think ADJ is terrible either, but in terms of distribution I find both better than VERB or ADV (for the latter, it doesn't specify something like manner of a predicate or intensity of some adjective, and doesn't stand before either, the typical functions and positions of an adverb); for ADJ I would note there is no adjectival comparative or negation, so it is perhaps better to choose NOUN from multiple perspectives.

nschneid commented 2 years ago

Are there other examples of NOUNs that occur in lexically productive combinations, but just in one position of one particular construction? (Not hapaxes in frozen expressions like kith and kin.)

Perhaps all Chinese classifiers?

Interesting, I didn't realize that. But at least those are modifiers within NPs right?

What about PART, as it is a category for syntactically exceptional items? Possessive 's occurs only at the end of an NP and infinitive marker to only at the beginning of a clause, and these do not share the wider distribution of other categories in English.

dan-zeman commented 2 years ago

Perhaps all Chinese classifiers?

Interesting, I didn't realize that. But at least those are modifiers within NPs right?

I suppose so.

aryamanarora commented 2 years ago

Just adding another data point: the Punjabi translational equivalent ਆਦਿ ādi I tagged as PART since it takes no nominal declensions, has no apparent gender, only occurs at the end of coordinations--it doesn't seem to type well with any other part of speech. It also doesn't really have the same weirdness of et cetera as a potentially foreign word, since Sanskrit loans are common and fully incorporated into the lexicon in Punjabi.

amir-zeldes commented 2 years ago

As I mentioned above, currently the inventory of PART in English is only the negation "not", infinitive "to" and the genitive "'s". All three are highly common, indeclinable function words; adding "etc.", which is a learned loan-item, seems out of place in that list, and also makes it a bit odd that it is coordinated so often with nouns (we say "dogs etc." but not "to etc.", "not etc." or "'s etc.") - of course coordination doesn't have to occur between like items, but it most often does.

If anyone is curious, here is the distribution of the coordinate item in GUM:

NOUN 10 PROPN 1 ADJ 1 VERB 1

Also wanted to add to @manning 's dictionary survey that dictionary.com concurs with Merriam Webster in labeling it as a noun (and listing the plural from @manning 's example as well):

https://www.dictionary.com/browse/etcetera
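A distribution like the GUM counts reported above could be computed by looking up the UPOS of the token that each "etc." attaches to via conj. This is a hedged sketch, not the query actually used: `head_upos_of_etc` is a hypothetical helper, tokens are assumed to be pre-parsed into (id, form, upos, head, deprel) tuples, and the two sample sentences are invented.

```python
from collections import Counter


def head_upos_of_etc(sentences):
    """Count the UPOS of the head of every 'etc' token attached as conj.

    sentences: list of sentences, each a list of
               (id, form, upos, head, deprel) tuples.
    """
    counts = Counter()
    for sent in sentences:
        upos_by_id = {tok_id: upos for tok_id, _, upos, _, _ in sent}
        for tok_id, form, upos, head, deprel in sent:
            if form.lower().rstrip(".") == "etc" and deprel == "conj":
                # head == 0 would mean attachment to the artificial root
                counts[upos_by_id.get(head, "ROOT")] += 1
    return counts


# Invented sample data: one nominal and one verbal coordination.
SENTENCES = [
    [(1, "dogs", "NOUN", 0, "root"),
     (2, ",", "PUNCT", 3, "punct"),
     (3, "etc.", "X", 1, "conj")],
    [(1, "swim", "VERB", 0, "root"),
     (2, ",", "PUNCT", 3, "punct"),
     (3, "hike", "VERB", 1, "conj"),
     (4, "etc", "X", 1, "conj")],
]

print(head_upos_of_etc(SENTENCES))
```

Because the form is matched after stripping a trailing period, the count covers both "etc." and "etc", which matters when the abbreviation is written both ways in a corpus.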

nschneid commented 2 years ago

If anyone is curious, here is the distribution of the coordinate item in GUM:

NOUN 10 PROPN 1 ADJ 1 VERB 1

Not as overwhelmingly skewed in EWT—roughly 45 NOUN+PROPN, 10 VERB, 3 ADJ, 2 ADV. (I say "roughly" because some of them look like annotation errors.)

Also wanted to add to @manning 's dictionary survey that dictionary.com concurs with Merriam Webster in labeling it as a noun (and listing the plural from @manning 's example as well):

https://www.dictionary.com/browse/etcetera

That's the spelled-out version which can be pluralized as "etceteras". For "etc." it merely says "abbreviation", which is a cop-out IMO. :) https://www.dictionary.com/browse/etc

Anyway I agree that "etc." is not as frequent as other PART items, but is frequency a necessary criterion? I thought PART was basically for words that are extremely constrained and exceptional grammatically, and tend on the functional side.

Regarding coordination, I think there are multiple constructions at play:

nschneid commented 2 years ago

Oh I realized another thing: In its post-coordination use, there is a standard way to emphasize the magnitude of the "etc."—by repeating it: I bought an apple, a banana, a carrot, etc. etc. Not by pluralizing it, as you would expect if it were nominal (*I bought an apple, a banana, a carrot, many etceteras), and not by adding an intensifier, as you would expect for an adjective or adverb (*I bought an apple, a banana, a carrot, very etc.).

This repetition is not just a marginal thing, BTW: COCA has >2k hits for "etc etc".
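For anyone who wants to check the repetition pattern in their own plain-text data, a rough regular-expression count might look like the sketch below. The pattern and the name `count_repeated_etc` are assumptions made for illustration; this is not COCA's query syntax, and it counts each run of adjacent "etc" tokens at most once per non-overlapping pair.

```python
import re

# "etc" optionally followed by "." and/or ",", whitespace, then another "etc".
REPEATED_ETC = re.compile(r"\betc\b\.?,?\s+etc\b\.?", re.IGNORECASE)


def count_repeated_etc(text):
    """Count non-overlapping occurrences of a doubled 'etc'."""
    return len(REPEATED_ETC.findall(text))


examples = [
    "I bought an apple, a banana, a carrot, etc. etc.",
    "We'll need sleeping bags, tents, water bottles, etc., etc.",
    "The box contains books etc.",
]
for text in examples:
    print(count_repeated_etc(text), text)
```

The word boundaries keep the spelled-out "etcetera" from matching, so only the abbreviated form is counted.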

amir-zeldes commented 2 years ago

Not as overwhelmingly skewed in EWT—roughly 45 NOUN+PROPN, 10 VERB, 3 ADJ, 2 ADV

OK, but if we have to choose one, then it looks like EWT supports NOUN too

We ate cake, drank beer, etc.: I would consider this the main use

Based on frequencies, the main use is for lists of nominals (18/24 in GUM, I missed a few earlier because I forgot to search without the period too)

It would not be crazy to call it CCONJ along similar lines as cc:preconj items "both"/"neither"/"either" being tagged CCONJ

This idea will run into problems when there is only one item before "etc", as in "books etc." CCONJ basically operates in patterns like "X CCONJ/cc Y/conj", and in the cc:preconj pattern in "CCONJ/cc:preconj X CCONJ/cc Y/conj". If we only have "X etc.", then it is not clear what CCONJ is functioning as a coordinator for: we are missing the second conjunct IMO which is what licenses the coordinating conjunction.

there is a standard way to emphasize the magnitude of the "etc."—by repeating it: I bought an apple, a banana, a carrot, etc. etc.

Sure, but I don't see how that would rule out a noun. I can say "all day it was just letters letters letters" and I don't think that detracts from "letters" being a noun (and here too, I would attach them via conj)

is frequency a necessary criterion? I thought PART was basically for words that are extremely constrained and exceptional grammatically, and tend on the functional side.

Traditionally I think PART is something like a wastebasket for things that don't fit elsewhere (and seem to usually be indeclinable). In some languages they form organic classes based on some criterion (for example the Classical Greek particles, which unlike adverbs obey Wackernagel's Law).

But TBH I have never felt that UD English needs upos=PART at all; in my opinion the best upos for those three items would have been:

The last one is maybe more debatable, but all of them look more plausible to me as particles than "etc.", maybe also because they are closed class items (function words, as you say), whereas "etc." is a scholarly loan, which although unique, seems to come from an open borrowing process (I don't want to see words like "op. cit.", "ibid." and "scil." or who knows what else creep into the particle class). I fully agree that "etc." is odd, but essentially I think having a noun that only appears in coordinations is less odd than a particle that only appears in coordinations, and actually shares some properties with referring expressions.

nschneid commented 2 years ago

Maybe "etc." started out as a scholarly loan—and the way we write it as an abbreviation reminds us of that—but I think ordinary people use it in spoken conversation with no idea of its Latin origins, and it is something of a function word even though we don't traditionally think of it when making lists of function words.

That said, if we wanted to have a simple rule that abbreviations borrowed from Latin do not fit in any normal English POS category, then the correct tag would be X. Whether it's a borrowing or not should be irrelevant to choosing between NOUN, CCONJ, and PART.

Agreed that "op. cit.", "ibid.", etc. (ha) are not a good fit for PART, and it's hard to imagine anyone using them without knowing they're scholarly jargon borrowed from Latin.

amir-zeldes commented 2 years ago

then the correct tag would be X

I'm OK with that too.

Whether it's a borrowing or not should be irrelevant

Sorry, I didn't mean that the fact it's a borrowing is relevant, my intention was to say that, as a loanword, it comes from an open-ended process, and my expectation is that PART is a closed class. I could easily imagine other loans might behave idiosyncratically, and I wouldn't want them to seep into PART because we opened the door with "etc.". That's why I strongly prefer one of the open pos classes for "etc." (but that doesn't mean it has to be NOUN or ADJ; X is fine by me if you think that's better, and actually reflects xpos better).

nschneid commented 2 years ago

I'm less opposed to X than @manning and @sylvainkahane are. I agree with them in principle that it's a well-integrated word of English, but given that it doesn't seem to pattern distributionally like any other word of English, and it's often spelled as an abbreviation reflecting its origin, X may be a reasonable approach in practice.

That doesn't address @aryamanarora's point, though, where the equivalent word is not salient as a borrowing in Punjabi.

Yes, borrowings are more likely to end up in an open class, but if it now patterns distributionally like a closed-class item (or rather, unlike any open class item) I don't think the etymology should be relevant for choosing between non-X tags.

amir-zeldes commented 2 years ago

it doesn't seem to pattern distributionally like any other word of English

I think that's just because it's an acronym, no? It distributes pretty similarly to "and + NOUN", and based on the general most common treatment of acronyms in UD as stand-ins for their heads, tagging it as NOUN doesn't seem so strange to me. But if that's controversial then X is fine for me too, as I said.

That doesn't address @aryamanarora's point, though, where the equivalent word is not salient as a borrowing in Punjabi

Agreed, I don't know Punjabi and I'm definitely not making any statements on how it should be tagged in other languages, especially ones where formal morphology plays a more significant role in choosing POS categories. Just for English, I think it behaves most similarly to an acronym standing for "and + NOUN".

Stormur commented 2 years ago

That said, if we wanted to have a simple rule that abbreviations borrowed from Latin do not fit in any normal English POS category, then the correct tag would be X. Whether it's a borrowing or not should be irrelevant to choosing between NOUN, CCONJ, and PART.

That's how etc is currently annotated in the Latin treebanks that contain it, especially UDante (medieval, literary Latin). Features are applied to better frame it, specifically Abbr=Yes to acknowledge its origin and Compound=Yes to give back its structure. The choice of X is a kind of (literal) crux desperationis, since, as has been discussed here, it cannot really be assigned to anything else, and already in Latin it becomes very questionable whether it can be segmented into its components (et CCONJ 'and' and caeter-, neuter plural of undeterminable case from caeterus DET 'further (ones)'), let alone in other languages into which it has been borrowed. I agree with the dependency relation of conj and think that this is a rather uncontroversial choice.

I am opposed to choosing any lexical part of speech for etc, given that this "word" has maximally generic applications. Since I however think that X is the true "wastebasket" of parts of speech, once we abandon any idea of segmenting it and consider it a single unit, I can envision only one other choice which would make me feel more in harmony with the annotational universe:

I don't think this would be a problem: each abbreviation has its own history. Moreover, the problem with etc is that it has acquired its own life and cannot be truly analysed as its components anymore, and especially not like any other abbreviation, i.e. as a simple graphical variant.

nschneid commented 2 years ago

A writeup of the various points of view on "etc."

It was decided that, despite the unusual distribution, NOUN is the least objectionable tag, and conj is the appropriate deprel even if coordinated with things other than nominals (cf. "We went swimming, hiking, and other things").
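A treebank maintainer could check their data against this decision with a small consistency test along the following lines. It is only a sketch under stated assumptions: `check_etc` is a hypothetical helper, tokens are assumed to be (form, upos, deprel) tuples, and the expected UPOS is a parameter because, as noted earlier in the thread, the tag need not be the same in every language.

```python
def check_etc(tokens, expected_upos="NOUN", expected_deprel="conj"):
    """Return a list of messages for 'etc' tokens that deviate from the convention.

    tokens: iterable of (form, upos, deprel) tuples for one sentence.
    """
    problems = []
    for i, (form, upos, deprel) in enumerate(tokens, start=1):
        if form.lower().rstrip(".") != "etc":
            continue  # only the abbreviation itself is checked
        if upos != expected_upos:
            problems.append(f"token {i} ({form}): upos {upos}, expected {expected_upos}")
        if deprel != expected_deprel:
            problems.append(f"token {i} ({form}): deprel {deprel}, expected {expected_deprel}")
    return problems


# Invented examples: the first deviates in UPOS, the second follows the convention.
old_style = [("books", "NOUN", "root"), ("etc.", "X", "conj")]
new_style = [("books", "NOUN", "root"), ("etc.", "NOUN", "conj")]

print(check_etc(old_style))
print(check_etc(new_style))
```

For a language that keeps a different tag for its own equivalent, the same check can be run with another `expected_upos` value while leaving the conj expectation unchanged.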

nschneid commented 2 years ago

Documented:

Stormur commented 2 years ago

A writeup of the various points of view on "etc."

It was decided that, despite the unusual distribution, NOUN is the least objectionable tag, and conj is the appropriate deprel even if coordinated with things other than nominals (cf. "We went swimming, hiking, and other things").

I have to admit I am quite perplexed by this final choice, even after reading the final writeup. If we can agree that etc and similar "words" are on the functional side, as their stated generic anaphoricity strongly suggests, then I do not see why PRON could not be appropriate, being the functional counterpart of NOUN. It surely has a very specific distribution; but it surely has a deictic nature and it also ties in well with its contrastive/indefinite origin, if this has some role (as the choice of ADV for usw = und so weiter in German points to):

where ceterus is currently tagged as a DET with contrastive meaning (PronType=Con) in Latin (also an indefinite reading might be available). But in general, I think that all such terms should follow a unified annotation as long as they behave the same, as they seem to do.

I do not know if this derives from some generic resistance against opening the PRON class to some "non-canonical" (i.e. non-personal) elements, but etc seems a perfect candidate; the biggest vulnus for me anyway is to see it associated with a lexical class. I do not get this objection from the writeup:

In general, I think the speaker is suggesting a few members of a list and implying more and there is usually no anaphoric relation where the context or text provides other referents.

Is this really so different from indefinite pronouns like some?

dan-zeman commented 2 years ago

where ceterus is currently tagged as a DET with contrastive meaning (PronType=Con) in Latin

The discussion in the guidelines group was mostly (although not entirely) about the use of the word in English, where it is a loanword but many speakers no longer perceive it as code switching. It is somehow assumed/hoped that the decision will be applicable to other languages that use etc. as a loanword, although it hasn't been discussed thoroughly (I think Swedish was mentioned as an example). I suppose that Latin has the liberty to treat the expression as what it really is etymologically, given that it is not a loanword there.

PRON was indeed discussed as one of the options. None of the options was welcomed as a good solution, so instead of endlessly repeating the same objections back and forth, we gradually eliminated them one-by-one through voting. NOUN survived.

nschneid commented 2 years ago

Is this really so different from indefinite pronouns like some?

In EWT at least we consider some to be a DET, and someone to be a PRON.

Honestly the only thing we all agree on is that there is no good category for "etc." (in English anyway). It's sort of functional, and associated mainly with coordination, but doesn't seem as grammatically "core" as pronouns, and doesn't exist in a paradigm, which is why I think PRON seemed unintuitive (and PART). Nouns like "other" and "rest" can also have similar meanings. In reality, maybe it lies somewhere in between NOUN and PRON. Somebody should do a distributional corpus study and write a paper on it!