Closed wellington36 closed 2 years ago
That's a very complicated word that does not fit the distribution of any other word. They are called extenders by Overstreet 2005. In spoken French we analyzed them as CCONJ, even if they are not equivalent to coordinating conjunctions. (More precisely, et and caetera are analysed as ADVs and the idiom they form as a CCONJ.) http://match.grew.fr/?corpus=SUD_French-Rhapsodie@latest&custom=618c2df1e11a8
Overstreet M. (2005). And stuff und so: Investigating Pragmatics Expressions in English and German. Journal of Pragmatics 37, 1845–1864.
For English, as I understand it, the idea is that "etc" is a foreign word, hence upos of X. But it attaches as ~~cc~~ conj.
A related thing that has been discussed but not resolved is the structure of "et al.", the options being:
- flat:foreign (treating it as a foreign idiom) and conj (like "etc.")
- cc (treating "et" as a conjunction) and conj for "al."
"etc" is a loan word in English, not a foreign word. X is not a good option.
Moreover it occupies the position of the last conjunct in a coordination and can commute with expressions such as "and so on". So I think it must be conj.
For English, as I understand it, the idea is that "etc" is a foreign word hence upos of X. But it attaches as cc.
You mean as conj, right? That's how it is in both EWT and GUM
it occupies the position of the last conjunct in a coordination and can commute with expressions such as "and so on"
Exactly, I think one of the reasons for this analysis, at least coming from GUM which has entity and coreference annotation, is that it behaves like a plural coordinate phrase and can corefer with one. So we can have:
"etc" is a loan word in English, not a foreign word
In the English corpora, the xpos tag FW is usually automatically converted to X, it's only 'foreign' because the PTB guidelines treated it this way. I agree it's not ideal, but I'm not sure if it's worth making the correspondence with xpos more piecemeal by changing this specific word's upos tag (though it doesn't matter too much to me personally)
Why not follow the German HDT and split et/cc cetera/noun? That is, etc is a MWT.
The second case of @nschneid right?
Moreover it occupies the position of the last conjunct in a coordination and can commute with expressions such as "and so on". So I think it must be conj.
But what upostag to use? That is why I prefer split "et cetera"
Moreover it occupies the position of the last conjunct in a coordination and can commute with expressions such as "and so on". So I think it must be conj.
But what upostag to use? That is why I prefer split "et cetera"
Just "et cetera"? Are other abbreviations split as well? In the English tokenization we only split off clitics.
This issue starts to overlap with #181 (and possibly also #112 and #516).
Two additional points about the CCONJ analysis of "etc". Semantically, "etc" contains the meaning of "and": "A, B, etc" is always a (semantic) conjunction (as opposed to the disjunction "A or B"). Syntactically, "etc" excludes other CCONJs: "A and B", "A, B, etc", but *"A and B etc". This mutual exclusion between "etc" and other CCONJs allows us to consider that they belong to the same distributional class, even if "etc" occupies another position. Of course, "etc" does not share all the properties of CCONJs, but it is the best choice among a list of bad choices. X is a non-choice. ADV does not make sense: "etc" has nothing in common with ADVs that modify a verb or an adjective. NOUN is worse: "etc" cannot occupy nominal positions and it can close any coordination (I would like to dance, jump, etc). PUNCT is used for written symbols that have only a suprasegmental counterpart in spoken language.
@dan-zeman is right, this issue is part of #181; should we close it here and continue there? I can't see etc tagged as ADV in Portuguese, but I may be wrong. We have 14 cases in Bosque. In #181, @manning was against splitting et cetera, but that would solve the tag problem considering the analysis from @sylvainkahane above.
Changing the tokenization for etc. would be a pretty radical break with LDC and other corpus behavior in English, so I would be strongly against it, and as @nschneid points out it is a slippery slope opening a huge number of questions regarding what to split or not to split (we also don't split acronyms, and I don't see that 'etc' is fundamentally different)
Latin "cetera" is a plural adjective meaning "remaining", so if it's not a foreign word, then I suppose it could be tagged with upos ADJ, but it's not that X offends me that much - the guidelines state that it is used with tokens that "for some reason cannot be assigned a real part-of-speech category", and I think it's OK that that guideline is fairly vague. As @sylvainkahane pointed out, it is basically sui generis, so no other tag fits well. In any case, "etc" seems more complex than the simple integrated loan word example of "sombrero":
https://universaldependencies.org/u/pos/X.html
Happy to move this to #181 if preferred.
I think I agree with almost everything @sylvainkahane writes, except that I don't come down on the side of CCONJ.
Yes, "etc." has a history whereby it comes from two Latin words. But it just doesn't seem a good synchronic analysis to say that it should be two words. Would we next split up "another" because it comes from two English words? I think most linguists regard it as a mistake to try to preserve diachrony in a synchronic description. Evidence for it being one word synchronically includes:
No one has argued against the current analysis and @sylvainkahane's argument here for conj: "Moreover it occupies the position of the last conjunct in a coordination and can commute with expressions such as "and so on". So I think it must be conj." This does seem to me the best way to treat it in the syntax. Treating it as cc would look very odd and not capture the idea of there being conjoined things. If you compare the two sentences "I'll bring sheets, etc." and "I'll bring sheets, towels", then I think we are best off representing both of them with a conj: sheets --conj--> etc. and sheets --conj--> towels.
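The parallel between the two sentences can be sketched in a few lines of Python. This is only an illustration: the rows are a simplified stand-in for CoNLL-U (ID, FORM, LEMMA, UPOS, HEAD, DEPREL), the helper names `edges` and `skeleton` are made up here, and the NOUN tag on "etc." is a placeholder, since the UPOS is exactly what is under debate.

```python
# Simplified rows: (ID, FORM, LEMMA, UPOS, HEAD, DEPREL).
# The UPOS given to "etc." is a placeholder, not a settled choice.
WITH_ETC = [
    (1, "I", "I", "PRON", 3, "nsubj"),
    (2, "'ll", "will", "AUX", 3, "aux"),
    (3, "bring", "bring", "VERB", 0, "root"),
    (4, "sheets", "sheet", "NOUN", 3, "obj"),
    (5, ",", ",", "PUNCT", 6, "punct"),
    (6, "etc.", "etc.", "NOUN", 4, "conj"),
]

# Same sentence with an ordinary noun as the last conjunct.
WITH_TOWELS = WITH_ETC[:5] + [(6, "towels", "towel", "NOUN", 4, "conj")]


def edges(rows):
    """Return (head form, deprel, dependent form) triples."""
    forms = {0: "ROOT"}
    forms.update({tid: form for tid, form, *_ in rows})
    return [(forms[head], deprel, form) for _, form, _, _, head, deprel in rows]


def skeleton(rows):
    """Return only the (head id, deprel) pairs, ignoring the words."""
    return [(head, deprel) for *_, head, deprel in rows]


for triple in edges(WITH_ETC):
    print(triple)

# Under the conj analysis the two trees have the same shape.
print(skeleton(WITH_ETC) == skeleton(WITH_TOWELS))
```

Under a cc analysis the last row of the first sentence would differ in its relation, so the two skeletons would no longer match.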
Several of the choices are definitely wrong:
The two plausible candidates correspond to the two halves of the meaning of "etc.": CCONJ or NOUN. I think we do have to accept that "etc." is a weird special word, and anything we do is shoving it into some category or another. @sylvainkahane gives the case for CCONJ. But I think we are better off calling it a NOUN:
Hmm. What about the argument that it can coordinate with non-nominals? "We need to mow the lawn, weed the garden, paint the mailbox, etc." "Bees swarmed everywhere: inside the hive, above the tree, etc."
Also, unlike other nouns, it must be the last element in a coordination.
Non-Latinate paraphrases:
@sylvainkahane points out "...and so on" is a valid paraphrase. Where this occurs in EWT it is advmod(on/ADV, so/ADV). GUM also treats both as ADV (although it is inconsistent about which is the head).
Another option is "...and more". Where this occurs we currently tag "more" as ADJ, though I'm not necessarily wedded to that.
It seems to me that no standard POS is a great fit because "etc." has a very special distribution (last element of a coordination of any type). I could see this being an argument to call it X (or ADV, in systems where that is the garbage category).
- "etcetera" can be pluralized, like a noun: "You haven't seen so many etceteras on one stage" (NYT), "Peternell encourages tossing in any other etceteras (green beans, Brussels sprouts, greens) from the holiday, too.".
This seems like a derived sense meaning "other miscellaneous/non-notable things", but of course nouns get derived from other parts of speech all the time.
Another idea is to call it cc:postconj, by analogy to cc:preconj ("both...and", "(n)either...(n)or"). On this analysis it is not a conjunct but a marker that follows the last conjunct to refine its meaning as a non-exclusive list. A downside of this is that it can occur after just one item ("The box contains books etc."). So it would be weird to say that is postcoordination "etc." (perhaps there it could be called ADV/advmod; cf. postmodifying "plus" as in "1 year plus").
Though I can't find any instances in GUM or EWT, "both" can also occur post-coordination: "We invited him and her both" (meaning 'We invited both him and her'). So that would be another potential justification for cc:postconj.
Latvian doesn't use etc. particularly often, but there are two common abbreviations we would like to annotate in a similar manner: 1) u.c. from un citi 'and others' 2) utt. from un tā tālāk 'and so on'
There are also a couple of rarer ones: u.t.j.p. (un tā jo projām 'and so on'), v.tml. (vai tamlīdzīgi 'or similar'), u.tml. (un tamlīdzīgi 'and similar'). Thus, after much discussion, we just assigned a separate tag (yd, that is, abbreviations serving as discourse markers) for them in our local tagset.
For UD needs we currently convert them to SYM with role conj, and we annotate etc. the same way when texts in our corpus use it. The SYM tag was born out of pure desperation and a lack of understanding of how to treat it in UD style, but for conj our thinking was that these small abbreviations usually end some kind of list by indicating that the written list is incomplete and lists only some of the items the writer was thinking of. That is, the Latvian thinking was that the abbreviation works as the final element of the list.
Anyway, I am very interested in the final conclusions of this discussion :)
I agree, @nschneid, that the fact that you can use “etc.” with things other than nominals is an argument against calling it a noun (though we do get unlike category coordination in English and you might possibly regard the verbal cases as ellipsis of “and [do] other things”). And I certainly agree that “no standard POS is a great fit”. I think we need to choose something as a convention. I agree that choosing CCONJ is also reasonable. I still suspect NOUN might be best. While in general “etc.” is final, one other usage to consider is that it can be repeated: “We’ll need sleeping bags, tents, water bottles, etc., etc.”
If people insist on viewing it as nominal I would think PRON would make more sense than NOUN. It is vaguely similar to "everything-else"—both in meaning, and in that it doesn't have a plural ending despite referring to multiple items.
But it also can't do things that nominals normally do, like head NPs (absent coordination), or be the antecedent for anaphora.
“and [do] other things”
I find this argument convincing for NOUN, and I guess actually that type of ellipsis would probably work in Latin as well (VP etc.), and there too it would be superficially an unlike coordination, but "cetera" would remain an adjective.
PRON would make more sense than NOUN. It is vaguely similar to "everything-else"
Mm, if we agree it's essentially a nominal I would prefer noun, I think it would be odd to say that it's a loan-pronoun just for the semantic reason that it is unspecific, and typologically loan-pronouns are quite rare. I also don't think it's considered a pronoun in Latin despite being semantically vague, and there are also some oddities about its use, such as repeatability ("etc. etc.") which don't really fit that profile.
And I certainly agree that “no standard POS is a great fit”. I think we need to choose something as a convention.
In the spirit of putting all options on the table, we could also consider PART. It is like a function word in that it only occurs in a particular grammatical construction. PART is essentially "miscellaneous function word".
How wedded are we to the cc:preconj relation for "both X and Y"? I ask because it always felt weird to me to call those CCONJ just because they are elements of a coordinating construction, as they are not the elements that link the conjuncts, but rather markers that refine the nature of the coordination.
FWIW, CGEL (p. 1305) calls "both" and "either" determinatives (as the POS) whether they occur in determiner position of an NP, or "function as marker of the first coordinate in correlative coordination". I.e.: CGEL does NOT consider "both" or "either" to be coordinating conjunctions when they occur within coordinate structures.
cc:preconj is perhaps too specific anyway as it applies to only a few lemmas.
If we were to decide that elaborations of a coordination relation are not CCONJ or cc(:preconj), but rather (say) ADV/advmod, this would bear on what we do for "etc."
cc:preconj relation for "both X and Y"?
I think that should be a separate issue, both because "etc." is an issue for many languages mentioned above which may or may not have similar problems with "both" and because I'd like to get to a decision on etc. I don't think this is too related, because "etc." is the last member of a coordination chain (i.e. it is one of the coordinates itself) and these premodifiers are something different (not members of the coordination itself).
The more I think about it the more I agree with @manning , I basically think it is interchangeable with "the rest" (a NOUN) or "others" (in English, due to the s-plural, also a NOUN by virtue of the NNS -> guidelines):
For me all of these work the same and argue for NOUN. English UD data has only three lemmas tagged PART: not, infinitive to and the genitive 's. I think putting "etc." on the same list would be odd, and considering how tricky this has turned out to be, I think there's nothing too wrong about NOUN (effectively making it be a way of saying "rest" or "others"). It's a simple solution that doesn't take too much explaining. If we agree it's deprel conj then a tag CCONJ is unexpected IMO, since that would mean the POS is determined by an internal dependent (etymological "et") and not the internal head ("cetera").
In the Swedish treebanks etc. and etcetera are currently consistently coded as ADV/conj. The choice of ADV I think is motivated by the usual argument that ADV is a category for words that don't fit elsewhere (as also @nschneid said) and it is what the dictionaries say. My proposal is that the treatment of etcetera can be language-specific and based on comparable words/phrases in the language, to the extent that they can be found.
In Swedish it can be compared to och så vidare, abbreviated commonly as osv. which mirrors the German und so weiter and usw, but also to med mera, abbreviated mm. m.m. or mm and med flera, abbreviated m.fl. or mfl. These however are introduced by an ADP (German mit, English with) and if spelled out would have a head with the dependency nmod or obl as the case may be. The function is quite similar to etc, however, as it ends or disrupts a listing of phrases. For this reason I would support a sub-dependency such as postconj.
A general argument to the English discussion: In UD function words usually count less than content words. Thus it is a bit odd that the part-of-speech for the abbreviations should be based on the first part (CCONJ or ADP) rather than what follows (ADV, NOUN or PRON).
The more I think about it the more I agree with @manning , I basically think it is interchangeable with "the rest" (a NOUN) or "others" (in English, due to the s-plural, also a NOUN by virtue of the NNS -> guidelines):
But this is a semantic argument, because it is also paraphrasable with "and so on" or "and more", neither of which are nominal.
In fact, depending on the syntactic status of what is coordinated, these may sound better than "and the rest" or "and others":
"Trinidadian, Jamaican, and so on" sounds reasonable. I suppose you could awkwardly say "Trinidadian, Jamaican, and other nationalities" but that creates a new mention that can serve as an antecedent ("... Those other nationalities are..."). I don't think you can do that with "etc.":
"I bought Alice an apple, Bob a banana, Caleb a carrot, etc." - to paraphrase that with nouns you'd have to say "and other people other things"? The point is that "etc." can be used even with conjuncts that are not traditional constituents!
But this is a semantic argument
It's not just a semantic argument, since the etymology is literally a coordination of a nominalized neuter plural adjective, and it's easier to explain it as a noun than, say, a verb, even though it could stand for either in coordination
"Trinidadian, Jamaican, and so on" sounds reasonable
So are you saying it should be ADV? I honestly don't feel very passionately about this word (except maybe opposing PRON and PART, since those are currently nice, small, closed classes in the English data), so I could live with that if that is the consensus. But if there are mostly contexts where a noun paraphrase works best and a few rare ones for ADV, I'd tend to go with the more common version, especially if it matches the etymology (easier to explain to people that it translates to "and the rest", rather than saying we equated it with "and so on", which is not really related).
If we're not considering it a foreign word or tokenizing it as two words I don't see how etymology is relevant. "Etc." to English speakers is probably not quite the same as "et cetera" to Latin speakers.
I would be fine with ADV or possibly CCONJ or PART. I just don't see how "etc." fits any of the standard distributional tests for NOUN in English.
This should be addressed in the universal guidelines but it should be made clear there that the UPOS tag is not necessarily the same in all languages (while the conj deprel probably can be used everywhere), especially if they have their own equivalent instead of the Latin loanword. For example, the Czech equivalent is atd., standing for a tak dále “and so further”. It is tagged ADV in the Czech corpora (http://hdl.handle.net/11346/PMLTQ-L8ZB), presumably because both tak and dále are adverbs. On the other hand, I don't think that this necessarily applies to English and I find NOUN quite acceptable among all the bad options for English etc.
Since I agree with conj, I'm also OK with NOUN for English, since it very (most?) often coordinates with nominals, and NOUN is more or less the most generic choice (similar to "and stuff"). I can change it at least in GUM, but EWT should ideally be the same.
Etc. occurs at the end of coordinations. Are there other examples of NOUNs that occur in lexically productive combinations, but just in one position of one particular construction? (Not hapaxes in frozen expressions like kith and kin.)
Are there other examples of NOUNs that occur in lexically productive combinations, but just in one position of one particular construction? (Not hapaxes in frozen expressions like kith and kin.)
Perhaps all Chinese classifiers?
Etc. occurs at the end of coordinations
I think that's natural, because it contains a word meaning "and" (which is why it gets deprel conj). I think it basically corresponds to a combination of CCONJ+HEAD, where the HEAD etymologically corresponds to an adjectival phrase ("the remaining"), and which in context can be coordinate with anything (incl. not just nouns, as in "books etc.", but also VPs, as you and others discussed above). If that's right, then it should be tagged like HEAD (same as acronyms), but if we want a single tag for this word, then we need to make a concrete choice.
Of the options NOUN, ADJ and VERB, I think NOUN is among the more generic choices, and basically corresponds to saying "and stuff", or "and the rest". I don't think ADJ is terrible either, but in terms of distribution I find both better than VERB or ADV (for the latter, it doesn't specify something like manner of a predicate or intensity of some adjective, and doesn't stand before either, the typical functions and positions of an adverb); for ADJ I would note there is no adjectival comparative or negation, so it is perhaps better to choose NOUN from multiple perspectives.
Are there other examples of NOUNs that occur in lexically productive combinations, but just in one position of one particular construction? (Not hapaxes in frozen expressions like kith and kin.)
Perhaps all Chinese classifiers?
Interesting, I didn't realize that. But at least those are modifiers within NPs right?
What about PART, as it is a category for syntactically exceptional items? Possessive 's occurs only at the end of an NP and infinitive marker to only at the beginning of a clause, and these do not share the wider distribution of other categories in English.
Perhaps all Chinese classifiers?
Interesting, I didn't realize that. But at least those are modifiers within NPs right?
I suppose so.
Just adding another data point: the Punjabi translational equivalent ਆਦਿ ādi I tagged as PART since it takes no nominal declensions, has no apparent gender, and only occurs at the end of coordinations; it doesn't seem to type well with any other part of speech. It also doesn't really have the same weirdness of et cetera as a potentially foreign word, since Sanskrit loans are common and fully incorporated into the lexicon in Punjabi.
As I mentioned above, currently the inventory of PART in English is only the negation "not", infinitive "to" and the genitive "'s". All three are highly common, indeclinable function words; adding "etc.", which is a learned loan-item, seems out of place in that list, and also makes it a bit odd that it is coordinated so often with nouns (we say "dogs etc." but not "to etc.", "not etc." or "'s etc.") - of course coordination doesn't have to occur between like items, but it most often does.
If anyone is curious, here is the distribution of the coordinate item in GUM:
NOUN 10 PROPN 1 ADJ 1 VERB 1
Also wanted to add to @manning 's dictionary survey that dictionary.com concurs with Merriam Webster in labeling it as a noun (and listing the plural from @manning 's example as well):
If anyone is curious, here is the distribution of the coordinate item in GUM:
NOUN 10 PROPN 1 ADJ 1 VERB 1
Not as overwhelmingly skewed in EWT—roughly 45 NOUN+PROPN, 10 VERB, 3 ADJ, 2 ADV. (I say "roughly" because some of them look like annotation errors.)
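Counts like the GUM and EWT figures above can be reproduced with a short script. The following is a minimal sketch, not the query that was actually run: it assumes plain CoNLL-U input, a hypothetical set of spellings in `ETC_FORMS`, and that "etc." is attached to the first conjunct via conj. The sample sentences and the `row` helper are invented for illustration.

```python
from collections import Counter

ETC_FORMS = {"etc", "etc.", "etcetera"}  # assumed spellings; extend as needed


def parse_sentences(conllu_text):
    """Yield each sentence as {token id: list of 10 columns}.

    Comments, multiword-token ranges (1-2) and empty nodes (1.1) are skipped.
    """
    sentence = {}
    for line in conllu_text.splitlines():
        if not line.strip():
            if sentence:
                yield sentence
                sentence = {}
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        sentence[cols[0]] = cols
    if sentence:
        yield sentence


def etc_head_upos(conllu_text):
    """Count the UPOS of the token that each "etc." is attached to via conj."""
    counts = Counter()
    for sent in parse_sentences(conllu_text):
        for cols in sent.values():
            if cols[1].lower() in ETC_FORMS and cols[7] == "conj":
                head = sent.get(cols[6])
                if head is not None:
                    counts[head[3]] += 1
    return counts


def row(tid, form, upos, head, deprel):
    """Build one tab-separated CoNLL-U line with unused columns left as '_'."""
    return "\t".join([str(tid), form, form.lower(), upos, "_", "_",
                      str(head), deprel, "_", "_"])


# Invented two-sentence sample, only to exercise the function.
SAMPLE = "\n".join([
    "# text = books , etc.",
    row(1, "books", "NOUN", 0, "root"),
    row(2, ",", "PUNCT", 3, "punct"),
    row(3, "etc.", "X", 1, "conj"),
    "",
    "# text = dance , jump , etc.",
    row(1, "dance", "VERB", 0, "root"),
    row(2, ",", "PUNCT", 3, "punct"),
    row(3, "jump", "VERB", 1, "conj"),
    row(4, ",", "PUNCT", 5, "punct"),
    row(5, "etc.", "X", 1, "conj"),
])

print(etc_head_upos(SAMPLE))
```

On a real treebank one would read the `.conllu` file into a string and pass it to `etc_head_upos`; annotation errors of the kind mentioned above would still need manual review.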
Also wanted to add to @manning 's dictionary survey that dictionary.com concurs with Merriam Webster in labeling it as a noun (and listing the plural from @manning 's example as well):
That's the spelled-out version which can be pluralized as "etceteras". For "etc." it merely says "abbreviation", which is a cop-out IMO. :) https://www.dictionary.com/browse/etc
Anyway I agree that "etc." is not as frequent as other PART items, but is frequency a necessary criterion? I thought PART was basically for words that are extremely constrained and exceptional grammatically, and tend on the functional side.
Regarding coordination, I think there are multiple constructions at play:
Oh I realized another thing: In its post-coordination use, there is a standard way to emphasize the magnitude of the "etc."—by repeating it: I bought an apple, a banana, a carrot, etc. etc. Not by pluralizing it, as you would expect if it were nominal (*I bought an apple, a banana, a carrot, many etceteras), and not by adding an intensifier, as you would expect for an adjective or adverb (*I bought an apple, a banana, a carrot, very etc.).
This repetition is not just a marginal thing, BTW: COCA has >2k hits for "etc etc".
Not as overwhelmingly skewed in EWT—roughly 45 NOUN+PROPN, 10 VERB, 3 ADJ, 2 ADV
OK, but if we have to choose one, then it looks like EWT supports NOUN too
We ate cake, drank beer, etc.: I would consider this the main use
Based on frequencies, the main use is for lists of nominals (18/24 in GUM, I missed a few earlier because I forgot to search without the period too)
It would not be crazy to call it CCONJ along similar lines as cc:preconj items "both"/"neither"/"either" being tagged CCONJ
This idea will run into problems when there is only one item before "etc", as in "books etc." CCONJ basically operates in patterns like "X CCONJ/cc Y/conj", and in the cc:preconj pattern in "CCONJ/cc:preconj X CCONJ/cc Y/conj". If we only have "X etc.", then it is not clear what CCONJ is functioning as a coordinator for: we are missing the second conjunct IMO which is what licenses the coordinating conjunction.
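To see how often the bare "X etc." pattern actually arises, one could search for "etc." conjuncts that have no sibling conjunct and no cc of their own. A rough sketch follows, using simplified (id, form, head, deprel) tuples instead of full CoNLL-U rows; the function name `lone_etc`, the `ETC_FORMS` set and both example sentences are assumptions made for illustration.

```python
ETC_FORMS = {"etc", "etc."}  # assumed spellings


def lone_etc(sentence):
    """Return True if some "etc." is the only conj dependent of its head
    and has no cc attached to it, i.e. the bare "X etc." pattern.

    `sentence` is a list of (id, form, head, deprel) tuples.
    """
    for tid, form, head, deprel in sentence:
        if form.lower() not in ETC_FORMS or deprel != "conj":
            continue
        other_conjuncts = [t for t in sentence
                           if t[2] == head and t[3] == "conj" and t[0] != tid]
        own_cc = [t for t in sentence if t[2] == tid and t[3] == "cc"]
        if not other_conjuncts and not own_cc:
            return True
    return False


# "books etc.": a single item followed by "etc."
books_etc = [(1, "books", 0, "root"), (2, "etc.", 1, "conj")]

# "apples, pears, etc.": "etc." closes a longer list
list_etc = [
    (1, "apples", 0, "root"),
    (2, ",", 3, "punct"),
    (3, "pears", 1, "conj"),
    (4, ",", 5, "punct"),
    (5, "etc.", 1, "conj"),
]

print(lone_etc(books_etc), lone_etc(list_etc))
```

Sentences flagged by such a check are the ones where a CCONJ tag would leave the coordinator without a second conjunct to link.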
there is a standard way to emphasize the magnitude of the "etc."—by repeating it: I bought an apple, a banana, a carrot, etc. etc.
Sure, but I don't see how that would rule out a noun. I can say "all day it was just letters letters letters" and I don't think that detracts from "letters" being a noun (and here too, I would attach them via conj)
is frequency a necessary criterion? I thought PART was basically for words that are extremely constrained and exceptional grammatically, and tend on the functional side.
Traditionally I think PART is something like a wastebasket for things that don't fit elsewhere (and seem to usually be indeclinable). In some languages they form organic classes based on some criterion (for example the Classical Greek particles, which unlike adverbs obey Wackernagel's Law).
But TBH I have never felt that UD English need upos=PART at all; in my opinion the best upos for those three items would have been:
The last one is maybe more debatable, but all of them look more plausible to me as particles than "etc.", maybe also because they are closed class items (function words, as you say), whereas "etc." is a scholarly loan, which although unique, seems to come from an open borrowing process (I don't want to see words like "op. cit.", "ibid." and "scil." or who knows what else creep into the particle class). I fully agree that "etc." is odd, but essentially I think having a noun that only appears in coordinations is less odd than a particle that only appears in coordinations, and actually shares some properties with referring expressions.
Maybe "etc." started out as a scholarly loan—and the way we write it as an abbreviation reminds us of that—but I think ordinary people use it in spoken conversation with no idea of its Latin origins, and it is something of a function word even though we don't traditionally think of it when making lists of function words.
That said, if we wanted to have a simple rule that abbreviations borrowed from Latin do not fit in any normal English POS category, then the correct tag would be X. Whether it's a borrowing or not should be irrelevant to choosing between NOUN, CCONJ, and PART.
Agreed that "op. cit.", "ibid.", etc. (ha) are not a good fit for PART, and it's hard to imagine anyone using them without knowing they're scholarly jargon borrowed from Latin.
then the correct tag would be X
I'm OK with that too.
Whether it's a borrowing or not should be irrelevant
Sorry, I didn't mean that the fact it's a borrowing is relevant, my intention was to say that, as a loanword, it comes from an open-ended process, and my expectation is that PART is a closed class. I could easily imagine other loans might behave idiosyncratically, and I wouldn't want them to seep into PART because we opened the door with "etc.". That's why I strongly prefer one of the open pos classes for "etc." (but that doesn't mean it has to be NOUN or ADJ; X is fine by me if you think that's better, and actually reflects xpos better).
I'm less opposed to X than @manning and @sylvainkahane are. I agree with them in principle that it's a well-integrated word of English, but given that it doesn't seem to pattern distributionally like any other word of English, and it's often spelled as an abbreviation reflecting its origin, X may be a reasonable approach in practice.
That doesn't address @aryamanarora's point, though, where the equivalent word is not salient as a borrowing in Punjabi.
Yes, borrowings are more likely to end up in an open class, but if it now patterns distributionally like a closed-class item (or rather, unlike any open class item) I don't think the etymology should be relevant for choosing between non-X tags.
it doesn't seem to pattern distributionally like any other word of English
I think that's just because it's an acronym, no? It distributes pretty similarly to "and + NOUN", and based on the general most common treatment of acronyms in UD as stand-ins for their heads, tagging it as NOUN doesn't seem so strange to me. But if that's controversial then X is fine for me too, as I said.
That doesn't address @aryamanarora's point, though, where the equivalent word is not salient as a borrowing in Punjabi
Agreed, I don't know Punjabi and I'm definitely not making any statements on how it should be tagged in other languages, especially ones where formal morphology plays a more significant role in choosing POS categories. Just for English, I think it behaves most similarly to an acronym standing for "and + NOUN".
That said, if we wanted to have a simple rule that abbreviations borrowed from Latin do not fit in any normal English POS category, then the correct tag would be X. Whether it's a borrowing or not should be irrelevant to choosing between NOUN, CCONJ, and PART.
That's how etc is currently annotated in the Latin treebanks using it, especially UDante (medieval, literary Latin). Features are applied to better frame it, specifically Abbr=Yes to acknowledge its origin and Compound=Yes to give back its structure. The choice of X is a kind of (literal) crux desperationis, since, as has been discussed here, it cannot really be assigned to anything else, and already in Latin it becomes very questionable whether it can be segmented into its components (et CCONJ 'and' and caeter-, neuter plural of undeterminable case from caeterus DET 'further (ones)'), let alone in other languages into which it has been borrowed. I agree with the dependency relation of conj and think that this is a rather uncontroversial choice.
I am opposed to choosing any lexical part of speech for etc, given that this "word" has maximally generic application. However, since I think that X is the true "wastebasket" of parts of speech, once we abandon any idea of segmenting it and consider it a single unit, I can envision only one other choice which would make me feel more in harmony with the annotational universe:
PART: this follows from the fact that a particle is, as it were, the epitome of the functional word. etc (and all its cousins, like the Greek κτλ = και τα λοιπά and many others) is pure function indeed: it is just an expander of a co-ordinated series of any length and at the same time acts as its closing. It does not have any autonomous meaning whatsoever: as said, it is maximally general. So, if I were to give etc citizenship among POSs, my first choice would be PART.
I could easily imagine other loans might behave idiosyncratically, and I wouldn't want them to seep into PART because we opened the door with "etc.".
I don't think this would be a problem: each abbreviation has its own history. Moreover, the problem with etc is that it has acquired its own life and cannot be truly analysed as its components anymore, and especially not as any other abbreviation, i.e. as simple graphical variant.
A writeup of the various points of view on "etc."
It was decided that, despite the unusual distribution, NOUN is the least objectionable tag, and conj is the appropriate deprel even if coordinated with things other than nominals (cf. "We went swimming, hiking, and other things").
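A treebank maintainer who wants to bring existing data in line with this decision could list the deviating tokens with a small check. This is a hedged sketch rather than an official validator: `check_etc`, `ETC_FORMS` and the sample rows are hypothetical, and the expected pair is passed as parameters because the decision was discussed mainly for English.

```python
ETC_FORMS = {"etc", "etc."}  # assumed spellings; adjust per treebank


def check_etc(conllu_text, upos="NOUN", deprel="conj"):
    """Return (line number, form, upos, deprel) for every "etc." token
    whose UPOS/deprel pair differs from the expected one."""
    problems = []
    for line_no, line in enumerate(conllu_text.splitlines(), start=1):
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # comments, blank lines, MWT ranges, empty nodes
        if cols[1].lower() in ETC_FORMS and (cols[3], cols[7]) != (upos, deprel):
            problems.append((line_no, cols[1], cols[3], cols[7]))
    return problems


def row(tid, form, upos, head, deprel):
    """Build one tab-separated CoNLL-U line with unused columns left as '_'."""
    return "\t".join([str(tid), form, form, upos, "_", "_",
                      str(head), deprel, "_", "_"])


# Invented sample still using the older X tag.
SAMPLE = "\n".join([
    "# text = books , etc.",
    row(1, "books", "NOUN", 0, "root"),
    row(2, ",", "PUNCT", 3, "punct"),
    row(3, "etc.", "X", 1, "conj"),
])

for problem in check_etc(SAMPLE):
    print(problem)
```

Languages that settle on a different tag for their own equivalent can pass it explicitly, for example `check_etc(text, upos="ADV")`.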
A writeup of the various points of view on "etc."
It was decided that, despite the unusual distribution, NOUN is the least objectionable tag, and conj is the appropriate deprel even if coordinated with things other than nominals (cf. "We went swimming, hiking, and other things").
I have to admit I am quite perplexed by this final choice, even after reading the final writeup. If we can agree that etc and similar "words" are on the functional side, as their stated generic anaphoricity strongly suggests, then I do not see why PRON could not be appropriate, being the functional counterpart of NOUN. It surely has a very specific distribution; but it surely has a deictic nature and it also ties in well with its contrastive/indefinite origin, if this has some role (as the choice of ADV for usw = und so weiter in German points to):
where ceterus is currently tagged as a DET with contrastive meaning (PronType=Con) in Latin (also an indefinite reading might be available). But in general, I think that all such terms should follow a unified annotation as long as they behave the same, as they seem to do.
I do not know if this derives from some generic resistance against opening the PRON class to some "non canonical" (i.e. non personal) elements, but etc seems a perfect candidate; the biggest vulnus for me anyway is to see it associated with a lexical class. I do not get this objection from the writeup:
In general, I think the speaker is suggesting a few members of a list and implying more and there is usually no anaphoric relation where the context or text provides other referents.
Is this really so different from indefinite pronouns like some?
where ceterus is currently tagged as a DET with contrastive meaning (PronType=Con) in Latin
The discussion in the guidelines group was mostly (although not entirely) about the use of the word in English, where it is a loanword but many speakers no longer perceive it as code switching. It is somehow assumed/hoped that the decision will be applicable to other languages that use etc. as a loanword, although it hasn't been discussed thoroughly (I think Swedish was mentioned as an example). I suppose that Latin has the liberty to treat the expression as what it really is etymologically, given that it is not a loanword there.
PRON was indeed discussed as one of the options. None of the options was welcomed as a good solution, so instead of endlessly repeating the same objections back and forth, we gradually eliminated them one-by-one through voting. NOUN survived.
Is this really so different from indefinite pronouns like some?
In EWT at least we consider some to be a DET, and someone to be a PRON.
Honestly the only thing we all agree on is that there is no good category for "etc." (in English anyway). It's sort of functional, and associated mainly with coordination, but doesn't seem as grammatically "core" as pronouns, and doesn't exist in a paradigm, which is why I think PRON seemed unintuitive (and PART). Nouns like "other" and "rest" can also have similar meanings. In reality, maybe it lies somewhere in between NOUN and PRON. Somebody should do a distributional corpus study and write a paper on it!
Analyzing the expression etc in the Portuguese-Bosque corpus (https://github.com/UniversalDependencies/UD_Portuguese-Bosque/issues/386), we identified inconsistencies in this annotation across other UD corpora:
- English (EWT and GUM): use upos equal to X.
- German (HDT): separates etc into et and cetera.
- French (ParTUT, GSD and Sequoia): varies between INTJ (ParTUT), X and ADV (GSD) and ADV (Sequoia).
- Spanish (AnCora and GSD): varies between PUNCT (AnCora) and ADV (GSD).
- Italian (ISDT and VIT): varies between ADV (ISDT) and NOUN (VIT).