UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Internet addresses (URLs, emails), phone numbers: PROPN vs. SYM #973

Closed nschneid closed 11 months ago

nschneid commented 1 year ago

The guidelines list these under SYM: https://universaldependencies.org/u/pos/SYM.html

But I think PROPN would be more appropriate, for several reasons:

  1. Most things tagged SYM are symbols, i.e. special characters like "$", whereas internet addresses and phone numbers are largely alphanumeric.
  2. Internet addresses and phone numbers are open-class and function syntactically and semantically like names/noun phrases.
  3. In the several treebanks I checked in Grew-match, only one actually uses SYM for internet addresses and phone numbers. (English-GUM and German-GSD use PROPN. Wolof-WTB uses NOUN and English-EWT uses X, but those are not great solutions IMO.)

Stormur commented 1 year ago

I think that PROPN is already problematic enough and I would not like to see it extended. I actually find the treatment as SYM reasonable if they cannot be split up into their individual pieces. Maybe they would need some feature to mark them as such (and they would need this anyway).

By the way, I would not say that phone numbers are an open class: any kind thereof is actually closed, exactly defined by a given regular expression.

dan-zeman commented 1 year ago

I tend to agree with @Stormur. URLs seem borderline, neither a perfect SYM, nor a prototypical PROPN. SYM itself is a bit of an odd category because it contains things that are written in a special way (usually non-alphanumeric) but are not PUNCT and can be pronounced; but that pronounced thing, if written differently, would have its own non-SYM POS category (at least some of them; not sure about emoticons). URLs are partly (in large part, actually) alphanumeric, but they always contain special characters as well, so you could say this is what makes them closer to SYM and further from words (which rarely have special characters other than hyphens, right?).

nschneid commented 1 year ago

I guess I'm not sure what linguistic benefit there is to separating URLs/email addresses from other names just because their spelling mixes in some nonalphanumeric characters. It seems to me that distributionally they are PROPNs: they can be subjects, objects, and if they're predicates in a language with copulas they would require a copula. They can be possessive if short enough ("cnn.com's privacy policy"). Is there a salient morphological difference? It seems like it would be unusual to pluralize a URL or email address, but pluralization is also unusual for certain subclasses of regular PROPNs, like first names, for semantic reasons.

amir-zeldes commented 1 year ago

I tend to agree with @nschneid , I mean we see a lot of user names and e-mails used as participants, so essentially vocatives in contexts like this: "Hi @dan-zeman !" And in Google docs the latter is often a full e-mail address. I'll also add that they are open class in the same way that proper names are - as soon as there is a new website or e-mail address, we use that to refer to that unique thing. So it seems more PROPN than SYM to me.

dan-zeman commented 1 year ago

they can be subjects, objects,

Yes. Just like other things that are tagged SYM, they can be just about anything (and the validator has to allow SYM virtually everywhere). I'm not saying I find this the best solution possible, but it's been in UD since v1. If we want to change it, why just URLs and e-mails? Why not other words? For example, $ is a noun. So is %, at least in Czech. And it would be a far-reaching change, so before adopting an amendment that will reverse the current guideline, we should carefully examine all treebanks, figure out how much work it would be, and provide a script that will help people fix it.

sylvainkahane commented 1 year ago

I agree with @nschneid and others that words that belong in the same way, that have the same, should be in the same POS. If there is a problem, it comes from SYM. For French, we have already done the job proposed by @dan-zeman. To conform to the guidelines we kept SYM but we added an ExtPos. It appears that only 37 of 719 occurrences of SYM cannot receive another POS. See https://universal.grew.fr/?custom=651bbeaaebac5.

sylvainkahane commented 1 year ago

*that have the same distribution

Most of the ExtPos have been added using Grew rules. Such rules can be provided and applied to all the treebanks as suggested by @dan-zeman. (Grew-match, which almost all UD people now know, allows you to encode a pattern and see all the trees where this pattern matches. Grew allows you to replace a pattern with another one. ArboratorGrew allows you to upload any CoNLL file, apply a Grew rule, check the results and validate them.)

nschneid commented 1 year ago

I didn't realize this has been in the guidelines for so long...I only bring it up because I notice the guideline doesn't match what many treebanks are doing. (Including EWT, which is one of the earliest UD corpora right?) Maybe it's a sign that it's not a great fit for the rest of the category (nonalphanumeric, typically single-character symbols).

Looking at the GSD corpora (https://universal.grew.fr/?custom=651c143ebcc5e) there is a mix of strategies: SYM, PROPN, NOUN, X, even a case of PUNCT.

Is there a way to search all treebanks together in Grew-match?

dan-zeman commented 1 year ago

I don't know about Grew but cross-treebank search is possible in Teitok. Go to https://lindat.mff.cuni.cz/services/teitok/ud212/index.php?action=cqp and in the CQL Query field, put e.g. this

[word="http.*"]

Then click on the "Universal POS tag" button to see UPOS together with the words.
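For anyone who prefers to run the same kind of check offline, here is a minimal Python sketch of the idea behind that query: find tokens whose form starts with "http" and count their UPOS tags. It assumes plain CoNLL-U input with the standard 10 tab-separated columns; the function name `upos_counts` and the sample rows are made up for illustration and are not taken from any treebank or from Teitok.

```python
import re
from collections import Counter

# Same idea as the query [word="http.*"]: the whole FORM must match.
URL_FORM = re.compile(r"http.*")

def upos_counts(conllu_text, form_pattern=URL_FORM):
    """Count UPOS tags of tokens whose FORM fully matches form_pattern."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        form, upos = cols[1], cols[3]
        if form_pattern.fullmatch(form):
            counts[upos] += 1
    return counts

# Made-up sample rows (hypothetical, not from a real treebank).
SAMPLE = "\n".join([
    "# text = Visit http://example.org",
    "1\tVisit\tvisit\tVERB\t_\t_\t0\troot\t_\t_",
    "2\thttp://example.org\thttp://example.org\tSYM\t_\t_\t1\tobj\t_\t_",
    "",
    "# text = See https://example.com",
    "1\tSee\tsee\tVERB\t_\t_\t0\troot\t_\t_",
    "2\thttps://example.com\thttps://example.com\tPROPN\t_\t_\t1\tobj\t_\t_",
])

print(upos_counts(SAMPLE))
```

Running it over the concatenated `.conllu` files of several treebanks would give a rough cross-treebank tally, though without the per-treebank breakdown that the Teitok interface offers.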

nschneid commented 1 year ago

Thanks! Results show tags are all over the map.

Mostly URLs:

[three images of the search results]

Email addresses (not as many in the data):

[image of the search results]

Stormur commented 1 year ago

Some thoughts:

In short, I think that if one does not want to analyse the components of a URL, it should be treated as a non-lexical element, hence as a SYM. Then, it surely needs a feature to single it out as a URL, and this is something that one wants to have even independently of its part of speech.

nschneid commented 1 year ago

Thank you for bringing up the guidelines for metalinguistic mentions. They say: "The universal POS tags should capture regular, prevailing syntactic behavior, as well as morphological characteristics when available, and should not reflect sentence-specific exceptional behavior."

What I am saying is that URLs used in a sentence ordinarily act like nominals, not that they are coerced into nominals in certain sentences (e.g. by omitting a head noun or mentioning rather than using the word).

My expectation would be that items functioning primarily as nominals should be NOUN or PROPN by default—unless they are symbols (SYM), numerals (NUM), or part of the grammatical system of nominal anaphora (PRON).

Evidently there are different opinions about the orthographic criteria for SYM. Because there are symbols that can be shorthand for nouns ("$" for "dollar", "%" for "percent"), coordinating conjunctions ("&" = "and", "/" = "or"), verbs (maybe the typical use of <3 for "love"), interjections (smiley emoticons?), etc., I like @sylvainkahane's point that this can be captured with ExtPos. This makes it possible, for example, to check compatibility with deprels, and for someone to query for all instances of a part of speech if they care about how the word is pronounced rather than written. But I still think we should come to a consensus on the criteria for SYM.
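To make the querying point concrete, here is a small Python sketch, assuming ExtPos is stored in the FEATS column as `ExtPos=...`. The helper names (`parse_feats`, `effective_pos`, `forms_with_pos`) and the token list are hypothetical and only for illustration; they are not part of any UD tool.

```python
def parse_feats(feats):
    """Turn a CoNLL-U FEATS string such as 'ExtPos=NOUN|Number=Sing' into a dict."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))

def effective_pos(upos, feats):
    """Return ExtPos when the token carries one, otherwise the plain UPOS."""
    return parse_feats(feats).get("ExtPos", upos)

def forms_with_pos(tokens, pos):
    """List the forms whose effective part of speech equals pos."""
    return [form for form, upos, feats in tokens if effective_pos(upos, feats) == pos]

# Made-up (form, UPOS, FEATS) triples, for illustration only.
tokens = [
    ("%", "SYM", "ExtPos=NOUN"),
    ("&", "SYM", "ExtPos=CCONJ"),
    ("dollar", "NOUN", "Number=Sing"),
    ("+", "SYM", "_"),
]

print(forms_with_pos(tokens, "NOUN"))
```

With this kind of lookup, a query for nouns returns both the spelled-out word and the symbol that stands for one, while SYM tokens without ExtPos stay out of the result.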

amir-zeldes commented 1 year ago

I find it questionable to consider URLs an "open class", as again they can be defined by a regular expression

Open class just means we can't enumerate all of the possibilities, unlike for example, pronouns, which are a closed set in each language. Regular expressions are really just descriptions of finite state automata, and word formation in many languages can be described using a finite state morphology, but we still don't call nouns or verbs in those languages 'closed class'. The 'content' part of a URL can include any words you like, so saying a URL is something like "https://([a-z]\.)+[a-z]+" is not so different from saying that a Latin o-stem declension noun is something like "s?([^aeiou][aeiou]+[^aeiou])+us" (neither of these is the exact pattern of course, but hopefully the idea is clear).
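A quick way to see this is to test a pattern against items nobody has listed in advance. The Python sketch below uses two deliberately loose, hypothetical patterns (they are not the ones quoted in the comment above, and neither is an accurate grammar of URLs or of Latin nouns); the made-up items are accepted just as readily as the attested ones.

```python
import re

# Hypothetical, deliberately loose patterns (assumptions for illustration).
URL_LIKE = re.compile(r"https://([a-z0-9-]+\.)+[a-z]+")
US_NOUN_LIKE = re.compile(r"[a-z]*[^aeiou]us")

def all_match(pattern, candidates):
    """True if every candidate string fully matches the pattern."""
    return all(pattern.fullmatch(c) is not None for c in candidates)

# Attested items next to made-up ones: the patterns accept both.
urls = ["https://universaldependencies.org", "https://brand-new-site.example"]
nouns = ["lupus", "dominus", "blorfus"]

print(all_match(URL_LIKE, urls), all_match(US_NOUN_LIKE, nouns))
```

Having a pattern for the shape of a class therefore says nothing about whether its membership is closed.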

why URLs should not then be PRON

For the same reason that referring NPs like "the matter" or "this issue" are not pronouns, and more generally even proper names: "Germany" stands for an actual country, but as a word it is an independent and unique reference, regardless of the reality of the denotation. Imaginary places like "Narnia" are the same - they are proper nouns because they have the morphosyntax (can be subjects or objects, resist use of articles in English) and the semantics (unique reference) of names. URLs fit this pattern: we don't say "the google.com" but "I searched for it on google.com", like "I looked for it in Walmarts".

we are dealing with written language

This is an aside, but we are not only dealing with written language - here is a spoken example from YouTube where someone utters their handle in speech, with the spoken "at" (=@)

https://universal.grew.fr/?custom=651d763018680

nschneid commented 1 year ago

URLs and email addresses are interesting categories because their meaning is in the context of text-based technology, and because they may mash together multiple "normal" words (this can be considered a kind of compounding/word formation).

I want to return to @dan-zeman's point about pronunciation:

SYM itself is a bit of an odd category because it contains things that are written in a special way (usually non-alphanumeric) but are not PUNCT and can be pronounced; but that pronounced thing, if written differently, would have its own non-SYM POS category (at least some of them; not sure about emoticons).

In the sentence "Please visit universaldependencies.org", I would read the parts of the URL directly, pronouncing it as 'universal dependencies dot org'. (This includes "dot", which is a term for a punctuation mark, because web addresses are a text-based medium.) So the way the URL is written, while adhering to special orthographic conventions because of the technology it derives its meaning from, is not fundamentally different from how it's pronounced (its alphanumeric parts are treated like spaceless words). There is no alternative "non-abbreviated" way to write a URL if we want to capture how it actually works in the technology.

This is different from "$" which is simply pronounced 'dollar(s)' in most cases—it is an orthographic shorthand, hence SYM.

Stormur commented 1 year ago

What I am saying is that URLs used in a sentence ordinarily act like nominals, not that they are coerced into nominals in certain sentences (e.g. by omitting a head noun or mentioning rather than using the word).

My expectation would be that items functioning primarily as nominals should be NOUN or PROPN by default—unless they are symbols (SYM), numerals (NUM), or part of the grammatical system of nominal anaphora (PRON).

This would lead to labelling everything as NOUN: why distinguish PRONs or NUMs or anything else? If they behave as nouns, we will see this by means of the relations nsubj, obj, etc. This is an old issue: we cannot project the syntactic relations back onto the parts of speech...

Regular expressions are really just descriptions of finite state automata, and word formation in many languages can be described using a finite state morphology,

I think there is still quite a substantial difference. Given a language, maybe I can describe its word formation processes with finite automata, but this is all a posteriori; for websites, I know their structure a priori and I can actually produce (and enumerate) all (infinite) possibilities.

Anyway, the discussion here seems to imply that websites could be an open class because they are formed by elements of lexical open classes, since they are often made of some kinds of phrases, e.g. you-tube, le-monde... But then why not analyse them as phrases?

why URLs should not then be PRON

For the same reason that referring NPs like "the matter" or "this issue" are not pronouns, and more generally even proper names: "Germany" stands for an actual country, but as a word it is an independent and unique reference, regardless of the reality of the denotation. Imaginary places like "Narnia" are the same - they are proper nouns because they have the morphosyntax (can be subjects or objects, resist use of articles in English) and the semantics (unique reference) of names. URLs fit this pattern: we don't say "the google.com" but "I searched for it on google.com", like "I looked for it in Walmarts".

I do not see the connection between pronouns and generic and abstract words like matter or issue, or website for the issue at hand. I can just imagine that with time they can be grammaticalised into pronouns, but that's it. Anyway, PROPNs are just NOUNs, their identification as "proper" relies entirely on extra-linguistic factors and some morphosyntactic behaviours are a consequence of these, and not the other way round. But the fact is that all these names, Germany, Narnia, matter are words. Websites are not. They are conventional alphanumeric strings that serve as links to something else. They are pointers and it is what they point to that has an actual name. In my opinion, then, we might easily consider the spoken enunciation of a website as metalinguistic.

This is an aside, but we are not only dealing with written language - here is a spoken example from YouTube where someone utters their handle in speech, with the spoken "at" (=@)

https://universal.grew.fr/?custom=651d763018680

When it becomes enunciated, it is "translated" into actual words, so at jas the nurse, but in the written medium, it is symbolic. But as I mentioned before, I would be in favour of a split treatment (so in this case separating the symbol @ from the phrase and analysing this as an MWT).

amir-zeldes commented 1 year ago

this is all a posteriori; for websites, I know their structure a priori

I don't think so - as you pointed out, website URLs generally contain words, so if we can't know all words a priori, then we can't know all URLs. If you just mean that URL content (aside from https://) has allowable characters, then this is not very different from saying that words must follow the pattern of pronounceable syllables in a language. Both of these are true, and both of these are as infinite and enumerable as each other IMO. This is different from PRONs, which are a closed (and typically quite small) class.

This would lead to labelling everything as NOUN: why distinguish PRONs or NUMs or anything else

I agree that (nominal) pronouns are essentially the same as nouns syntactically (and adverbial ones are adverbs, etc). NUM is not the same as NOUN, since cardinal numbers have a very different distribution externally: We can say "three dictionary volumes", but we can't say "volumes three dictionary" - numbers occupy the initial position in NPs (and follow the D article in a DP model, or adjoin to it in a definite NP view). Internally, both PRON and PROPN are distinct from NOUN, since at least in English, they resist use of an article (though this is not true for everything that UD annotates as PROPN, on semantic grounds).

the connection between pronouns and generic and abstract words like matter or issue

I just meant that these can both be anaphoric: "[The trial]_1 ended. [It/the matter]_1 worried me". But the pronoun class is closed, so even content-poor nouns like "matter", which behave morphosyntactically like nouns, are not included in it, despite their semantic similarity.

Stormur commented 1 year ago

What I want to highlight is some circularity in the suggestion of treating website addresses as more or less full-fledged nouns: if they are such because they are made up of "regular" words, then it is these words that should be annotated accordingly (e.g. by MWT), possibly using markings such as Form=Website or similar; but if they are not words, then they cannot be treated as other lexical elements, be they NOUNs, PRONs, or else.

nschneid commented 1 year ago

It occurred to me to look up the PTB SYM tag, which is narrower:

[image: the PTB definition of the SYM tag]

For illustration, here are the current lemmas with xpos=SYM in GUM:

[image: lemmas with xpos=SYM in GUM]

and EWT:

[image: lemmas with xpos=SYM in EWT]

The alphanumeric ones may be debatable but the point is that this is a narrowly defined class, and symbols that are just ways of abbreviating regular words ("%", "&") receive the same XPOS as the spelled-out word.* This also avoids questions like whether it is appropriate to have a Number feature for nouns spelled as symbols (UniversalDependencies/UD_English-EWT#445).

* OK this isn't strictly true for "$" because PTB has a $ tag reserved for currency symbols, distinct from the word "dollar" which would be NN. UPOS, of course, has no category just for currency. xpos=$ is currently mapped to upos=SYM but could just as easily be mapped to NOUN.

amir-zeldes commented 1 year ago

Yes, all of this strengthens my sense that 'symbols' is something different from things like URLs. We should probably retag the long examples you have above for GUM, including the IPA one, and the two file names.

if they are such because they are made up of "regular" words, then it is these words that should be annotated accordingly

In theory such an expression could be a single 'word', for example #Victory or google.com, where '.com' is not really a word. I don't think the idea of tagging these as nouns (or really proper nouns) is about trying to retrieve the (possibly multiple) lexical roots inside the URL etc. - the intention is to say that they function as names as a whole. It's true that something like "https://comparethemarket.com" might be considered to contain 'words', but so do compounds like "blackbird", and we still treat them as monolithic nouns (although we could do a morphological analysis in another layer, e.g. MSeg in MISC).

nschneid commented 11 months ago

The Core Group decided that PROPN, not SYM, is appropriate for internet addresses and mixed alphanumeric telephone numbers. Moved the examples.
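For treebank maintainers who want to follow this decision, a retagging pass might look roughly like the Python sketch below. It is only a starting point under stated assumptions: the two regular expressions are hypothetical and rough, the function name `retag_line` is made up, and only tokens currently tagged SYM are touched. Phone numbers are left out on purpose, and any automatic change would still need manual review (features and lemmas are not adjusted here).

```python
import re

# Hypothetical, rough patterns; real data would need more careful ones.
URL_RE = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[A-Za-z]{2,}")

def retag_line(line):
    """Return a CoNLL-U line with UPOS changed from SYM to PROPN for URLs and emails."""
    if line.startswith("#"):
        return line  # sentence-level comment
    cols = line.split("\t")
    if len(cols) != 10 or cols[3] != "SYM":
        return line  # not a token line, or not tagged SYM
    form = cols[1]
    if URL_RE.fullmatch(form) or EMAIL_RE.fullmatch(form):
        cols[3] = "PROPN"
    return "\t".join(cols)

# Made-up example rows (not from a real treebank).
url_row = "2\thttp://example.org\thttp://example.org\tSYM\t_\t_\t1\tobj\t_\t_"
percent_row = "2\t%\t%\tSYM\t_\t_\t1\tobj\t_\t_"

print(retag_line(url_row))
print(retag_line(percent_row))
```

Symbols such as "%" are left as SYM because they match neither pattern.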