UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Filenames and other computery entities #666

Open nschneid opened 4 years ago

nschneid commented 4 years ago

The email genre of English-EWT lists file attachments, e.g. "Constellation Power (GISB draft).doc".

  1. Should filenames always be tokenized into discernible linguistic words ("ConstellationPower(GISB_draft).doc"), or only when there are spaces?
    • What about filesystem paths and URLs containing spaces?
    • Presumably we would never tokenize email addresses, hashtags, or variable names in code as these never contain spaces
  2. To what extent should annotators attempt to infer internal structure, like in titles of artistic works (#664)? E.g. the above could include two compound relations and an appos relation for the parenthetical. I'm not sure how ".doc" should attach—flat?

dan-zeman commented 4 years ago

If there are no spaces, I would keep ".doc" together with the main name in one token.

Then it seems natural to treat the filename as one word with spaces, although personally I am not a big fan of words with spaces. The dot (adjacent to a letter on both sides) makes it recognizable as a validation exception; without extension, it would be tokenized and analyzed like movie/book titles.

Or maybe we could do without words with spaces completely and only keep the last word together with ".doc" while the other words would be separate tokens.

amir-zeldes commented 4 years ago

I could see a case for using goeswith here - if you believe that filenames are 'single words' then in some sense they should be spelled together, but there is a space here. So it's somewhat similar to a single word broken up into two tokens because of a space?

martinpopel commented 4 years ago

@amir-zeldes If you know that a given filename does not include a space, but there is a typo in the text (e.g. "auto exec.bat" or "~/.bash rc") then you can use goeswith. However, nowadays there are many filenames containing spaces (e.g. "Constellation Power (GISB draft).doc" mentioned by @nschneid) and I think we should not use goeswith here. We should not break the rule that goeswith is reserved only for text that is not well edited and that by deleting the extra space you obtain a better edited text.

amir-zeldes commented 4 years ago

I think you definitely obtain a better file name by deleting spaces :)

But I see your point!

nschneid commented 3 years ago

Another question: should these be PROPN?

amir-zeldes commented 3 years ago

I think PROPN makes sense. In EWT xpos could also be either NNP or ADD, by analogy to URLs (I guess they are all like URIs?)

nschneid commented 3 months ago

Another reason to be skeptical about goeswith is that filenames-with-spaces are compositional and we don't think of them as having a single lemma in the language. So I think flat is the better choice.

Here's another example:

# sent_id = email-enronsent32_01-0039
# text = - GPSA Guaranty.doc
1   -   -   PUNCT   NFP _   2   punct   2:punct _
2   GPSA    gpsa    NOUN    GW  _   0   root    0:root  _
3   Guaranty.doc    guaranty.doc    X   NN  Number=Sing 2   flat    2:flat  _

The last part, "Guaranty.doc", has an odd combination of X and Number=Sing. Should it be treated as a NOUN? Should ".doc" be split off as a separate word and tagged as X?
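
For readers who want to find similar cases, here is a minimal sketch that parses the three rows above and lists tokens tagged X that still carry morphological features. The helper names (`parse_rows`, `x_with_feats`) are hypothetical, and the rows are split on whitespace only because the extracted example lost its tabs; real CoNLL-U files are tab-separated.

```python
# Rows copied from the example above. Whitespace splitting is a shortcut that
# only works here because no form contains a space (real CoNLL-U uses tabs).
ROWS = """\
1 - - PUNCT NFP _ 2 punct 2:punct _
2 GPSA gpsa NOUN GW _ 0 root 0:root _
3 Guaranty.doc guaranty.doc X NN Number=Sing 2 flat 2:flat _
"""

FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_rows(text):
    """Turn token lines into dicts keyed by the ten CoNLL-U column names."""
    tokens = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        tokens.append(dict(zip(FIELDS, line.split())))
    return tokens


def x_with_feats(tokens):
    """Return forms tagged X whose FEATS column is not empty."""
    return [t["form"] for t in tokens if t["upos"] == "X" and t["feats"] != "_"]


tokens = parse_rows(ROWS)
print(x_with_feats(tokens))
```

This only surfaces the combination described as odd in the comment; whether it should count as an error is the open question.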

AngledLuffa commented 3 months ago

What about parsing it as a single token? There's precedent for tokens with spaces in French for example when they represent a single concept

nschneid commented 3 months ago

What are some of the French examples? I was only aware of this being done for numbers where the space separator is merely for readability.

bguil commented 3 months ago

In French, spaces are only in numbers (examples).

The forms with space accepted by the validator have to be defined for each language with a regexp. For instance in French, it is: [0-9 ,]+ (defined here).
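
As a rough illustration of why this exception does not extend to filenames, the sketch below applies the French pattern quoted above to a few forms. The function name `space_allowed` is hypothetical, and the use of a full match is an assumption; the actual validator may apply the pattern differently.

```python
import re

# Pattern quoted in the comment above for French. How the real validator
# applies it (anchoring, normalization) is an assumption in this sketch.
FRENCH_SPACE_EXCEPTION = re.compile(r"[0-9 ,]+")


def space_allowed(form, exception=FRENCH_SPACE_EXCEPTION):
    """True if the form has no space, or if it fully matches the exception."""
    if " " not in form:
        return True
    return exception.fullmatch(form) is not None


for form in ["12 345,67", "GPSA Guaranty.doc", "Guaranty.doc"]:
    print(form, space_allowed(form))
```

A number with a space separator passes, while a filename with a space does not, because letters fall outside the character class.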

arademaker commented 3 months ago

In the case of files’ names, I really can’t see a reason to add a syntax relation between part of the names. Just one token with spaces in the form and lemma equal the form.

nschneid commented 3 months ago

In the case of files’ names, I really can’t see a reason to add a syntax relation between part of the names. Just one token with spaces in the form and lemma equal the form.

As @bguil notes, the validator requires a regex for narrow exceptions to the rule that words should generally not contain spaces. I don't see a good way to do this for just filenames.

flat is the usual solution for names containing spaces where no single head can be determined.

I am thinking of implementing the following policy for English-EWT:

  1. While some filenames contain portions that are recognizable as linguistic phrases, and some are recognizable as filenames due to endings (extensions) like ".doc", in general filenames do not necessarily follow any regular syntactic rules of the language. They also are not subject to normative rules of spacing or punctuation that would apply to normal text. Some filenames are very difficult to interpret structurally ("Stipulation -ECT-KEDNE re IGTS & Tennessee Cap Releases -FINAL.doc - BLACKLINE -Stip -ECT-KEDNE re Cap Releases -2-F.doc"). For simplicity, therefore, multi-token filenames are treated as flat structures across the board. (This is a bit like the foreign expression analysis except for filenames the words are typically from the natural language, just not their structure.)
  2. POS and morphology: Recognizable and tokenized words within the filename are tagged as they would be in the name of a book or article title.
  3. ".doc" and similar extensions may be tokenized separately (this is recommended for filenames with spaces to avoid spurious words like "draft.doc" where ".doc" is really appended to a longer phrase). Such extensions are tagged X. The lemma retains the exact form of the extension.
  4. The tokenization need not split words with no separating spaces. E.g. "ld2d-#69366-1" may be considered a single word (tagged PROPN).
  5. If a filename is not split, it is tagged PROPN.
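
The proposed policy can be sketched roughly as follows, assuming a standalone filename (so the first token is the root). The function `analyze_filename` and the `EXTENSION` pattern are hypothetical, the sketch splits on spaces only (punctuation such as parentheses is not separated), and UPOS for ordinary words is left as `_` because point 2 leaves that to the annotator.

```python
import re

# Hypothetical pattern: something, then a dot plus letters/digits at the end.
EXTENSION = re.compile(r"^(.+)(\.[A-Za-z0-9]+)$")


def analyze_filename(name):
    """Return (id, form, upos, head, deprel) rows for a standalone filename."""
    pieces = name.split()
    if len(pieces) == 1:
        # Points 4 and 5: a filename without spaces stays one word, tagged PROPN.
        return [(1, name, "PROPN", 0, "root")]
    # Point 3: for filenames with spaces, split the extension off the last piece.
    match = EXTENSION.match(pieces[-1])
    if match:
        pieces[-1:] = [match.group(1), match.group(2)]
    rows = []
    for i, form in enumerate(pieces, start=1):
        is_extension = match is not None and i == len(pieces)
        upos = "X" if is_extension else "_"  # "_" = to be tagged by the annotator
        # Point 1: flat structure, every later token attaches to the first one.
        head, deprel = (0, "root") if i == 1 else (1, "flat")
        rows.append((i, form, upos, head, deprel))
    return rows


for row in analyze_filename("GPSA Guaranty.doc"):
    print(row)
```

On "GPSA Guaranty.doc" this yields three tokens, with "Guaranty" and ".doc" both attached to "GPSA" as flat.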

arademaker commented 3 months ago

As @bguil notes, the validator requires a regex for narrow exceptions to the rule that words should generally not contain spaces.

Does it make sense to put the validator constraint before our syntactic theory? The validator is just a tool to help us ensure as much consistency as possible. We can always revise the tool if needed. Using a restriction of the tool as an argument for deciding guidelines doesn’t make sense to me.

nschneid commented 3 months ago

Not just the tool per se, but a widespread practice within UD, which (as I understand it) is a strong presumption against words-with-spaces for a range of kinds of named entities. Telephone numbers are perhaps a good analogy.

If we had something really syntactic to say about filenames, it would be one thing, but we're basically saying they're internally not governed by syntactic rules, and flat seems as good a way to do that as anything....

If there is a language where filenames within a sentence end up being morphologically inflected, for example, that might weigh in favor of a non-flat analysis so features can be assigned to the correct unit. But insofar as I am dealing with English and most of these filenames are standalone "sentences" from email metadata, I don't see this.

amir-zeldes commented 3 months ago

I haven't had to deal with them before, but I think my inclination would be to use goeswith but without Typo=Yes. In other words, I think of them as single tokens that unfortunately happen to have spaces, so they need to be linked with goeswith. Normally this is the result of a typo (space in the middle of a word), but in this case I wouldn't say it's a 'mistake', so I would just refrain from using Typo. I'm aware the goeswith guidelines say it's for badly spelled text, but I would prefer to extend the documentation to include files with spaces, rather than have multiple 'true tokens' with tags and deprels in there.

arademaker commented 3 months ago

IMHO, better than flat!

nschneid commented 3 months ago

I'm wary of removing the Typo=Yes requirement that we established for goeswith as (1) it's a reversal of a guidelines amendment and (2) it would create confusion as to whether Typo=Yes is appropriate for the vast majority of goeswith units (if it can't be checked for, people will forget to provide it).

And I don't see any particular problem with noting e.g. that "Releases" in the long filename I posted above is a plural noun attaching as flat (I am guessing; seems more likely than VERB). Words that are hard to decide a tag for can simply be X in this context.

Curious to hear @dan-zeman's opinion.

mr-martian commented 3 months ago

What if there was a requirement that Typo= accompany goeswith, but filenames and such were marked with Typo=No?

nschneid commented 3 months ago

Interesting idea...what would be the criterion for "and such"? :D I.e. what are the characteristics of expressions that this strategy should be used for, beyond filenames?

amir-zeldes commented 3 months ago

I guess that could work for phone numbers too?

mr-martian commented 3 months ago

Perhaps an inappropriate "and such" on my part, but I suppose that would cover any other tokens with spaces that aren't mistakes, though I have no examples ready to hand.

nschneid commented 3 months ago

So...named entities correctly including spaces but lacking regular internal syntax? I thought that's what flat was for—how to draw the boundary?

amir-zeldes commented 3 months ago

No, that's not how I understood it - I thought the idea was to use it for things we consider to be single 'words', which I guess could be things that have a single lexical category. For example, I think phone numbers are just numbers, so they have the single category NUM, and if they happen to be spelled with internal spaces, we could use goeswith to mean we think they are functioning as a single lexical item, but use Typo=No to indicate the spelling with space is expected/canonical.
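
To make the Typo=No proposal concrete, here is a small sketch of the check it would imply: every token that heads a goeswith relation must carry an explicit Typo value. This is only an illustration of the idea under discussion, not an existing validator rule; the function name and the phone-number annotation are hypothetical.

```python
def goeswith_heads_missing_typo(tokens):
    """Return ids of goeswith heads whose FEATS lack an explicit Typo value."""
    by_id = {t["id"]: t for t in tokens}
    head_ids = {t["head"] for t in tokens if t["deprel"] == "goeswith"}
    missing = []
    for head_id in sorted(head_ids):
        feats = by_id[head_id]["feats"]
        names = set() if feats == "_" else {f.split("=", 1)[0] for f in feats.split("|")}
        if "Typo" not in names:
            missing.append(head_id)
    return missing


# Hypothetical annotation of a phone number written with a space.
phone = [
    {"id": 1, "form": "555", "upos": "NUM", "feats": "Typo=No", "head": 0, "deprel": "root"},
    {"id": 2, "form": "0100", "upos": "X", "feats": "_", "head": 1, "deprel": "goeswith"},
]
print(goeswith_heads_missing_typo(phone))
```

With Typo=No on the first token the list of violations is empty; removing the feature would report token 1.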

nschneid commented 3 months ago

I thought the idea was to use it [broadened goeswith] for things we consider to be single 'words'

In general, how would we tell that though? If we're stepping away from the idea that wordhood, absent morphosyntactic cues, is defined by orthography, it seems like opening a can of worms....e.g. one could argue that a telephone number is made up of individual digits, each of which is in principle a word regardless of the spacing. Or one could argue that a foreign expression written with a space (et cetera) is actually a single word of English.

In my interpretation, flat and X already give us the fudge factor we need to deal with real data. Introducing an entirely new kind of wordhood seems risky unless there is a clear test.

amir-zeldes commented 3 months ago

Hm, OK - I don't urgently need anything to happen here, but it sounded like this was already being done for numbers with spaces, so in as far as someone had a criterion for why they used spaces in tokens, I think it would be the same criterion applying to this suggestion.

Concretely regarding filenames with spaces, they feel like the same sort of things as phone numbers with spaces to me. If a guideline is formulated which explicitly covers only phone numbers and files (or maybe URIs in general?), then I don't see the danger of a slippery slope. For me spaces in tokens are worse than almost any other solution!

nschneid commented 3 months ago

numbers with spaces

Yeah, maybe somebody can weigh in on what warranted that exception—I assume because it's routine in some orthographic styles to use spaces for thousands separators whereas we'd use commas, and numerals are so frequent in many genres that it would be cumbersome to break them up.

But space separation for special numeric entities (like telephone numbers) does NOT warrant this exception.

sylvainkahane commented 3 months ago

Every language can have some strange orthographic conventions for a couple of words. For instance, French has at least one: parce que 'because'. Nobody wants to have parce as a word, because it doesn't exist without que. But it would be costly and dangerous to relax the rule forbidding tokens with spaces just for this word and goeswith and Typo=No could be a better solution, I think.

dan-zeman commented 3 months ago

Curious to hear @dan-zeman's opinion.

I find flat better than goeswith. Also, if flat is the policy, it will require just a small clarification somewhere, while if goeswith is the policy, it will be an amendment and we will have to carefully scan the guidelines for places that talk about goeswith and say it is used only for ill-edited text.

I also like the flexibility that if a file name has spaces and is tokenized into multiple tokens, these may or may not get morphological analysis depending on what makes more sense in individual cases.

dan-zeman commented 3 months ago

numbers with spaces

Yeah, maybe somebody can weigh in on what warranted that exception—I assume because it's routine in some orthographic styles to use spaces for thousands separators whereas we'd use commas, and numerals are so frequent in many genres that it would be cumbersome to break them up.

But space separation for special numeric entities (like telephone numbers) does NOT warrant this exception.

Exactly. Spaces in numbers are regulated by the standardized spelling in Czech (as well as some other languages). Telephone numbers are not (and some people, like me, use hyphens instead of spaces in them). But at least telephone numbers are still "numbers" (plus punctuation), so I would not mind treating them the same way as normal numbers if the latter already can have spaces in the language. I would definitely not treat alphanumeric file names this way. And if the language does not have an exception for numbers, I would cluster telephone numbers with file names.

dan-zeman commented 3 months ago

Every language can have some strange orthographic conventions for a couple of words. For instance, French has at least one: parce que 'because'. Nobody wants to have parce as a word, because it doesn't exist without que. But it would be costly and dangerous to relax the rule forbidding tokens with spaces just for this word and goeswith and Typo=No could be a better solution, I think.

I think the standard solution we already have for this is fixed. No need for goeswith here.

sylvainkahane commented 3 months ago

@dan-zeman But fixed is for MWEs, no? parce que is a word, not an MWE. It is a word written with a space. As I said, parce is not a word of French, just a strange orthographic form.

jnivre commented 3 months ago

No, fixed is precisely for words with spaces (not for MWEs in general).


nschneid commented 3 months ago

Right, I think of the breakdown as follows:

Stormur commented 3 months ago

I would be in favour of keeping them as tokens with internal spaces. If not, I am not sure we really want to use flat, since this would mean that we would always have to analyse all the elements of all such file names as if they were "actual words". This seems really difficult to me, as these strings are mostly placeholders which occasionally contain strings looking like well-formed phrases, but this is misleading. For this reason, as discussed under another issue, I would argue for SYM as their part of speech. In this context, fixed might be the better choice in the end, even if in my personal opinion it seems to say something different than a token with spaces.

nschneid commented 3 months ago

The Core Group discussed this and decided on flat. I understand there is a concern about treating a filename as having multiple words that are in some sense linguistically independent units, but I think that's too strong of an interpretation of flat. Like fixed for grammatical expressions and goeswith for misspellings, flat can apply in some cases where the morphosyntactic notion of word contains multiple tokens per the tokenization. And tokenizing on (at minimum) spaces is a very strong convention for languages where the primary function of spaces is to show a word boundary.

X is available for the UPOS of tokens regarded as something smaller than a syntactic word (or not an "actual word", in line with @Stormur's concern). At the discretion of treebanks, a filename might be analyzed as containing some recognizable words with substantive UPOS/feats, or they might all be labeled X. The syntactic category of the whole filename can be signaled with ExtPos=PROPN.
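
As a small sketch of how ExtPos=PROPN could be added to the head token of a filename, the helper below inserts a feature into a FEATS string. The function name `add_feature` is hypothetical, and the case-insensitive alphabetical ordering of features is an assumption about what the validator expects.

```python
def add_feature(feats, name, value):
    """Add or overwrite one feature in a CoNLL-U FEATS string."""
    pairs = {} if feats == "_" else dict(item.split("=", 1) for item in feats.split("|"))
    pairs[name] = value
    # Assumed convention: features sorted alphabetically, ignoring case.
    ordered = sorted(pairs.items(), key=lambda kv: kv[0].lower())
    return "|".join(f"{key}={val}" for key, val in ordered)


print(add_feature("_", "ExtPos", "PROPN"))
print(add_feature("Number=Sing", "ExtPos", "PROPN"))
```

Only the first token of the flat structure would receive the feature; the flat dependents keep their own UPOS, or X.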

(In retrospect, perhaps instead of flat/fixed/goeswith it would have been better to have one relation for multi-token words and another relation for headless multi-word expressions. Something to consider for a potential UDv3.)

sylvainkahane commented 3 months ago

I think we are completely losing the meaning of the UD syntactic relations, or at least I am completely lost. flat is used for headless constructions, such as the "first name - second name" construction. They are particular constructions in the sense of CxG, for instance. It is true that flat:foreign is also used for foreign expressions, and in this case does not really refer to a headless construction, but ok. For the cases we are discussing here, I don't think they are headless constructions in any acceptable sense.

On the other hand, goeswith means 'goes with', that is, two tokens that should be together. It can be because of a misspelling or, as proposed, because of a strange orthographic convention. Contrary to flat, goeswith clearly indicates that there is no construction in this case. I think we should clearly separate dependency labels referring to syntactic constructions from non-linguistic dependency labels.

By the way, I am also a bit confused by @dan-zeman, @jnivre, @nschneid answers about fixed, saying that "fixed is for words-with-spaces". Do you really consider that whether or not, according to, all but, etc. are words of English? (see https://universal.grew.fr/?custom=66598aedccf83).

Stormur commented 3 months ago

With regard to fixed, there clearly is a problem in how it is used more than in how it is defined.

Tokenisation over spaces would be the opposite and complementary option of multiword tokens. I think it might be very useful to recognise that spaces are actually often used to separate things which are at an intermediate level between what we identify as syntactic words and phrases, but, like punctuation marks, cannot be an ultimate tokenisation criterion themselves. Whether this really has an impact on current parsers needs to be investigated, but from a machine point of view a space is just a character like any other.

I do not think I put forward too strong an interpretation of flat: it is defined to be used for "flat" phrases, so it entails a linguistic interpretation. A filename has no such interpretation, and neither does an email address, a phone number, any number expressed by means of symbols... so I think it should be avoided, because a file name, i.e. a single block of alphanumeric + other characters, is really different from a personal name with many components, which all by themselves are morphosyntactically analysable words.

By the way, flat is dangerously close to conj up to the point one wonders where the difference is, but this is another story...

jnivre commented 3 months ago

I think the point that “flat” indicates a construction but “goeswith” does not is a good one. I hadn’t thought of that. On the other hand, the main use of “goeswith” also carries the implication that it is accidental and erroneous, which doesn’t apply to the filename case (presumably), so one would have to decide which is the most important criterion.

When it comes to “fixed”, I do maintain that it should be restricted to “words with spaces”, as stated in the documentation, but its application across languages and treebanks is currently quite inconsistent. This is not least true about the Swedish treebanks, as pointed out by my colleague Lars Ahrenberg in a paper at this year’s UD workshop. In addition, I think there may be different conceptions of what a “word with spaces” is. You mention the example “parce que” in French and the fact that “parce” is only used in that combination. This is clearly a good indication that it is a word with spaces, but I don’t think the occurrence of such an element is a necessary condition.

Let me give the example of expressions referring to days in Swedish. The equivalent of “today” is “i dag” or “idag” (both orthographies are common and accepted as correct); the equivalent of “yesterday” is “i går” or “igår”. It so happens that “går” is like “parce”, that is, it only occurs in this combination (disregarding the homonymous verb form meaning “walk”), while “dag” is a regular noun meaning “day”. However, I would argue that both expressions are equally frozen in modern Swedish and should be analyzed as “fixed” when written with a space.

nschneid commented 3 months ago

I think @sylvainkahane is suggesting a primary distinction between multi-token words (words-with-spaces) and headless phrases (where individual elements might be omissible, modifiable, etc.). That sounds perfectly sensible to me, it's just not what UDv2 has given its narrow definitions of goeswith and fixed, and its broad definition of flat.

Some treebanks are using flat:foreign as a way to acknowledge that foreign expressions are a bit different in this regard from the flat expressions that are headless phrases. What about another subtype that would apply to the telephone numbers and filenames, e.g. flat:mtw for "multi-token word"?

If there is a UDv3 I do think the goeswith/fixed/flat relations should be reconsidered, also because many people expect the term "fixed" to cover morphosyntactically fixed expressions in general, whereas it is only intended for a small list of grammatical ones.

By the way, I am also a bit confused by @dan-zeman, @jnivre, @nschneid answers about fixed, saying that "fixed is for words-with-spaces". Do you really consider that whether or not, according to, all but, etc. are words of English? (see https://universal.grew.fr/?custom=66598aedccf83).

The current list of English fixed expressions is documented here. It is largely inherited from the Stanford Dependencies annotation of EWT, and there are definitely debatable cases in this list, as well as others that maybe should be added to the list (https://github.com/UniversalDependencies/UD_English-EWT/issues/400). I'm happy to discuss those separately, but for purposes of the present discussion, we should go by the universal definition at https://universaldependencies.org/u/dep/fixed.html.

LarsAhrenberg commented 3 months ago

I would like to express my support for @nschneid's suggestion that

If there is a UDv3 I do think the goeswith/fixed/flat relations should be reconsidered,

It is obvious from this discussion that so many long-time UD experts have different intuitions on how these relations should be used. And although the guidelines for fixed have been updated they are still not detailed enough. What is actually meant by 'the most grammaticalized cases'? In the paper @jnivre refers to, I try to identify (in Swedish) what I call rigid expressions, i.e. those showing no variation at all. But they are still too numerous to qualify as 'a closed class'.

The comment by @sylvainkahane that he sees flat as a relation for headless constructions I find interesting. The problem is that UD currently only recognizes one such construction, i.e. names. Currently, fixed is used for many expressions that have an internal head, such as ADP + NOUN, which we may call 'headed constructions' with the noun as the head even if it is non-determined. If UD keeps only one deprel for headless constructions, the distinction between names and fixed non-headed expressions (and typos) could instead be made with features, say in the MISC column. And with a feature for fixedness the headed fixed expressions could have both their syntax annotated (with deprels) and their status as fixed expressions represented.