Open nschneid opened 4 years ago
If there are no spaces, I would keep ".doc" together with the main name in one token.
Then it seems natural to treat the filename as one word with spaces, although personally I am not a big fan of words with spaces. The dot (adjacent to a letter on both sides) makes it recognizable as a validation exception; without extension, it would be tokenized and analyzed like movie/book titles.
Or maybe we could do without words with spaces completely and only keep the last word together with ".doc" while the other words would be separate tokens.
I could see a case for using goeswith here - if you believe that filenames are 'single words' then in some sense they should be spelled together, but there is a space here. So it's somewhat similar to a single word broken up into two tokens because of a space?
@amir-zeldes If you know that a given filename does not include a space, but there is a typo in the text (e.g. "auto exec.bat" or "~/.bash rc") then you can use goeswith. However, nowadays there are many filenames containing spaces (e.g. "Constellation Power (GISB draft).doc" mentioned by @nschneid) and I think we should not use goeswith here. We should not break the rule that goeswith is reserved for text that is not well edited, where deleting the extra space yields a better edited text.
I think you definitely obtain a better file name by deleting spaces :)
But I see your point!
Another question: should these be PROPN?
I think PROPN makes sense. In EWT xpos could also be either NNP or ADD, by analogy to URLs (I guess they are all like URIs?)
Another reason to be skeptical about goeswith is that filenames-with-spaces are compositional and we don't think of them as having a single lemma in the language. So I think flat is the better choice.
Here's another example:
# sent_id = email-enronsent32_01-0039
# text = - GPSA Guaranty.doc
1 - - PUNCT NFP _ 2 punct 2:punct _
2 GPSA gpsa NOUN GW _ 0 root 0:root _
3 Guaranty.doc guaranty.doc X NN Number=Sing 2 flat 2:flat _
The last part, "Guaranty.doc", has an odd combination of X and Number=Sing. Should it be treated as a NOUN? Should ".doc" be split off as a separate word and tagged as X?
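The odd combination noted above (UPOS X together with Number=Sing) is easy to look for mechanically. The following is a minimal sketch, not part of any UD tool: it assumes whitespace-separated columns as the example is shown in this thread (real CoNLL-U is tab-separated, which matters once forms may contain spaces), and the helper names are hypothetical.

```python
# Sketch only: find tokens tagged X that still carry morphological features.
# Assumption: columns are whitespace-separated, as displayed in this thread.
COLUMNS = ["id", "form", "lemma", "upos", "xpos",
           "feats", "head", "deprel", "deps", "misc"]

SAMPLE = """\
# sent_id = email-enronsent32_01-0039
# text = - GPSA Guaranty.doc
1 - - PUNCT NFP _ 2 punct 2:punct _
2 GPSA gpsa NOUN GW _ 0 root 0:root _
3 Guaranty.doc guaranty.doc X NN Number=Sing 2 flat 2:flat _
"""

def parse_tokens(block):
    """Turn token lines into dicts, skipping comments and blank lines."""
    tokens = []
    for line in block.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split()
        if len(cols) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns: {line!r}")
        tokens.append(dict(zip(COLUMNS, cols)))
    return tokens

def x_with_feats(tokens):
    """Return the forms of tokens with UPOS X and a non-empty FEATS column."""
    return [t["form"] for t in tokens
            if t["upos"] == "X" and t["feats"] != "_"]

suspicious = x_with_feats(parse_tokens(SAMPLE))
print(suspicious)
```

Whether such tokens should become NOUN or lose their features is exactly the open question here; the check only locates the cases.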
What about parsing it as a single token? There's precedent for tokens with spaces in French for example when they represent a single concept
What are some of the French examples? I was only aware of this being done for numbers where the space separator is merely for readability.
In the case of file names, I really can't see a reason to add a syntactic relation between parts of the name. Just use one token with spaces in the form, and a lemma equal to the form.
As @bguil notes, the validator requires a regex for narrow exceptions to the rule that words should generally not contain spaces. I don't see a good way to do this for just filenames.
flat is the usual solution for names containing spaces where no single head can be determined.
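To make the "narrow exception" idea concrete, here is a rough sketch of what a regex-based rule for words with spaces could look like. The pattern and the function name are hypothetical and are not taken from the actual validator; the sketch only illustrates why digit groups are easy to whitelist while filenames are not.

```python
import re

# Hypothetical exception: digit groups separated by single spaces,
# e.g. "1 000 000". This is an illustration, not the validator's real rule.
NUMBER_WITH_SPACES = re.compile(r"[0-9]{1,3}(?: [0-9]{3})+")

def space_allowed(form):
    """Accept forms without spaces, or forms matching the narrow exception."""
    if " " not in form:
        return True
    return NUMBER_WITH_SPACES.fullmatch(form) is not None

for form in ["1 000 000", "Guaranty.doc", "GPSA Guaranty.doc",
             "Constellation Power (GISB draft).doc"]:
    print(form, space_allowed(form))
```

A filename with spaces can contain almost any characters, so a pattern loose enough to accept it would no longer be a narrow exception.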
I am thinking of implementing the following policy for English-EWT:
As @bguil notes, the validator requires a regex for narrow exceptions to the rule that words should generally not contain spaces.
Does it make sense to put the validator constraint before our syntactic theory? The validator is just a tool to help us ensure as much consistency as possible. We can always revise the tool if needed. Using a restriction of the tool as an argument for deciding guidelines doesn't make sense to me.
Not just the tool per se, but a widespread practice within UD, which (as I understand it) is a strong presumption against words-with-spaces for a range of kinds of named entities. Telephone numbers are perhaps a good analogy.
If we had something really syntactic to say about filenames, it would be one thing, but we're basically saying they're internally not governed by syntactic rules, and flat seems as good a way to do that as anything....
If there is a language where filenames within a sentence end up being morphologically inflected, for example, that might weigh in favor of a non-flat analysis so features can be assigned to the correct unit. But insofar as I am dealing with English and most of these filenames are standalone "sentences" from email metadata, I don't see this.
I haven't had to deal with them before, but I think my inclination would be to use goeswith but without Typo=Yes. In other words, I think of them as single tokens that unfortunately happen to have spaces, so they need to be linked with goeswith. Normally this is the result of a typo (space in the middle of a word), but in this case I wouldn't say it's a 'mistake', so I would just refrain from using Typo. I'm aware the goeswith guidelines say it's for badly spelled text, but I would prefer to extend the documentation to include files with spaces, rather than have multiple 'true tokens' with tags and deprels in there.
IMHO, better than flat!
I'm wary of removing the Typo=Yes requirement that we established for goeswith as (1) it's a reversal of a guidelines amendment and (2) it would create confusion as to whether Typo=Yes is appropriate for the vast majority of goeswith units (if it can't be checked for, people will forget to provide it).
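The point about checkability can be illustrated with a small sketch. It assumes that Typo=Yes is expected on the head of a goeswith relation; the token dictionaries and the function name are hypothetical and this is not the validator's code.

```python
def goeswith_heads_missing_typo(tokens):
    """Return ids of goeswith heads that lack Typo=Yes (assumed convention)."""
    by_id = {t["id"]: t for t in tokens}
    missing = []
    for t in tokens:
        if t["deprel"] != "goeswith":
            continue
        head = by_id[t["head"]]
        has_typo = "Typo=Yes" in head["feats"].split("|")
        if not has_typo and head["id"] not in missing:
            missing.append(head["id"])
    return missing

# "auto exec.bat": an accidental space, annotated as a typo.
typo_case = [
    {"id": 1, "form": "auto", "head": 0, "deprel": "root", "feats": "Typo=Yes"},
    {"id": 2, "form": "exec.bat", "head": 1, "deprel": "goeswith", "feats": "_"},
]
# A filename with an intended space, linked with goeswith but no Typo=Yes.
filename_case = [
    {"id": 1, "form": "GPSA", "head": 0, "deprel": "root", "feats": "_"},
    {"id": 2, "form": "Guaranty.doc", "head": 1, "deprel": "goeswith", "feats": "_"},
]

print(goeswith_heads_missing_typo(typo_case))
print(goeswith_heads_missing_typo(filename_case))
```

If goeswith were extended to intended spaces, a check like this could no longer tell a forgotten Typo=Yes from a deliberate omission.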
And I don't see any particular problem with noting e.g. that "Releases" in the long filename I posted above is a plural noun attaching as flat (I am guessing; seems more likely than VERB). Words that are hard to decide a tag for can simply be X in this context.
Curious to hear @dan-zeman's opinion.
What if there was a requirement that Typo= accompany goeswith, but filenames and such were marked with Typo=No?
Interesting idea...what would be the criterion for "and such"? :D I.e. what are the characteristics of expressions that this strategy should be used for, beyond filenames?
I guess that could work for phone numbers too?
Perhaps an inappropriate "and such" on my part, but I suppose that would cover any other tokens with spaces that aren't mistakes, though I have no examples ready to hand.
So...named entities correctly including spaces but lacking regular internal syntax? I thought that's what flat was for—how to draw the boundary?
No, that's not how I understood it - I thought the idea was to use it for things we consider to be single 'words', which I guess could be things that have a single lexical category. For example, I think phone numbers are just numbers, so they have the single category NUM, and if they happen to be spelled with internal spaces, we could use goeswith to mean we think they are functioning as a single lexical item, but use Typo=No to indicate the spelling with space is expected/canonical.
I thought the idea was to use it [broadened goeswith] for things we consider to be single 'words'
In general, how would we tell that though? If we're stepping away from the idea that wordhood, absent morphosyntactic cues, is defined by orthography, it seems like opening a can of worms....e.g. one could argue that a telephone number is made up of individual digits, each of which is in principle a word regardless of the spacing. Or one could argue that a foreign expression written with a space (et cetera) is actually a single word of English.
In my interpretation, flat and X already give us the fudge factor we need to deal with real data. Introducing an entirely new kind of wordhood seems risky unless there is a clear test.
Hm, OK - I don't urgently need anything to happen here, but it sounded like this was already being done for numbers with spaces, so insofar as someone had a criterion for why they used spaces in tokens, I think it would be the same criterion applying to this suggestion.
Concretely regarding filenames with spaces, they feel like the same sort of things as phone numbers with spaces to me. If a guideline is formulated which explicitly covers only phone numbers and files (or maybe URIs in general?), then I don't see the danger of a slippery slope. For me spaces in tokens are worse than almost any other solution!
numbers with spaces
Yeah, maybe somebody can weigh in on what warranted that exception—I assume because it's routine in some orthographic styles to use spaces for thousands separators whereas we'd use commas, and numerals are so frequent in many genres that it would be cumbersome to break them up.
But space separation for special numeric entities (like telephone numbers) does NOT warrant this exception.
Every language can have some strange orthographic conventions for a couple of words. For instance, French has at least one: parce que 'because'. Nobody wants to have parce as a word, because it doesn't exist without que. But it would be costly and dangerous to relax the rule forbidding tokens with spaces just for this word and goeswith and Typo=No could be a better solution, I think.
Curious to hear @dan-zeman's opinion.
I find flat better than goeswith. Also, if flat is the policy, it will require just a small clarification somewhere, while if goeswith is the policy, it will be an amendment and we will have to carefully scan the guidelines for places that talk about goeswith and say it is used only for ill-edited text.
I also like the flexibility that if a file name has spaces and is tokenized into multiple tokens, these may or may not get morphological analysis depending on what makes more sense in individual cases.
numbers with spaces
Yeah, maybe somebody can weigh in on what warranted that exception—I assume because it's routine in some orthographic styles to use spaces for thousands separators whereas we'd use commas, and numerals are so frequent in many genres that it would be cumbersome to break them up.
But space separation for special numeric entities (like telephone numbers) does NOT warrant this exception.
Exactly. Spaces in numbers are regulated by the standardized spelling in Czech (as well as some other languages). Telephone numbers are not (and some people, like me, use hyphens instead of spaces in them). But at least telephone numbers are still "numbers" (plus punctuation), so I would not mind treating them the same way as normal numbers if the latter already can have spaces in the language. I would definitely not treat alphanumeric file names this way. And if the language does not have an exception for numbers, I would cluster telephone numbers with file names.
Every language can have some strange orthographic conventions for a couple of words. For instance, French has at least one: parce que 'because'. Nobody wants to have parce as a word, because it doesn't exist without que. But it would be costly and dangerous to relax the rule forbidding tokens with spaces just for this word and goeswith and Typo=No could be a better solution, I think.
I think the standard solution we already have for this is fixed. No need for goeswith here.
@dan-zeman But fixed is for MWEs, no? parce que is a word, not a MWE. It is a word written with a space. As I said, parce is not a word of French, just a strange orthographic form.
No, fixed is precisely for words with spaces (not for MWEs in general).
Right, I think of the breakdown as follows:
- goeswith is for incorrectly added spaces (typos)
- fixed is for words-with-spaces (grammatical elements of a language that are conventionally spelled with a space for historical reasons)
- flat is for other cases where no single syntactic head can be identified; typical examples are named entities that aren't structured by general syntactic relations, foreign/borrowed phrases, and repetitions or sound sequences.
I would be in favour of keeping them as tokens with internal spaces. If not, I am not sure we really want to use flat, since this would mean that we would always have to analyse all the elements of all such file names as if they were "actual words". This seems really difficult to me, as these strings are mostly placeholders which occasionally contain strings looking like well-formed phrases, but this is misleading. For this reason, as discussed under another issue, I would vie for SYM as their part of speech. In this context, fixed might be the better choice in the end, even if in my personal opinion it seems to say something different from a token with spaces.
The Core Group discussed this and decided on flat. I understand there is a concern about treating a filename as having multiple words that are in some sense linguistically independent units, but I think that's too strong of an interpretation of flat. Like fixed for grammatical expressions and goeswith for misspellings, flat can apply in some cases where the morphosyntactic notion of word contains multiple tokens per the tokenization. And tokenizing on (at minimum) spaces is a very strong convention for languages where the primary function of spaces is to show a word boundary.
X is available for the UPOS of tokens regarded as something smaller than a syntactic word (or not an "actual word", in line with @Stormur's concern). At the discretion of treebanks, a filename might be analyzed as containing some recognizable words with substantive UPOS/feats, or they might all be labeled X. The syntactic category of the whole filename can be signaled with ExtPos=PROPN.
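As an illustration of the decision described above, here is a minimal sketch of how "GPSA Guaranty.doc" might look under the flat policy, with a small consistency check. Tagging both tokens X is only one of the options left to treebanks, the placement of ExtPos=PROPN on the first token is an assumption of this sketch, and the dictionaries and function name are hypothetical.

```python
# One possible annotation under the flat policy (illustrative, not official).
FILENAME = [
    {"id": 1, "form": "GPSA", "upos": "X",
     "feats": "ExtPos=PROPN", "head": 0, "deprel": "root"},
    {"id": 2, "form": "Guaranty.doc", "upos": "X",
     "feats": "_", "head": 1, "deprel": "flat"},
]

def is_flat_name(tokens):
    """Check that later tokens attach to the first via flat,
    and that the first token carries an ExtPos feature."""
    first, rest = tokens[0], tokens[1:]
    has_extpos = any(f.startswith("ExtPos=")
                     for f in first["feats"].split("|"))
    attached = all(t["head"] == first["id"] and t["deprel"] == "flat"
                   for t in rest)
    return has_extpos and attached

print(is_flat_name(FILENAME))
```

A treebank that prefers substantive tags could keep NOUN on recognizable words and leave the rest of the structure unchanged.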
(In retrospect, perhaps instead of flat/fixed/goeswith it would have been better to have one relation for multi-token words and another relation for headless multi-word expressions. Something to consider for a potential UDv3.)
I think we are completely losing the meaning of the UD syntactic relations, or at least I am completely lost. flat is used for headless constructions, such as the "first name - second name" construction. They are particular constructions in the sense of CxG for instance. It is true that flat:foreign is also used for foreign expressions, and in this case does not really refer to a headless construction, but ok. For the cases we are discussing here, I don't think they are headless constructions in any acceptable sense.
Conversely, goeswith means 'goes with', that is, two tokens that should be together. It can be because of a misspelling or, as proposed, because of a strange orthographic convention. Contrary to flat, goeswith clearly indicates that there is no construction in this case. I think we should clearly separate dependency labels referring to syntactic constructions from non-linguistic dependency labels.
By the way, I am also a bit confused by @dan-zeman, @jnivre, @nschneid answers about fixed, saying that "fixed is for words-with-spaces". Do you really consider that whether or not, according to, all but, etc. are words of English? (see https://universal.grew.fr/?custom=66598aedccf83).
With regard to fixed, there clearly is a problem in how it is used more than in how it is defined.
Tokenisation over spaces would be the opposite and complementary option of multiword tokens. I think it might be very useful to recognise that spaces are actually often used to separate things which are at an intermediate level between what we identify as syntactic words and phrases, but, like punctuation marks, cannot be an ultimate tokenisation criterion themselves. Whether this really has an impact on current parsers needs to be investigated, but from a machine point of view a space is just a character like any other.
I do not think I put forward too strong an interpretation of flat: it is defined to be used for "flat" phrases, so it entails a linguistic interpretation. A filename has no such interpretation, nor does an email address, a phone number, any number expressed by means of symbols... so I think it should be avoided, because a file name, i.e. a single block of alphanumeric + other characters, is really different from a personal name with many components, which all by themselves are morphosyntactically analysable words.
By the way, flat is dangerously close to conj up to the point one wonders where the difference is, but this is another story...
I think the point that “flat” indicates a construction but “goeswith” does not is a good one. I hadn’t thought of that. On the other hand, the main use of “goeswith” also carries the implication that it is accidental and erroneous, which doesn’t apply to the filename case (presumably), so one would have to decide which is the most important criterion.
When it comes to “fixed”, I do maintain that it should be restricted to “words with spaces”, as stated in the documentation, but its application across languages and treebanks is currently quite inconsistent. This is not least true about the Swedish treebanks, as pointed out by my colleague Lars Ahrenberg in a paper at this year’s UD workshop. In addition, I think there may be different conceptions of what a “word with spaces” is. You mention the example “parce que” in French and the fact that “parce” is only used in that combination. This is clearly a good indication that it is a word with spaces, but I don’t think the occurrence of such an element is a necessary condition.
Let me give the example of expressions referring to days in Swedish. The equivalent of “today” is “i dag” or “idag” (both orthographies are common and accepted as correct); the equivalent of “yesterday” is “i går” or “igår”. It so happens that “går” is like “parce”, that is, it only occurs in this combination (disregarding the homonymous verb form meaning “walk”), while “dag” is a regular noun meaning “day”. However, I would argue that both expressions are equally frozen in modern Swedish and should be analyzed as “fixed” when written with a space.
I think @sylvainkahane is suggesting a primary distinction between multi-token words (words-with-spaces) and headless phrases (where individual elements might be omissible, modifiable, etc.). That sounds perfectly sensible to me, it's just not what UDv2 has given its narrow definitions of goeswith and fixed, and its broad definition of flat.
Some treebanks are using flat:foreign as a way to acknowledge that foreign expressions are a bit different in this regard from the flat expressions that are headless phrases. What about another subtype that would apply to the telephone numbers and filenames, e.g. flat:mtw for "multi-token word"?
If there is a UDv3 I do think the goeswith/fixed/flat relations should be reconsidered, also because many people expect the term "fixed" to cover morphosyntactically fixed expressions in general, whereas it is only intended for a small list of grammatical ones.
By the way, I am also a bit confused by @dan-zeman, @jnivre, @nschneid answers about fixed, saying that "fixed is for words-with-spaces". Do you really consider that whether or not, according to, all but, etc. are words of English? (see https://universal.grew.fr/?custom=66598aedccf83).
The current list of English fixed expressions is documented here. It is largely inherited from the Stanford Dependencies annotation of EWT, and there are definitely debatable cases in this list, as well as others that maybe should be added to the list (https://github.com/UniversalDependencies/UD_English-EWT/issues/400). I'm happy to discuss those separately, but for purposes of the present discussion, we should go by the universal definition at https://universaldependencies.org/u/dep/fixed.html.
I would like to express my support for @nschneid's suggestion that
If there is a UDv3 I do think the goeswith/fixed/flat relations should be reconsidered,
It is obvious from this discussion that so many long-time UD experts have different intuitions on how these relations should be used. And although the guidelines for fixed have been updated they are still not detailed enough. What is actually meant by 'the most grammaticalized cases'? In the paper @jnivre refers to, I try to identify (in Swedish) what I call rigid expressions, i.e. those showing no variation at all. But they are still too numerous to qualify as 'a closed class'.
The comment by @sylvainkahane that he sees flat as a relation for headless constructions I find interesting. The problem is that UD currently only recognizes one such construction, i.e. names. Currently, fixed is used for many expressions that have an internal head, such as ADP + NOUN which we may call 'headed constructions' with the noun as the head even if it is non-determined. If UD keeps only one deprel for headless constructions, the distinction between names and fixed non-headed expressions (and typos) could instead be made with features, say in the MISC column. And with a feature for fixedness the headed fixed expressions could have both their syntax annotated (with deprels) and their status as fixed expressions represented.
The email genre of English-EWT lists file attachments, e.g. "Constellation Power (GISB draft).doc".
compound relations and an appos relation for the parenthetical. I'm not sure how ".doc" should attach: flat?