Closed spyysalo closed 10 years ago
1) relations: [a-z-]+ 2) No. [prev. decision on this] 3) Yes. [prev. decision on this] 4) Cannot see why they couldn't. 5) I don't think the ranges should be allowed to overlap.
@fginter : thanks! If there are no objections, I'll check to make sure the constraints are documented and introduce relevant test cases.
Ah, forgot to ask about the POSTAG field. Do we constrain this to [A-Z]+ like CPOSTAG (by definition as a closed set w/that form)?
No sorry. I'm deleting my "why not" comment. POSTAG carries the language-dependent tag. I think we shouldn't restrict or interfere with these. People will get frustrated if their corpora don't pass the validator because of this. I think we should leave POSTAG more or less at "whatever you want, it's your data".
I see the argument, but think that "whatever you want" is too liberal. Unicode is full of surprises, and whitelisting (as opposed to blacklisting) would help avoid potential future issues also for POSTAG. If we want to be permissive, perhaps we could see what's been used previously (e.g. CoNLLs) and define a superset?
Also, if this argument holds for POSTAG, why not for DEPREL and DEPS, which may also contain language-specific types?
(IMHO of course.)
Returning to a point for DEPREL and DEPS, just to make sure: the proposed constraint for relations, [a-z-]+, would rule out relation forms such as conj_and and prep_on, previously used for (collapsed, propagated) SD. Is this OK?
(I'm not sure who is getting notifications here, so: @jnivre , @dan-zeman and everyone else, comments very welcome!)
These are tricky questions. I would lean towards allowing underscores (but nothing else) in DEPREL and DEPS labels. Note that we don't allow arbitrary treebank-specific stuff here, only language-specific subtypes of universal relations, so people will have to stick to the rules. For the POSTAG field, I would ask Dan what is needed to allow all the tagsets in Interset while still ruling out Unicode surprises.
@jnivre : thank you for the input! I am also in favor of fairly restrictive constraints here.
Just to note: constraining DEPREL and DEPS labels to [a-z_]+ would rule out some types that are currently included in the Finnish documentation (http://universaldependencies.github.io/docs/fi-dep-index.html) such as nsubj-cop (http://universaldependencies.github.io/docs/fi-dep/nsubj-cop.html).
Okay, let's allow the hyphen too. But that's my limit. :)
Thanks, @jnivre. nsubj-cop and nmod-own appreciate it. :) I personally prefer the dash over underscore because it looks better in text in papers. It's more human. POSTAG: okay, I stand outvoted, no problem. I will modify the validator accordingly, and expand it once Dan replies. I'll assign this issue to myself so I don't forget.
+1 for dash over underscore if there's a need to choose one over the other; what @fginter said :-)
(Nitpick: I'd like to further narrow down the DEPREL / DEPS label constraint to [a-z][a-z_-]*, disallowing dash and underscore in the initial position. Or maybe even [a-z]([a-z]|[_-]+[a-z])*, disallowing also the final position.)
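To make the difference between the candidate patterns concrete, here is a minimal Python sketch that checks a few labels against each of them. The pattern names, the helper `check_labels`, and the sample labels `-cop` and `conj_` are made up for illustration; this is not the validator's actual code.

```python
import re

# Candidate DEPREL / DEPS label patterns discussed in this thread.
PATTERNS = {
    "hyphen only": r"[a-z-]+",
    "hyphen or underscore": r"[a-z][a-z_-]*",
    "no initial/final separator": r"[a-z]([a-z]|[_-]+[a-z])*",
}

def check_labels(labels):
    """Return {pattern name: [labels fully matched by that pattern]}."""
    return {
        name: [lab for lab in labels if re.fullmatch(pat, lab)]
        for name, pat in PATTERNS.items()
    }

if __name__ == "__main__":
    samples = ["nsubj", "nsubj-cop", "conj_and", "-cop", "conj_"]
    for name, accepted in check_labels(samples).items():
        print(f"{name}: {accepted}")
```

Only the last pattern rejects both a leading and a trailing separator, while still accepting nsubj-cop and conj_and.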
Attempting a partial summary of suggested answers so far:
DEPREL / DEPS labels: [a-z][a-z_-]*
[...] HEAD = 0 (and DEPREL = root)
[...] HEAD = 0 (and DEPREL = root)
[...] (DEPS) may also contain HEAD = 0 (and DEPREL = root) dependencies
POSTAG forms should be constrained (roughly) to the minimum needed to allow all the tagsets in Interset (waiting for input from @dan-zeman)
(Please correct me if anything here is wrong.)
Yet a quick comment. This format will hopefully see adoption also outside the UD project. The validation script should support this. I might want to modify the script such that these UD-specific restrictions on character sets can be switched off, while still keeping the other checks in. Not a huge priority atm, of course.
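A rough Python sketch of what such a switch could look like. The function name `validate_deprel`, the flag `ud_strict`, and the error messages are hypothetical and not taken from the actual validation script; the pattern is the tentative one from the summary above.

```python
import re

# Tentative UD-specific label pattern from the summary above (an assumption,
# not a final decision).
UD_DEPREL = re.compile(r"[a-z][a-z_-]*")

def validate_deprel(label, ud_strict=True):
    """Return a list of error messages for one DEPREL value.

    The format-level check (non-empty, no whitespace) always runs;
    the UD-specific character restriction runs only when ud_strict is True.
    """
    errors = []
    if not label or any(ch.isspace() for ch in label):
        errors.append("DEPREL must be non-empty and contain no whitespace")
    elif ud_strict and not UD_DEPREL.fullmatch(label):
        errors.append("DEPREL does not match the UD label pattern")
    return errors
```

With `ud_strict=False`, a non-UD corpus using labels such as `NSUBJ` would still pass, while a label containing a space would be rejected in both modes.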
Note: fd6bee8c2ac97dde960fb0ea8703d19aa0f48157 includes a guess at the POSTAG form constraints that should be updated when this is decided on.
If we are restricting the set of characters allowed in DEPREL, then I do not think that both "_" and "-" are necessary. They serve the same purpose, don't they? I like the dash more. And I agree that it should only be used in a middle position.
As for the POSTAG, what exactly do you mean by "Unicode surprises"? I have seen all of the following in POSTAGS, though rarely:
If we do not want to force the users to modify existing corpus-specific tags, we have to allow Unicode. On the other hand, we have to at least require that there are no whitespaces inside the tag, and I should have mentioned above that whitespaces occur too.
I currently lean towards allowing arbitrary POSTAG content similarly to FORM and LEMMA, i.e. any UTF-8 string, with the exception that control characters and whitespace characters are not allowed (and recommend that such characters are substituted by underscore (or dash?))
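As an illustration of the substitution suggested here, a minimal Python sketch. The helper names and the default replacement are assumptions, and "control character" is taken to mean Unicode general category Cc, which leaves out format characters such as zero-width spaces.

```python
import unicodedata

def is_disallowed(ch):
    """True for whitespace (including non-ASCII spaces) and control characters."""
    return ch.isspace() or unicodedata.category(ch) == "Cc"

def sanitize_postag(tag, replacement="_"):
    """Replace each whitespace or control character in a tag with `replacement`."""
    return "".join(replacement if is_disallowed(ch) else ch for ch in tag)
```

For example, `sanitize_postag("NN SG")` gives `"NN_SG"`, and a tag without whitespace or control characters is returned unchanged.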
I'd generally be happy to vote for excluding underscore, but worry about breaking compatibility with (collapsed, propagated) SD.
Perhaps @mcdm and/or @manning could help here: what is your preference regarding the characters to allow for DEPREL? Would [a-z-]+ be acceptable even if this meant using forms such as conj-and instead of conj_and?
Regarding POSTAG, yes, space (esp. non-ASCII space) would certainly qualify as a surprise, as would control characters. I'm also wary of allowing arbitrary character sets and non-alnum, as these may cause difficulties for related tools. (Just as one example, the current documentation system uses labels as filenames and would break in interesting ways for <.)
As we're asking users to modify other corpus-specific aspects, perhaps we could also constrain POSTAG beyond the minimum of excluding space and control chars? This wouldn't need to be more than a simple mapping.
OK, let's not break compatibility, both underscore and "-" will work. But if we can agree on one candidate, then I would add a recommendation, which of these two characters the newly defined language-specific labels should use. So that we encourage uniform look at least for future additions.
Regarding POSTAG, I agree that special characters such as < represent potential problems. Still, I would not ban them all just because some are dangerous. Can we enumerate what the dangerous characters are? I think the following cannot or should not be used in filenames:
: / \ ? *
the following are better avoided because of HTML/XML:
< > &
not using quotation marks, brackets and other chars will make life easier within various shells:
" ' ` # ( ) [ ] { } |
Anything else?
The following non-alnum ASCII characters have not been mentioned above:
! $ % + , - . ; = @ ^ _ ~
It should be safe to allow alphanumeric characters from the whole Unicode, although it will be very rare to actually encounter any non-ASCII letters (as far as I remember, they are only in SynTagRus/Russian, and in one Japanese tagset for which I don't even have the corpus, I just have the tagset described in Interset by one of our students). ASCII non-alphanumeric occur much more often than non-ASCII alphanumeric :-)
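To see how these groups apply to a given tagset, here is a small Python sketch. The three groups are copied from the lists above and are not meant to be exhaustive; the names `RISKY` and `risky_chars` are made up for illustration.

```python
# Character groups as listed in the comment above (not an exhaustive list).
RISKY = {
    "filename": set(':/\\?*'),
    "html/xml": set('<>&'),
    "shell": set('"\'`#()[]{}|'),
}

def risky_chars(charset):
    """Map each risk group to the sorted characters of `charset` that fall in it."""
    found = {}
    for group, chars in RISKY.items():
        hits = sorted(set(charset) & chars)
        if hits:
            found[group] = hits
    return found
```

Given the characters of the Penn tagset, `risky_chars("#$',-.:ABCDEFGHIJLMNOPRSTUVWXYZ`")` reports the colon under "filename" and `#`, `'` and the backtick under "shell".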
If we're considering potential shell trouble as a criterion for exclusion, at least !, &, ; and ~ are bash specials, likely others also.
Given the enormous range of Unicode, I'd prefer whitelisting instead of blacklisting and would suggest not to consider anything but alnum.
I don't know. These are tags taken from other corpora and having to translate them creates confusion. Do we really need to be able to use them as filenames, unquoted identifiers or whatever?
The Czech PDT tags (positional, 15 characters each) contain the following characters: !#*,-.0123456789:;=?@ABCDEFGHIJKLMNOPQRSTUVWXYZ^abcdefghijklmnopqrstuvwxyz}~
Other tagsets:
ar::conll -1234=ACDFGIMNPQRSVXYZ_acdefgimnoprsuv|
ar::conll2007 -1234=ACDFGIJMNPQRSVYZ_abcdefimnorsu|
ar::padt -1234ACDEFGIJLMNPQRSUVXYZ
bg::conll 123=ACDHIMNPRTV_acdefghilmnopqrstuvxyz|
bn::conll +-01234567ABCDEFGIJKLMNPQRSTUVWXYZ_abcdefghijklmnoprstuvwxyz|
ca::conll2009 123=_abcdefghijklmnopqrstuvwxyz|
cs::ajka 0123456789ABCDFILMNOPQRSTYacdeghkmnptxy
cs::cnk !#*,-.0123456789:;=?@ABCDEFGHIJKLMNOPQRSTUVWXYZ^abcdefghijklmnopqrstuvwxyz}~
cs::conll !#*,.0123456789:;=?@ABCDEFGHIJKLMNOPQRSTUVWXYZ^_abcdefghijklmnopqrstuvwxyz|}~
cs::conll2009 !#*,.0123456789:;=?@ABCDEFGHIJKLMNOPQRSTUVWXYZ^abcdefghijklmnopqrstuvwxyz|}~
cs::multext -123ACIMNPQRSVXYacdfgilmnopqrstvxyz
cs::pdt !#*,-.0123456789:;=?@ABCDEFGHIJKLMNOPQRSTUVWXYZ^abcdefghijklmnopqrstuvwxyz}~
cs::pmk -/0123456789<>CFJPZi
cs::pmkkr -/0123456789<>CFJPZ_i
da::conll /123=ACDEFGINOPRSTUVX_abcdefgijklmnoprstuvxy|
de::conll $(,.ACDEFGIJKLMNOPRSTUVWXYZ_
de::conll2009 $(*,.123ACDEFGIJKLMNOPRSTUVWXYZ_abcdefgjlmnoprstuz|
de::stts $(,.ACDEFGIJKLMNOPRSTUVWXYZ
el::conll 0123ABCDEFGILMNOPRSTUVWX_abcdefgijlmnoprstuvwx|
en::conll #$'(),.:ABCDEFGHIJLMNOPRSTUVWXYZ_`
en::conll2009 #$'(),.:BCDEFGHIJLMNOPRSTUVWXYZ_`
en::penn #$',-.:ABCDEFGHIJLMNOPRSTUVWXYZ`
es::conll2009 123=_abcdefghijklmnopqrstuvwxyz|
et::puudepank
eu::conll +-/12345678:?ABCDEFGHIJKLMNOPRSTUWZ_abdeghiklmnoprstuz|
fa::conll 123=ABCDEFGHIJLMNOPQRSTUVXY_abcehmnoprstu|
fi::conll -/1234ABCDEFGHIJKLMNOPQRSTUVWXZacdeghijklmnopqrstuv|ä
grc::conll -123=acdefgilmnoprstuvx|
he::conll
hi::conll &'+-0123>ABCDEFGIJKLMNOPQRSTUVWXYZ_abcdefghijklmnoprstuvwy|ँंअआईउऊएओकखगचछजटठडणतथदधनपफबभमयरलवशषसह़ािीुूेैोौ्ज़ड़०१
hr::multext -123ACIMNPQRSVXYacdefgilmnopqrsvxyz
hu::conll 123=ACIMNOPRSTUVWXYZ_abcdefghilmnopqrstuvxy|
it::conll 123=ABCDEFGIMNOPQRSTUVWX_degmnoprstu|
ja::conll ,-.?ACDEFGIJMNOPQRSTUVX_abcdefghijklmnopqrstuwx
ja::ipadic -そのァアィイサットナフベラルー一並人他代体係列副助動化句可号名固国地域変姓字容尾幹引弧形感投括接数文断有格殊点片特用白的空立約終組続縮織能自般言記詞語読連閉開間非音頭類/
la::conll -123=abcdefgilmnoprstuv|
mul::google .ABCDEJMNOPRTUVX
nl::cgn
nl::conll 123ACIMNPUVW_abcdefghijklmnoprstuvwz|
no::conll +-<>abcdefgijklmnoprstuvy
pl::conll2009 123:_abcdefgijklmnopqrstuvwxz
pl::ipipan 123:abcdefgijklmnopqrstuvwxz
pt::conll -/123<>?ABCDEFIJKLMNOPQRSTUV_abcdefghijklmnopqrstuvy|
ro::rdt .abcdefhijlmnoprstuvxz
ru::syntagrus -123ACDIJMNOPRSTUVАВГДЕЖЗИКЛМНОПРСТУФЧШЪЯ
sk::snk +-1234567ABDEFGHIJKLMNOPQRSTUVWYabcdefghijkmnpstuvxyz
sl::conll -=ACDFGINOPRSTUV_abcdefghijlmnopqrstuvwxy|
sv::conll +?ABCDEFGHIJKMNOPQRSTUVWXY_
sv::hajic -0ACDEFGHIMNOPQRSUVWX
sv::mamba +?ABCDEFGHIJKMNOPQRSTUVWXY
ta::tamiltb #-123:=ABCDEFGHIJLMNOPQRSTUVWZ_abdeghijklmnopqrstuvwxyz|
ta::tamiltbv1
ta::tamiltbv1l2
te::conll +-0123ABCDEFGHIJKLMNOPQRSTUVWXY_abcdefgijklmnoprstuvwxy|
tr::conll 123ABCDEFGHIJLNOPQRSVW_abcdefghijklmnopqrstuvwxy|
tr::trmorph 123:<>ACDEINOPQVabcdefghijlmnopqrstuvx
tr::trmorph022 123<>_abcdefgijklmnopqrstuvxyz
ur::conll
zh::conll +,0123456789ABCDEFGHIJKLMNOPSTV[]_abcdefghijkpqrstuv}
"Do we really need to be able to use them as filenames, unquoted identifiers or whatever?"
No, not really. Allowing "special" non-alnum chars such as < will require rewriting parts of the documentation + visualization system (and some care from doc authors), but I won't try to claim that my technical debt is a serious argument for constraining the format :-)
On the other hand, if the format is (reasonably) widely adopted, there will hopefully be many other tools also, mostly not written by us. I think there is a reasonable argument that the format should generally favor making things difficult for data producers (small group) rather than consumers (larger group) when the two are in conflict.
(Then again, I don't think this primarily technical perspective should carry too much weight here. I'd be happy to follow whatever you and others prefer!)
@dan-zeman : thank you for the comprehensive injection of data into the discussion! It's certainly true that these tagsets involve a very broad range of characters.
(Sidenote: above, the characters @ and (I think) ` triggered special processing in GitHub's comment system, and some characters in the hi::conll set are rendered blank in my (reasonably modern) browser.)
Yes, I'd like to hear others, too. Originally I thought that we may want to restrict the charset here but as I saw how much we would have to restrict it if it should be effective, I wonder if it's worth doing. As a consumer, I think that the original tags (say, PDT tags) are useful only if they look as they do in PDT. Otherwise, I can as well ignore them and go directly to the universal tag and features (provided the same information is contained there).
These are just pieces of data from outer sources. So now I tend to believe that restricting their content does not have to be stronger than restrictions placed on FORM and LEMMA.
On _ vs. - in dependency names: I can see - being nicer too. But we have historically used _ for years in our "collapsed" ("enhanced") dependencies, so it might just be easiest/nicest to retain it. I'm okay to recommend - going forward and to only allow them medially.
On tag sets: I think it is a reasonably big nuisance if you can't use existing widely used tag sets in the POSTAG column, and, as noted exhaustively by Dan, a lot of characters are used. Even the most common, English-centric one, the Penn tag set, has quite a few characters that are special by certain standards (# $ ' ` , . :). I think it would be far preferable to allow them in CoNLL-U. Unfortunately, that means that there is already one there that is problematic in filenames (:), so some form of encoding is probably needed if you are using these as filenames (though I don't think that's needed for POSTAG). While non-Latin1 characters could be excluded, it makes it all a bit English/Latinate writing system-centric, so I'd be tempted to allow a broad range of printable characters in the POSTAG column. The Hindi characters in hi::conll render fine in my browser. :) I doubt it is a browser issue; probably a font issue.
Agreed, losing the ability to represent PTB tags would be quite a bit of a negative. I'm happy to withdraw my earlier suggestion and have updated the format document to remove the previously suggested constraint.
It also appears there is full agreement on allowing underscore in DEPREL but encouraging dash for new types. I'll add this to the docs.
(Meta: I'm not sure why this issue was left open in the tracker, the open questions appear to have been decided. For reference, I'm now interpreting the conclusion re allowed POSTAG characters simply as Unicode [[:graph:]], i.e. any visible character.)
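Python's `re` module has no POSIX character classes, so a check along these lines has to be approximated. The sketch below assumes "graph" means any character that is not whitespace, control (Cc), surrogate (Cs) or unassigned (Cn), which follows the usual Unicode definition of the class; the function names are hypothetical.

```python
import unicodedata

def is_graph(ch):
    """Approximate Unicode [[:graph:]]: not whitespace, control, surrogate or unassigned."""
    return not ch.isspace() and unicodedata.category(ch) not in {"Cc", "Cs", "Cn"}

def valid_postag(tag):
    """True if the tag is non-empty and consists only of graph characters."""
    return bool(tag) and all(is_graph(ch) for ch in tag)
```

Under this reading, Penn-style tags such as `PRP$` and non-Latin tags such as `名詞` are accepted, while any tag containing a space or tab is rejected.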
Some more questions regarding the format spec (http://universaldependencies.github.io/docs/format.html):
[...] [A-Z0-9][a-zA-Z0-9]*. Are dependency relations similarly required to have a particular form (e.g. [a-z]+)?
[...] HEAD = 0 (root relations)?
[...] HEAD = 0 (root relations)? (yes for CoNLL-X)
[...] (root) dependencies occur also in secondary dependencies (DEPS)?
(+extra non-dep question:) May multiword token ranges overlap? Intuitively no, but this doesn't seem to be explicit in the docs. Consider e.g. (nonsense example)