format documentation: some (mostly) HEAD and DEPS questions

spyysalo commented 10 years ago

Some more questions regarding the format spec (http://universaldependencies.github.io/docs/format.html):

1) It was decided (#33) that feature names and values must have the form [A-Z0-9][a-zA-Z0-9]*. Are dependency relations similarly required to have a particular form (e.g. [a-z]+)?
2) May a CoNLL-U sentence have no words with HEAD = 0 (root relations)?
3) May a CoNLL-U sentence have more than one word with HEAD = 0 (root relations)? (yes for CoNLL-X)
4) May head 0 (root) dependencies also occur in secondary dependencies (DEPS)?

5) (+extra non-dep question:) May multiword token ranges overlap? Intuitively no, but this doesn't seem to be explicit in the docs. Consider e.g. (nonsense example)

1    I        I     PRON  PRN  Num=Sing|Per=1  2  nsubj  _  _
2-3  haven't  _     _     _    _               _  _      _  _
2    have     have  VERB  VB   Tens=Pres       0  root   _  _
3-4  nota     _     _     _    _               _  _      _  _
3    not      not   ADV   RB   _               2  neg    _  _
4    a        a     DET   DT   _               5  det    _  _
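
One way to make the overlap question concrete is to check the ID column mechanically. The following is a minimal Python sketch, not the project's validator: the function names `multiword_ranges` and `overlapping_ranges` are hypothetical, and it assumes range IDs always have the form start-end.

```python
def multiword_ranges(ids):
    """Return (start, end) pairs for every ID of the form 'start-end'."""
    ranges = []
    for token_id in ids:
        if "-" in token_id:
            start, end = token_id.split("-")
            ranges.append((int(start), int(end)))
    return ranges


def overlapping_ranges(ids):
    """Return neighbouring range pairs (after sorting) that share a word index."""
    ranges = sorted(multiword_ranges(ids))
    return [(a, b) for a, b in zip(ranges, ranges[1:]) if b[0] <= a[1]]


# ID column of the nonsense example above: ranges 2-3 and 3-4 both cover word 3.
example_ids = ["1", "2-3", "2", "3-4", "3", "4"]
print(overlapping_ranges(example_ids))  # [((2, 3), (3, 4))]
```

Comparing only neighbouring ranges after sorting is enough to tell whether any overlap exists, although it may not list every overlapping pair.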

fginter commented 10 years ago

1) relations: [a-z-]+
2) No. [prev. decision on this]
3) Yes. [prev. decision on this]
4) Cannot see why they couldn't.
5) I don't think the ranges should be allowed to overlap.

spyysalo commented 10 years ago

@fginter : thanks! If there are no objections, I'll check to make sure the constraints are documented and introduce relevant test cases.

spyysalo commented 10 years ago

Ah, forgot to ask about the POSTAG field. Do we constrain this to [A-Z]+ like CPOSTAG (by definition as a closed set w/that form)?

fginter commented 10 years ago

No sorry. I'm deleting my "why not" comment. POSTAG carries the language-dependent tag. I think we shouldn't restrict / talk into these. People will get frustrated if their corpora don't pass the validator because of this. I think POSTAG we should let be more or less at "whatever you want, it's your data".

spyysalo commented 10 years ago

I see the argument, but think that "whatever you want" is too liberal. Unicode is full of surprises, and whitelisting (as opposed to blacklisting) would help avoid potential future issues also for POSTAG. If we want to be permissive, perhaps we could see what's been used previously (e.g. CoNLLs) and define a superset?

Also, if this argument holds for POSTAG, why not for DEPREL and DEPS, which may also contain language-specific types?

(IMHO of course.)

spyysalo commented 10 years ago

Returning to a point for DEPREL and DEPS, just to make sure: the proposed constraint for relations, [a-z-]+, would rule out relation forms such as conj_and and prep_on, previously used for (collapsed, propagated) SD. Is this OK?

spyysalo commented 10 years ago

(I'm not sure who is getting notifications here, so: @jnivre , @dan-zeman and everyone else, comments very welcome!)

jnivre commented 10 years ago

These are tricky questions. I would lean towards allowing underscores (but nothing else) in DEPREL and DEPS labels. Note that we don't allow arbitrary treebank-specific stuff here, only language-specific subtypes of universal relations, so people will have to stick to the rules. For the POSTAG field, I would ask Dan what is needed to allow all the tagsets in Interset while still ruling out Unicode surprises.

spyysalo commented 10 years ago

@jnivre : thank you for the input! I am also in favor of fairly restrictive constraints here.

Just to note: constraining DEPREL and DEPS labels to [a-z_]+ would rule out some types that are currently included in the Finnish documentation (http://universaldependencies.github.io/docs/fi-dep-index.html) such as nsubj-cop (http://universaldependencies.github.io/docs/fi-dep/nsubj-cop.html).

jnivre commented 10 years ago

Okay, let's allow the hyphen too. But that's my limit. :)

fginter commented 10 years ago

Thanks, @jnivre. nsubj-cop and nmod-own appreciate it. :) I personally prefer the dash over underscore because it looks better in text in papers. It's more human. POSTAG: okay, I stand overvoted, no problem. I will modify the validator accordingly, and expand it once Dan replies. I'll assign this issue to myself not to forget.

spyysalo commented 10 years ago

+1 for dash over underscore if there's a need to choose one over the other; what @fginter said :-)

(Nitpick: I'd like to further narrow down the DEPREL / DEPS label constraint to [a-z][a-z_-]*, disallowing dash and underscore in the initial position. Or maybe even [a-z]([a-z]|[_-]+[a-z])*, disallowing also the final position.)
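
To see how the candidate constraints differ in practice, here is a small Python sketch. It assumes the patterns are applied as full-string matches; the dictionary keys and the helper `accepted_by` are names made up for this illustration, and `[a-z_-]+` stands for the variant that allows both underscore and hyphen anywhere.

```python
import re

# Candidate constraints from this thread (the dictionary keys are hypothetical).
CANDIDATES = {
    "any_position": re.compile(r"[a-z_-]+"),
    "not_initial": re.compile(r"[a-z][a-z_-]*"),
    "medial_only": re.compile(r"[a-z]([a-z]|[_-]+[a-z])*"),
}


def accepted_by(label):
    """Return the names of the candidate patterns that fully match the label."""
    return [name for name, pattern in CANDIDATES.items() if pattern.fullmatch(label)]


for label in ["nsubj", "nsubj-cop", "conj_and", "-foo", "foo-", "Nsubj"]:
    print(label, accepted_by(label))
```

Under these assumptions, nsubj-cop and conj_and pass all three patterns, -foo passes only the loosest one, foo- is rejected only by the strictest one, and Nsubj is rejected by all of them.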

spyysalo commented 10 years ago

Attempting a partial summary of suggested answers so far:

Dependency relations must have the form [a-z][a-z_-]*
Every sentence must have at least one word with HEAD = 0 (and DEPREL = root)
Sentences may have more than one word with HEAD = 0 (and DEPREL = root)
Secondary dependencies (DEPS) may also contain HEAD = 0 (and DEPREL = root) dependencies
Multiword token ranges may not overlap
POSTAG forms should be constrained (roughly) to the minimum needed to allow all the tagsets in Interset (waiting for input from @dan-zeman )

(Please correct me if anything here is wrong.)
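
The HEAD = 0 rules in this summary could be tested roughly as follows. This is only a sketch under stated assumptions: `root_problems` is a hypothetical helper, and it expects one (head, deprel) pair per syntactic word, with multiword token lines already left out.

```python
def root_problems(words):
    """Check the HEAD = 0 rules for one sentence.

    `words` is a list of (head, deprel) pairs, one per syntactic word,
    with multiword token lines left out.
    """
    problems = []
    # At least one word must attach to the artificial root.
    if not any(head == 0 for head, _ in words):
        problems.append("no word with HEAD = 0")
    # Every word attached to the root must carry the relation "root".
    for index, (head, deprel) in enumerate(words, start=1):
        if head == 0 and deprel != "root":
            problems.append(f"word {index}: HEAD = 0 but DEPREL = {deprel}")
    return problems


print(root_problems([(2, "nsubj"), (0, "root")]))  # one root
print(root_problems([(0, "root"), (0, "root")]))   # several roots are allowed
print(root_problems([(2, "nsubj"), (1, "dobj")]))  # no root at all
```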

fginter commented 10 years ago

One more quick comment. This format will hopefully see adoption also outside the UD project. The validation script should support this. I might want to modify the script such that these UD-specific restrictions on character sets can be switched off, while still keeping the other checks in. Not a huge priority atm, of course.

spyysalo commented 10 years ago

Note: fd6bee8c2ac97dde960fb0ea8703d19aa0f48157 includes a guess at the POSTAG form constraints that should be updated when this is decided on.

dan-zeman commented 10 years ago

If we are restricting the set of characters allowed in DEPREL, then I do not think that both "_" and "-" are necessary. They serve the same purpose, don't they? I like the dash more. And I agree that it should only be used in a middle position.

As for the POSTAG, what exactly do you mean by "Unicode surprises"? I have seen all of the following in POSTAGS, though rarely:

non-alphanumeric characters ("<")
cyrillic ("ЖЕН")
kanji (Chinese/Japanese characters)

If we do not want to force the users to modify existing corpus-specific tags, we have to allow Unicode. On the other hand, we have to at least require that there are no whitespaces inside the tag, and I should have mentioned above that whitespaces occur too.

I currently lean towards allowing arbitrary POSTAG content similarly to FORM and LEMMA, i.e. any UTF-8 string, with the exception that control characters and whitespace characters are not allowed (and recommend that such characters are substituted by underscore (or dash?))
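
The substitution suggested here might look like the following in Python. It is a sketch, not an agreed procedure: `sanitize_postag` is a hypothetical name, "control character" is taken to mean Unicode general category Cc, and whitespace is whatever `str.isspace` reports.

```python
import unicodedata


def sanitize_postag(tag, replacement="_"):
    """Replace whitespace and control characters (category Cc) in a tag."""
    return "".join(
        replacement if ch.isspace() or unicodedata.category(ch) == "Cc" else ch
        for ch in tag
    )


print(sanitize_postag("N N"))       # ASCII space
print(sanitize_postag("A\u00a0B"))  # no-break space
print(sanitize_postag("ЖЕН"))       # Cyrillic letters pass through unchanged
```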

spyysalo commented 10 years ago

I'd generally be happy to vote for excluding underscore, but worry about breaking compatibility with (collapsed, propagated) SD.

Perhaps @mcdm and/or @manning could help here: what is your preference regarding the characters to allow for DEPREL? Would [a-z-]+ be acceptable even if this meant using forms such as conj-and instead of conj_and?

Regarding POSTAG, yes, space (esp. non-ASCII space) would certainly qualify as a surprise, as would control characters. I'm also wary of allowing arbitrary character sets and non-alnum, as these may cause difficulties for related tools. (Just as one example, the current documentation system uses labels as filenames and would break in interesting ways for <.)

As we're asking users to modify other corpus-specific aspects, perhaps we could also constrain POSTAG beyond the minimum of excluding space and control chars? This wouldn't need to be more than a simple mapping.

dan-zeman commented 10 years ago

OK, let's not break compatibility, both underscore and "-" will work. But if we can agree on one candidate, then I would add a recommendation, which of these two characters the newly defined language-specific labels should use. So that we encourage uniform look at least for future additions.

Regarding POSTAG, I agree that special characters such as < represent potential problems. Still, I would not ban them all just because some are dangerous. Can we enumerate what the dangerous characters are?

I think the following cannot or should not be used in filenames: : / \ ? *
The following are better avoided because of HTML/XML: < > &
Not using quotation marks, brackets and other chars will make life easier within various shells: " ' ` # ( ) [ ] { } |

Anything else? The following non-alnum ASCII characters have not been mentioned above: ! $ % + , - . ; = @ ^ _ ~
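
The list of characters not yet mentioned can be recomputed with a few lines of Python. The grouping into three sets follows the comment above; the variable names are made up for this sketch.

```python
import string

# Characters named in the comment above, grouped by the reason given there.
FILENAME_UNSAFE = set(':/\\?*')
MARKUP_UNSAFE = set('<>&')
SHELL_UNSAFE = set('"\'`#()[]{}|')

mentioned = FILENAME_UNSAFE | MARKUP_UNSAFE | SHELL_UNSAFE
# string.punctuation holds the 32 non-alphanumeric printable ASCII characters.
unmentioned = "".join(ch for ch in string.punctuation if ch not in mentioned)
print(unmentioned)  # !$%+,-.;=@^_~
```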

It should be safe to allow alphanumeric characters from the whole Unicode, although it will be very rare to actually encounter any non-ASCII letters (as far as I remember, they are only in SynTagRus/Russian, and in one Japanese tagset for which I don't even have the corpus, I just have the tagset described in Interset by one of our students). ASCII non-alphanumeric occur much more often than non-ASCII alphanumeric :-)

spyysalo commented 10 years ago

If we're considering potential shell trouble as a criterion for exclusion, at least !, &, ; and ~ are bash specials, likely others also.

Given the enormous range of Unicode, I'd prefer whitelisting instead of blacklisting and would suggest not to consider anything but alnum.

dan-zeman commented 10 years ago

I don't know. These are tags taken from other corpora and having to translate them creates confusion. Do we really need to be able to use them as filenames, unquoted identifiers or whatever?

dan-zeman commented 10 years ago

The Czech PDT tags (positional, 15 characters each) contain the following characters: !#*,-.0123456789:;=?@ABCDEFGHIJKLMNOPQRSTUVWXYZ^abcdefghijklmnopqrstuvwxyz}~

dan-zeman commented 10 years ago

Other tagsets:

ar::conll
    -1234=ACDFGIMNPQRSVXYZ_acdefgimnoprsuv|
ar::conll2007
    -1234=ACDFGIJMNPQRSVYZ_abcdefimnorsu|
ar::padt
-1234ACDEFGIJLMNPQRSUVXYZ
bg::conll
    123=ACDHIMNPRTV_acdefghilmnopqrstuvxyz|
bn::conll
    +-01234567ABCDEFGIJKLMNPQRSTUVWXYZ_abcdefghijklmnoprstuvwxyz|
ca::conll2009
    123=_abcdefghijklmnopqrstuvwxyz|
cs::ajka
0123456789ABCDFILMNOPQRSTYacdeghkmnptxy
cs::cnk
!#*,-.0123456789:;=?@ABCDEFGHIJKLMNOPQRSTUVWXYZ^abcdefghijklmnopqrstuvwxyz}~
cs::conll
    !#*,.0123456789:;=?@ABCDEFGHIJKLMNOPQRSTUVWXYZ^_abcdefghijklmnopqrstuvwxyz|}~
cs::conll2009
    !#*,.0123456789:;=?@ABCDEFGHIJKLMNOPQRSTUVWXYZ^abcdefghijklmnopqrstuvwxyz|}~
cs::multext
-123ACIMNPQRSVXYacdfgilmnopqrstvxyz
cs::pdt
!#*,-.0123456789:;=?@ABCDEFGHIJKLMNOPQRSTUVWXYZ^abcdefghijklmnopqrstuvwxyz}~
cs::pmk
-/0123456789<>CFJPZi
cs::pmkkr
-/0123456789<>CFJPZ_i
da::conll
    /123=ACDEFGINOPRSTUVX_abcdefgijklmnoprstuvxy|
de::conll
    $(,.ACDEFGIJKLMNOPRSTUVWXYZ_
de::conll2009
    $(*,.123ACDEFGIJKLMNOPRSTUVWXYZ_abcdefgjlmnoprstuz|
de::stts
$(,.ACDEFGIJKLMNOPRSTUVWXYZ
el::conll
    0123ABCDEFGILMNOPRSTUVWX_abcdefgijlmnoprstuvwx|
en::conll
    #$'(),.:ABCDEFGHIJLMNOPRSTUVWXYZ_`
en::conll2009
    #$'(),.:BCDEFGHIJLMNOPRSTUVWXYZ_`
en::penn
#$',-.:ABCDEFGHIJLMNOPRSTUVWXYZ`
es::conll2009
    123=_abcdefghijklmnopqrstuvwxyz|
et::puudepank
eu::conll
    +-/12345678:?ABCDEFGHIJKLMNOPRSTUWZ_abdeghiklmnoprstuz|
fa::conll
    123=ABCDEFGHIJLMNOPQRSTUVXY_abcehmnoprstu|
fi::conll
    -/1234ABCDEFGHIJKLMNOPQRSTUVWXZacdeghijklmnopqrstuv|ä
grc::conll
    -123=acdefgilmnoprstuvx|
he::conll
hi::conll
    &'+-0123>ABCDEFGIJKLMNOPQRSTUVWXYZ_abcdefghijklmnoprstuvwy|ँंअआईउऊएओकखगचछजटठडणतथदधनपफबभमयरलवशषसह़ािीुूेैोौ्ज़ड़०१
hr::multext
-123ACIMNPQRSVXYacdefgilmnopqrsvxyz
hu::conll
    123=ACIMNOPRSTUVWXYZ_abcdefghilmnopqrstuvxy|
it::conll
    123=ABCDEFGIMNOPQRSTUVWX_degmnoprstu|
ja::conll
    ,-.?ACDEFGIJMNOPQRSTUVX_abcdefghijklmnopqrstuwx
ja::ipadic
-そのァアィイサットナフベラルー一並人他代体係列副助動化句可号名固国地域変姓字容尾幹引弧形感投括接数文断有格殊点片特用白的空立約終組続縮織能自般言記詞語読連閉開間非音頭類／
la::conll
    -123=abcdefgilmnoprstuv|
mul::google
.ABCDEJMNOPRTUVX
nl::cgn
nl::conll
    123ACIMNPUVW_abcdefghijklmnoprstuvwz|
no::conll
+-<>abcdefgijklmnoprstuvy
pl::conll2009
    123:_abcdefgijklmnopqrstuvwxz
pl::ipipan
123:abcdefgijklmnopqrstuvwxz
pt::conll
    -/123<>?ABCDEFIJKLMNOPQRSTUV_abcdefghijklmnopqrstuvy|
ro::rdt
 .abcdefhijlmnoprstuvxz
ru::syntagrus
 -123ACDIJMNOPRSTUVАВГДЕЖЗИКЛМНОПРСТУФЧШЪЯ
sk::snk
+-1234567ABDEFGHIJKLMNOPQRSTUVWYabcdefghijkmnpstuvxyz
sl::conll
    -=ACDFGINOPRSTUV_abcdefghijlmnopqrstuvwxy|
sv::conll
    +?ABCDEFGHIJKMNOPQRSTUVWXY_
sv::hajic
-0ACDEFGHIMNOPQRSUVWX
sv::mamba
+?ABCDEFGHIJKMNOPQRSTUVWXY
ta::tamiltb
    #-123:=ABCDEFGHIJLMNOPQRSTUVWZ_abdeghijklmnopqrstuvwxyz|
ta::tamiltbv1
ta::tamiltbv1l2
te::conll
    +-0123ABCDEFGHIJKLMNOPQRSTUVWXY_abcdefgijklmnoprstuvwxy|
tr::conll
    123ABCDEFGHIJLNOPQRSVW_abcdefghijklmnopqrstuvwxy|
tr::trmorph
123:<>ACDEINOPQVabcdefghijlmnopqrstuvx
tr::trmorph022
123<>_abcdefgijklmnopqrstuvxyz
ur::conll
zh::conll
    +,0123456789ABCDEFGHIJKLMNOPSTV[]_abcdefghijkpqrstuv}
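
A character inventory like the ones listed above can be computed from any list of tags. How the listing above was actually produced is not stated in this thread, so the following Python sketch is only an assumption about one possible approach; `character_inventory` and the sample tags are illustrative.

```python
def character_inventory(tags):
    """Return the distinct characters used in a collection of tags, sorted by code point."""
    return "".join(sorted(set("".join(tags))))


# A few Penn-Treebank-style tags, chosen only for illustration.
sample_tags = ["NN", "VBZ", "PRP$", "-LRB-", "``"]
print(character_inventory(sample_tags))  # $-BLNPRVZ`
```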

spyysalo commented 10 years ago

"Do we really need to be able to use them as filenames, unquoted identifiers or whatever?"

No, not really. Allowing "special" non-alnum chars such as < will require rewriting parts of the documentation + visualization system (and some care from doc authors), but I won't try to claim that my technical debt is a serious argument for constraining the format :-)

On the other hand, if the format is (reasonably) widely adopted, there will hopefully be many other tools also, mostly not written by us. I think there is a reasonable argument that the format should generally favor making things difficult for data producers (small group) rather than consumers (larger group) when the two are in conflict.

(Then again, I don't think this primarily technical perspective should carry too much weight here. I'd be happy to follow whatever you and others prefer!)

spyysalo commented 10 years ago

@dan-zeman : thank you for the comprehensive injection of data into the discussion! It's certainly true that these tagsets involve a very broad range of characters.

(Sidenote: above, the characters @ and ` (I think) triggered special processing in GitHub's comment system, and some characters in the hi::conll set are rendered blank in my (reasonably modern) browser.)

dan-zeman commented 10 years ago

Yes, I'd like to hear others, too. Originally I thought that we may want to restrict the charset here but as I saw how much we would have to restrict it if it should be effective, I wonder if it's worth doing. As a consumer, I think that the original tags (say, PDT tags) are useful only if they look as they do in PDT. Otherwise, I can as well ignore them and go directly to the universal tag and features (provided the same information is contained there).

These are just pieces of data from outer sources. So now I tend to believe that restricting their content does not have to be stronger than restrictions placed on FORM and LEMMA.

manning commented 10 years ago

On _ vs. - in dependency names: I can see - being nicer too. But we have historically used _ for years in our "collapsed" ("enhanced") dependencies, so it might just be easiest/nicest to retain it. I'm okay to recommend - going forward and to only allow them medially.

On tag sets: I think it is a reasonably big nuisance if you can't use existing widely used tag sets in the POSTAG column, and, as noted exhaustively by Dan, a lot of characters are used. Even that English-centric most common one, the Penn Tag set, has quite a few characters that are special by certain standards: # $ ' ` , . : I think it would be far preferable to allow them in CoNLL-U. Unfortunately, that means that there is already one there that is problematic in filenames (:), so some form of encoding is probably needed if you are using these as filenames (though I don't think that's needed for POSTAG). While non-Latin1 characters could be excluded, it makes it all a bit English/Latinate writing system-centric, so I'd be tempted to allow a broad range of printable characters in the POSTAG column. The Hindi characters in hi::conll render fine in my browser. :) I doubt it is a browser issue; probably a font issue.

spyysalo commented 10 years ago

Agreed, losing the ability to represent PTB tags would be quite a bit of a negative. I'm happy to withdraw my earlier suggestion and have updated the format document to remove the previously suggested constraint.

It also appears there is full agreement on allowing underscore in deprel but encouraging dash for new types. I'll add this to the docs.

spyysalo commented 10 years ago

(Meta: I'm not sure why this issue was left open in the tracker, the open questions appear to have been decided. For reference, I'm now interpreting the conclusion re allowed POSTAG characters simply as Unicode [[:graph:]], i.e. any visible character.)
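
Python's `re` module has no POSIX character classes, so a check along the lines of [[:graph:]] has to be approximated. The sketch below excludes whitespace plus the general categories Cc, Cs and Cn; the function names are hypothetical, and rejecting the empty string is an extra assumption not discussed in this thread.

```python
import unicodedata

# General categories excluded here: control, surrogate, unassigned.
NON_GRAPH_CATEGORIES = {"Cc", "Cs", "Cn"}


def is_graph_char(ch):
    """Approximate [[:graph:]]: not whitespace, control, surrogate or unassigned."""
    return not ch.isspace() and unicodedata.category(ch) not in NON_GRAPH_CATEGORIES


def is_valid_postag(tag):
    """Accept a tag if it is non-empty (an assumption) and every character passes."""
    return bool(tag) and all(is_graph_char(ch) for ch in tag)


for tag in ["PRP$", "ЖЕН", "<", "NN VB", "A\tB", ""]:
    print(repr(tag), is_valid_postag(tag))
```

Note that invisible format characters (category Cf, e.g. the zero-width joiner) still pass this check, so "any visible character" is only approximately what it enforces.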

UniversalDependencies / docs

format documentation: some (mostly) HEAD and DEPS questions #39