UniversalDependencies / tools

Various utilities for processing the data.
GNU General Public License v2.0

Spec for the validator? #20

Closed: vcvpaiva closed this issue 5 years ago

vcvpaiva commented 7 years ago

Question: where is the documentation on which rules the validator checks? I know it checks for a single root, but what else?

martinpopel commented 7 years ago

The validator's specification is here, but some rules are described only in the source code (validate.py) and follow other guidelines from the UD specification (e.g. that the set of language-specific deprels, or the forms/lemmas that are allowed to contain a space, must be listed in a special file).

Note that for CoNLL2017 parsers' output some rules will not be enforced (e.g. those regarding SpaceAfter=No, sent_id, text, space in forms/lemmas).

A full list of the validator rules (error messages) can be obtained with ack -o 'warn\(u(.)(.+)\1' validate.py | cut -c7-:

"Spurious empty line.",u"Format"
"Spurious comment line.",u"Format"
"The line has %d columns, but %d are expected."%(len(cols),COLCOUNT),u"Format"
"Spurious line: '%s'. All non-empty lines should start with a digit or the # character."%(line),u"Format"
"Missing empty line after the last tree.",u"Format"
"Spurious sent_id line: '%s' Should look like '# sent_id = xxxxxx' where xxxx is not whitespace. Forward slash reserved for special purposes." %c,u"Metadata"
"Missing the sent_id attribute.",u"Metadata"
"Multiple sent_id attribute.",u"Metadata"
"Non-unique sent_id the sent_id attribute: "+sid,u"Metadata"
"The forward slash is reserved for special use in parallel treebanks: "+sid,u"Metadata"
"Missing the text attribute.",u"Metadata"
"Multiple text attributes.",u"Metadata"
"The text attribute must not end with a whitespace",u"Metadata"
"NoSpaceAfter=Yes should be replaced with SpaceAfter=No",u"Metadata"
"There should not be a SpaceAfter=No entry for empty words",u"Metadata"
"Non-integer range %s-%s (%s)"%(beg,end,e),u"Format"
"There should not be a SpaceAfter=No entry for words which are a part of a token",u"Metadata"
"Mismatch between the text attribute and the FORM field. Form is '%s' but text is '%s...'"%(cols[FORM],stext[:len(cols[FORM])+20]),u"Metadata"
"SpaceAfter=No is missing in the MISC field of node #%s because the text is '%s'"%(cols[ID],shorten(cols[FORM]+stext)),u"Metadata"
"Extra characters at the end of the text attribute, not accounted for in the FORM fields: '%s'"%stext,u"Metadata"
"Unexpected ID format %s" % cols[ID], u"Format"
"Empty value in column %s"%(COLNAMES[col_idx]),u"Format"
"Initial whitespace not allowed in column %s"%(COLNAMES[col_idx]),u"Format"
"Trailing whitespace not allowed in column %s"%(COLNAMES[col_idx]),u"Format"
"White space not allowed in the %s column: '%s'"%(COLNAMES[col_idx],cols[col_idx]),u"Format"
"'%s' in column %s is not on the list of exceptions allowed to contain whitespace (data/tokens_w_space.ud and data/tokens_w_space.LANG files)."%(cols[col_idx],COLNAMES[col_idx]),u"Format"
"A token line must have '_' in the column %s. Now: '%s'."%(COLNAMES[col_idx],cols[col_idx]),u"Format"
"An empty node must have '_' in the column %s. Now: '%s'."%(COLNAMES[col_idx],cols[col_idx]),u"Format"
"Morphological features must be sorted: '%s'"%feats,u"Morpho"
"Spurious morphological feature: '%s'. Should be of the form attribute=value and must start with [A-Z0-9] and only contain [A-Za-z0-9]."%f,u"Morpho"
"Repeated feature values are disallowed: %s"%feats,u"Morpho"
"If an attribute has multiple values, these must be sorted as well: '%s'"%f,u"Morpho"
"Incorrect value '%s' in '%s'. Must start with [A-Z0-9] and only contain [A-Za-z0-9]."%(v,f),u"Morpho"
"Unknown attribute-value pair %s=%s"%(attr,v),u"Morpho"
"Repeated features are disallowed: %s"%feats, u"Morpho"
"Unknown UPOS tag: %s"%cols[UPOSTAG],u"Morpho"
"Unknown XPOS tag: %s"%cols[XPOSTAG],u"Morpho"
"Unknown UD DEPREL: %s"%cols[DEPREL],u"Syntax"
"Malformed head:deprel pair '%s'"%head_deprel,u"Syntax"
"Unknown dependency relation '%s' in '%s'"%(deprel,head_deprel),u"Syntax"
"Failed for parse DEPS: %s" % cols[DEPS],u"Syntax"
"Spurious token interval definition: '%s'."%cols[ID],u"Format"
"Multiword range not before its first word",u"Format"
"Empty node id %s, expected %d.%d"
"Words do not form a sequence. Got: %s."%(u",".join(unicode(x) for x in words)),u"Format"
"Suprious token interval %d-%d"%(b,e),u"Format"
"Suprious token interval %d-%d"%(b,e),u"Format"
"Undefined ID in HEAD: %s" % cols[HEAD],u"Format"
"Failed for parse DEPS: %s" % cols[DEPS],u"Format"
"Undefined ID in DEPS: %s" % head,u"Format"
"Loop from %s" % dependent,u"Syntax"
"Failed to parse ID %s" % cols[ID],u"Format"
"Invalid range: %s" % cols[ID],u"Format"
"Range overlaps with others: %s" % cols[ID],u"Format"
'DEPREL must be "root" if HEAD is 0'
'DEPREL can only be "root" if HEAD is 0'
"Failed to parse DEPS: %s" % cols[DEPS],u"Format"
"DEPS not sorted by head index: %s" % cols[DEPS],u"Format"
"Non-numeric ID: %s" % cols[ID],u"Format"
"ID in DEPS for %s" % cols[ID],u"Format"
"Empty head for word ID %s" % cols[ID], u"Format"
"Non-integer ID: %s" % cols[ID], u"Format"
"Non-integer head for word ID %s" % cols[ID], u"Format"
"HEAD == ID for %s" % cols[ID], u"Format"
"Multiple root words: %s"%list(root_deps), u"Syntax"
"Non-tree structure. Words %s are not reachable from the root 0."%(u",".join(unicode(w) for w in sorted(unreachable))),u"Syntax"
"Spurious language-specific relation '%s' - not an extension of any UD relation."%v,u"Syntax"
"Spurious language-specific relation '%s' - not an extension of any UD relation."%v,u"Syntax"
"The language-specific file data/deprel.%s could not be found. Dependency relations will not be checked.\nPlease add the language-specific dependency relations using python conllu-stats.py --deprels=langspec yourdata/*.conllu > data/deprel.%s\n Also please check that file for errorneous relations. It's okay if the file is empty, but it must exist.\n\n"%(args.lang,args.lang),"Language specific data missing"
"The language-specific file data/feat_val.%s could not be found. Feature=value pairs will not be checked.\nPlease add the language-specific pairs using python conllu-stats.py --catvals=langspec yourdata/*.conllu > data/feat_val.%s It's okay if the file is empty, but it must exist.\n \n\n"%(args.lang,args.lang),"Language specific data missing"
"Exception caught!",u"Format"
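
For readers without ack installed, the same listing can be approximated in Python. This is only a sketch: it reuses the regular expression from the command above and assumes each warn(u...) call sits on a single line; the function name list_warn_messages and the two sample lines are hypothetical, not taken from validate.py.

```python
import re

# Same pattern as the ack command: "warn(u", a quote character,
# greedy text, then the same quote character again.
WARN_CALL = re.compile(r"warn\(u(.)(.+)\1")


def list_warn_messages(source):
    """Return each matched warn(u...) call with the leading 'warn(u' cut off.

    Cutting the prefix plays the role of `cut -c7-` in the shell pipeline.
    """
    messages = []
    for line in source.splitlines():
        match = WARN_CALL.search(line)
        if match:
            messages.append(match.group(0)[len("warn(u"):])
    return messages


# Hypothetical excerpt standing in for the real validate.py source.
sample = "\n".join([
    'warn(u"Spurious empty line.",u"Format")',
    "warn(u'DEPREL must be \"root\" if HEAD is 0')",
])

for message in list_warn_messages(sample):
    print(message)
```

To run it on the real file, read validate.py into a string and pass that string to list_warn_messages.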

vcvpaiva commented 7 years ago

Many thanks for the reply, @martinpopel. I wouldn't say that the spec you pointed me to is a spec of the validator; I think it's a spec of the CoNLL-U representation. But the list of error messages is exactly what I was looking for, as it shows the things that can be checked automatically.
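
To make "automatically checked" concrete, here is a minimal sketch of how a few of the listed rules could be tested on one sentence. It is not the code from validate.py: the function check_sentence is hypothetical, only the message texts are copied from the list above, and multiword ranges and empty nodes are simply skipped.

```python
COLCOUNT = 10  # CoNLL-U columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
ID, HEAD, DEPREL = 0, 6, 7


def check_sentence(token_lines):
    """Return error messages for a handful of rules (illustrative only)."""
    problems = []
    roots = []
    for line in token_lines:
        cols = line.split("\t")
        if len(cols) != COLCOUNT:
            problems.append("The line has %d columns, but %d are expected."
                            % (len(cols), COLCOUNT))
            continue
        if not cols[ID].isdigit():
            continue  # this sketch ignores ranges such as 1-2 and empty nodes such as 1.1
        if cols[HEAD] == cols[ID]:
            problems.append("HEAD == ID for %s" % cols[ID])
        if cols[HEAD] == "0":
            roots.append(cols[ID])
            if cols[DEPREL] != "root":
                problems.append('DEPREL must be "root" if HEAD is 0')
        elif cols[DEPREL] == "root":
            problems.append('DEPREL can only be "root" if HEAD is 0')
    if len(roots) > 1:
        problems.append("Multiple root words: %s" % roots)
    return problems


# Hypothetical two-word sentence in which both words claim HEAD 0.
sentence = [
    "\t".join(["1", "Dogs", "dog", "NOUN", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["2", "bark", "bark", "VERB", "_", "_", "0", "nsubj", "_", "_"]),
]
print(check_sentence(sentence))
```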

vcvpaiva commented 7 years ago

@martinpopel just in case you haven't noticed, your error messages have a typo ("Suprious" instead of "Spurious") in "Suprious token interval %d-%d"%(b,e),u"Format", which appears twice in the list.

martinpopel commented 7 years ago

> I wouldn't say that the spec you pointed me to is a spec of the validator, it's a spec of the conll-u representations

Yes (but most of the validate.py tests concern CoNLL-U format validity).

I am closing this issue. Feel free to reopen it if there is something important missing in the CoNLL-U specification. However, I think proper documentation of validate.py and its levels of validity must wait until #1 is solved.

olesar commented 5 years ago

Please add 'б' as AUX for Russian 'ru': ['быть', 'бы', 'б']

dan-zeman commented 5 years ago

Done.