UniversalDependencies / tools

Various utilities for processing the data.
GNU General Public License v2.0

Relaxed criteria #1

Closed by fginter 5 years ago

fginter commented 10 years ago

Allow a basic CoNLL-U format check without the extra character set and symbol list restrictions imposed by UD.

martinpopel commented 7 years ago

As discussed elsewhere, this format-only check could either

It is worth considering which tests should be included in the format-only version. We agree that language-specific deprels and spaces in forms/lemmas should be allowed. What about

dan-zeman commented 5 years ago

The new validator can test at 5 levels (selected with the --level option):

  1. Only backbone CoNLL-U. Fields can be empty, except for the ID column.
  2. Universal-level. UPOS, HEAD, DEPREL and DEPS must have expected values. Language-specific extensions are not checked and all are allowed (i.e., there can be any feature-value pair in FEATS).
  3. Universal content-based tests. E.g. conj must go left-to-right.
  4. Language-specific features, deprels, tokens with spaces. You can still use lang "ud" with this level to require that there are no language-specific extensions.
  5. Language-specific content-based tests. E.g. list of lemmas that can be auxiliary verbs in the language.
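To make level 1 concrete, here is a minimal sketch of what a backbone-only check could look like. It is not the validator's actual code: the function name `check_backbone`, the ID pattern, and the exact set of rules (10 tab-separated columns, a non-empty well-formed ID, every other field allowed to be empty) are assumptions made for illustration.

```python
import re

# Assumed ID shapes: a word index (3), a multiword token range (3-4),
# or an empty node (3.1). Illustrative only.
ID_PATTERN = re.compile(r"^(\d+|\d+-\d+|\d+\.\d+)$")
NUM_COLUMNS = 10  # CoNLL-U token lines have 10 tab-separated fields


def check_backbone(text):
    """Return a list of (line_number, message) pairs for format problems.

    Hypothetical level-1 style check: only the column count and the ID
    column are inspected; all other fields may be empty or hold anything.
    """
    errors = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        # Blank lines separate sentences; '#' lines are comments.
        if line == "" or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != NUM_COLUMNS:
            errors.append(
                (lineno, f"expected {NUM_COLUMNS} columns, found {len(fields)}")
            )
            continue
        if not ID_PATTERN.match(fields[0]):
            errors.append((lineno, f"invalid ID {fields[0]!r}"))
    return errors


if __name__ == "__main__":
    token = "\t".join(["1", "Hi"] + [""] * 8)      # empty fields are accepted
    missing_id = "\t".join([""] + ["_"] * 9)       # ID must not be empty
    sample = "\n".join(["# text = Hi", token, "", missing_id, ""])
    for lineno, message in check_backbone(sample):
        print(f"line {lineno}: {message}")
```

Everything the higher levels add (expected UPOS/DEPREL values, content-based tests, language-specific lists) is deliberately left out here, which is the point of a format-only mode.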