UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Feature documentation tools/data/feats.json #1055

Open jheinecke opened 2 months ago

jheinecke commented 2 months ago

I wonder whether I correctly understand how data/feats.json defines which features can (or cannot) go with which UPOS. For example, for English there is the following definition for the feature Tense:

  "Tense": {
    "type": "universal",
    "doc": "local",
    "permitted": 1,
    "errors": [],
    "uvalues": [
      "Past",
      "Pres"
    ],
    "lvalues": [],
    "unused_uvalues": [],
    "unused_lvalues": [],
    "evalues": [],
    "byupos": {
      "ADJ": {
        "Pres": 1
      },
      "AUX": {
        "Past": 7279,
        "Pres": 13130
      },
      "NOUN": {
        "Past": 1
      },
      "SCONJ": {
        "Past": 4,
        "Pres": 1
      },
      "VERB": {
        "Past": 21725,
        "Pres": 14757
      }
    }
  }

Does this mean that tokens with the UPOS NOUN (only value Past) or SCONJ (values Past and Pres) can have a feature Tense? Or is this an error, and this list was created automatically by scanning all English treebanks? There is in fact one token with UPOS NOUN and feature Tense in UD_English-PUD/en_pud-ud-test.conllu (sentence n02034018). Other features like VerbForm also list NOUN as a possible UPOS (or Degree, which can go with PUNCT: one instance in UD_English-GUM/en_gum-ud-train.conllu).

For French, the feature Mood can go with PRON (example in UD_French-FQB/fr_fqb-ud-test.conllu) and Gender with ADP (also in French-FQB). feats.json also allows Tense for NOUN in French, even though there is no instance in any of the French treebanks.

Maybe I have misunderstood the structure of this file (I could not find any documentation either), or are these things to be corrected? The reason I ask is that I would like to use this file in ConlluEditor to disallow annotating features that are invalid for a given UPOS.

nschneid commented 2 months ago

I'll let @dan-zeman answer this—not sure why some values appear to be binary (e.g. for Degree) while others appear to be frequency counts (e.g. Tense).

Tense should only apply to VERB and AUX in English. LinES and PUD each have 1 error. I can fix these.

For Degree, I see in the validator settings that a much wider range of UPOSes are allowed for English than most languages. I don't know why this is the case. The GUM instances that are not ADJ or ADV appear to be errors.

jheinecke commented 2 months ago

I just had a closer look, there are many strange things for many languages:

dan-zeman commented 2 months ago

These JSON files were originally meant to be written and read by my scripts only, which is why they are undocumented and sometimes messy. This should be improved when I have time. I did not foresee the use case with annotation tools but it makes perfect sense.

Parts of the file are artifacts of the transition from an older validation procedure to the new one. The permitted UPOS-Feature-Value triples were initialized by collecting their occurrences from the treebanks, so that no dataset would become invalid just because this type of test was introduced. Once initialized, people can edit them here, and then the value will be boolean. The validator will allow a triple if it finds it in the JSON with any nonzero value (if it is 7279, it is clearly the count from some version of the data; if it is 1, it may be the result of manual editing, or also a count in the case of rare features).
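The rule described above (a triple is allowed when it appears under byupos with any nonzero value) can be sketched in Python as follows. This is only a minimal illustration, not the validator's actual code: the function name `is_triple_allowed` is hypothetical, it looks only at the `byupos` part of a single feature record such as the English Tense record quoted earlier in this thread, and it ignores the other fields (e.g. `permitted`), whose exact role is not described here.

```python
def is_triple_allowed(feature_record, upos, value):
    """Return True if byupos lists (upos, value) with a nonzero number.

    Hypothetical helper: the number may be a corpus count or a manually
    set flag; per the description above, only "nonzero" matters.
    """
    by_upos = feature_record.get("byupos", {})
    return by_upos.get(upos, {}).get(value, 0) != 0


# Abridged copy of the English Tense record quoted earlier in this thread.
tense_en = {
    "byupos": {
        "AUX": {"Past": 7279, "Pres": 13130},
        "NOUN": {"Past": 1},
        "VERB": {"Past": 21725, "Pres": 14757},
    }
}

print(is_triple_allowed(tense_en, "VERB", "Past"))  # True
print(is_triple_allowed(tense_en, "NOUN", "Past"))  # True (the single PUD token)
print(is_triple_allowed(tense_en, "NOUN", "Pres"))  # False
```

Note that the second call returns True even though Tense on NOUN is an annotation error: the check cannot tell a count of 1 from a deliberate manual approval.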

Now the important question: what is or is not an error? Clearly there are many triples that the JSON file (and therefore the validator) allows although they should not be allowed. For example, SCONJ should not have Tense: definitely not in English, and probably not in most other languages. Unfortunately, with the current version of the infrastructure, it is not advisable to simply go to the page referenced above and uncheck Tense for SCONJ, unless the person doing so is also able to fix the treebanks where this combination occurs.

Ideally we would want to be able to uncheck the wrong combination and leave it up to the treebank maintainers to fix the data. But then they should get a four-year grace period to do so. This is the standard procedure with tests that I implement directly in the validator script. But it does not work (yet) with the feature registration system, where anybody can edit the features, and if a feature is newly disallowed, the treebanks that have it will immediately become invalid (as opposed to LEGACY).

jheinecke commented 1 month ago

Is the information about what has been set manually via the link you gave and what has been counted still available (for those cases where the value is 1)? In this case, could we have (temporarily) a second file (feats2.json) which only contains the Feature/UPOS correspondences edited manually, or those which are frequent enough to be counted as valid (say when 15% of all nouns have Gender=Fem in a given language, then this seems to be correct). For this file we could also delete features which are most likely invalid for any language (like PronType for PUNCT).

jheinecke commented 1 month ago

Apparently there are also cases where a feature is allowed for a given UPOS but never occurs in the data, because it is a wrong assignment (and not one reserved for future data), e.g. French allows Definite for NOUN, ADJ and PRON, but it is (correctly) never used in any of the French treebanks.

dan-zeman commented 1 month ago

The information about what has been set manually is not specifically saved anywhere but there are two sources from which you could deduce it. First, there is the git history. It will tell you that the initialization from data occurred in December 2020. Any later modifications would be either manual edits or pre-generated records when a new language was added to UD (but these would have no data-induced feature counts). The second (and arguably easier to use) source of information is the lastchanger field, which only appears in manually edited records, e.g.:

"lastchanged": "2022-10-27-19-59-14", "lastchanger": "alexeykosh"

Nevertheless, it may not be very informative about features you may want to disallow (while the validator still accepts them), for the reasons I indicated earlier: people are discouraged from removing things, so manual edits typically mean that something was added.
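As a rough sketch of the second approach, the following Python snippet picks out the records that carry a lastchanger field. It rests on assumptions: that `records` maps feature names to feature records for one language, and that lastchanger and lastchanged are keys of the feature record itself. The function name `manually_edited` and the feature name `ExampleFeat` are made up for illustration; only the two field values are taken from the example above.

```python
from datetime import datetime


def manually_edited(records):
    """Map feature name -> (lastchanger, lastchanged as datetime or None).

    Assumption: `records` maps feature names to feature records, and a
    record has a "lastchanger" key only if it was edited manually.
    """
    edited = {}
    for name, record in records.items():
        if "lastchanger" not in record:
            continue  # no manual edit recorded for this feature
        stamp = record.get("lastchanged")
        # Timestamps look like "2022-10-27-19-59-14".
        when = datetime.strptime(stamp, "%Y-%m-%d-%H-%M-%S") if stamp else None
        edited[name] = (record["lastchanger"], when)
    return edited


# Toy input: "ExampleFeat" is a hypothetical feature name.
records = {
    "Tense": {"byupos": {"VERB": {"Past": 21725}}},
    "ExampleFeat": {
        "lastchanged": "2022-10-27-19-59-14",
        "lastchanger": "alexeykosh",
        "byupos": {"ADJ": {"Yes": 1}},
    },
}
edited = manually_edited(records)
print(sorted(edited))  # ['ExampleFeat']
```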

dan-zeman commented 1 month ago

> could we have (temporarily) a second file (feats2.json) which only contains the Feature/UPOS correspondences edited manually, or those which are frequent enough to be counted as valid (say when 15% of all nouns have Gender=Fem in a given language, then this seems to be correct). For this file we could also delete features which are most likely invalid for any language (like PronType for PUNCT).

I think you could read the current feats.json and generate feats2.json for your purposes as you see fit. You could even allow the user to add their own requirements (e.g. that every VERB that has VerbForm=Part must have non-empty value of the Gender feature). You would just need to make sure that the generation can be re-run every time the source feats.json changes.
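One possible shape for such a generation step, sketched in Python under stated assumptions: the function name `filter_feature_record`, the absolute threshold `min_count`, and the `blocked` set of (feature, UPOS) pairs are all hypothetical choices, and the sketch works on one feature record at a time because only that level of the file is shown in this thread. A relative threshold such as "15% of all nouns" would additionally need per-UPOS token totals, which the record shown does not contain.

```python
def filter_feature_record(name, record, min_count=10, blocked=frozenset()):
    """Return a copy of `record` whose byupos keeps only trusted entries.

    Hypothetical filter: drops (feature, UPOS) pairs listed in `blocked`
    and UPOS-value entries whose number is below `min_count`.
    """
    kept = {}
    for upos, values in record.get("byupos", {}).items():
        if (name, upos) in blocked:
            continue
        frequent = {v: n for v, n in values.items() if n >= min_count}
        if frequent:
            kept[upos] = frequent
    filtered = dict(record)  # shallow copy; byupos is replaced below
    filtered["byupos"] = kept
    return filtered


# byupos part of the English Tense record quoted earlier in this thread.
tense_en = {
    "byupos": {
        "ADJ": {"Pres": 1},
        "AUX": {"Past": 7279, "Pres": 13130},
        "NOUN": {"Past": 1},
        "SCONJ": {"Past": 4, "Pres": 1},
        "VERB": {"Past": 21725, "Pres": 14757},
    }
}

strict = filter_feature_record("Tense", tense_en, min_count=10)
print(sorted(strict["byupos"]))  # ['AUX', 'VERB']
```

An absolute threshold also discards entries with value 1 that were approved manually, so a real script would probably want to exempt records that carry a lastchanger field. The result could be written out with the standard json module and regenerated whenever the source feats.json changes.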

jheinecke commented 1 month ago

> The information about what has been set manually is not specifically saved anywhere but there are two sources from which you could deduce it. First, there is the git history. It will tell you that the initialization from data occurred in December 2020. Any later modifications would be either manual edits or pre-generated records when a new language was added to UD (but these would have no data-induced feature counts). The second (and arguably easier to use) source of information is the lastchanger field, which only appears in manually edited records, e.g.:
>
> "lastchanged": "2022-10-27-19-59-14", "lastchanger": "alexeykosh"

The lastchanged field may indeed help, but even then features can be wrongly assigned (like Degree, a valid feature for PUNCT in English :-). I was looking for something updated automatically on the UD GitHub, not something I would maintain with ConlluEditor. If I wrote a script producing feats2.json from feats.json and the current UD data, could this run somewhere on the UD GitHub?

dan-zeman commented 1 month ago

I don't think it would make sense to update it automatically on GitHub, also because various people and various applications may prefer different ways of modifying the file.

amir-zeldes commented 1 month ago

> The GUM instances that are not ADJ or ADV appear to be errors

Yes, sorry about that - those have all been fixed upstream already and will propagate to the UD repo on the next release.