Open jheinecke opened 1 month ago
I'll let @dan-zeman answer this. I'm not sure why some values appear to be binary (e.g. for Degree) while others appear to be frequency counts (e.g. Tense).
Tense should only apply to VERB and AUX in English. LinES and PUD each have 1 error. I can fix these.
For Degree, I see in the validator settings that a much wider range of UPOSes is allowed for English than for most languages. I don't know why this is the case. The GUM instances that are not ADJ or ADV appear to be errors.
I just had a closer look; there are many strange things for many languages: Number is assigned to UPOS "AdJ" (small d) and "SCON" (missing J).
These JSON files were originally meant to be written and read by my scripts only, which is why they are undocumented and sometimes messy. This should be improved when I have time. I did not foresee the use case with annotation tools, but it makes perfect sense.
Parts of the file are artifacts of the transition from an older validation procedure to the new one. The permitted UPOS-Feature-Value triples were initialized by collecting their occurrences from the treebanks, so that no dataset becomes invalid just by introducing this type of test. Once initialized, people can edit them here, and then the value will be boolean. The validator will allow a triple if it finds it in the JSON with any nonzero value (if it is 7279, it is clearly the count from some version of the data; if it is 1, it may be the result of manual editing, or also a count in the case of rare features).
Now the important question: what is and what is not an error? Clearly there are many triples that the JSON file (and therefore the validator) allows although they should not be allowed. For example, SCONJ should not have Tense: definitely not in English, and probably not in most other languages. Unfortunately, with the current version of the infrastructure, it is not advisable to simply go to the page referenced above and uncheck Tense for SCONJ, unless the person doing so is also able to fix the treebanks where this combination occurs.
Ideally we would want to be able to uncheck the wrong combination and leave it up to the treebank maintainers to fix the data. But then they should get a four-year grace period to do so. This is the standard procedure with tests that I implement directly in the validator script. But it does not work (yet) with the feature registration system, where anybody can edit the features, and if a feature is newly disallowed, the treebanks that have it will immediately become invalid (as opposed to LEGACY).
Is the information about what has been set manually via the link you gave, and what has been counted, still available (for those cases where the value is 1)? If so, could we have (temporarily) a second file (feats2.json) which only contains the Feature/UPOS correspondences edited manually, or those which are frequent enough to be counted as valid (say, when 15% of all nouns have Gender=Fem in a given language, this seems to be correct)? For this file we could also delete features which are most likely invalid for any language (like PronType for PUNCT).
Apparently there are also cases where a feature is allowed for a given UPOS but never occurs in the data, since it is a wrong assignment (and not assigned for future data). For example, French allows Definite for NOUN, ADJ and PRON, but it is (correctly) never used in any of the French treebanks.
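Such "allowed but never used" combinations can be listed by comparing the permitted pairs with the pairs observed in the data. A minimal sketch, assuming both sets have already been collected as (UPOS, feature) tuples; the function name and the toy sets are hypothetical, and only the Definite example is taken from the observation above.

```python
def unused_permitted(permitted, observed):
    """Return the permitted (UPOS, feature) pairs that never occur in the data."""
    return sorted(set(permitted) - set(observed))


# Toy input: the three Definite pairs mirror the French example; the
# Gender pair is invented to show a combination that is actually used.
permitted_pairs = {
    ("NOUN", "Definite"),
    ("ADJ", "Definite"),
    ("PRON", "Definite"),
    ("NOUN", "Gender"),
}
observed_pairs = {("NOUN", "Gender")}

for upos, feature in unused_permitted(permitted_pairs, observed_pairs):
    print(upos, feature)
```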
The information about what has been set manually is not specifically saved anywhere, but there are two sources from which you could deduce it. First, there is the git history. It will tell you that the initialization from data occurred in December 2020. Any later modifications would be either manual edits or pre-generated records when a new language was added to UD (but these would have no data-induced feature counts). The second (and arguably easier to use) source of information is the lastchanger field, which only appears in manually edited records, e.g.:
"lastchanged": "2022-10-27-19-59-14", "lastchanger": "alexeykosh"
Nevertheless, it may not be too informative about features you may want to disallow (while the validator still accepts them), for the reasons I indicated earlier: people are discouraged from removing stuff, so manual edits typically mean that they added stuff.
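Picking out the manually edited records via the lastchanger field could look like the sketch below. It assumes the per-language records are available as a dict keyed by feature name; that layout, the function name, and the placeholder feature names are assumptions. Only the two field names and the sample values come from the example above.

```python
def manually_edited(features):
    """Yield (feature, lastchanger, lastchanged) for records with a lastchanger field."""
    for name, record in sorted(features.items()):
        if "lastchanger" in record:
            yield name, record["lastchanger"], record.get("lastchanged")


# Placeholder records; "FeatureA" and "FeatureB" are not real feature names.
toy_records = {
    "FeatureA": {"lastchanged": "2022-10-27-19-59-14", "lastchanger": "alexeykosh"},
    "FeatureB": {},  # no lastchanger: initialized from data or pre-generated
}

for feature, who, when in manually_edited(toy_records):
    print(feature, who, when)
```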
> could we have (temporarily) a second file (feats2.json) which only contains the Feature/UPOS correspondences edited manually, or those which are frequent enough to be counted as valid (say when 15% of all nouns have Gender=Fem in a given language, then this seems to be correct). For this file we could also delete features which are most likely invalid for any language (like PronType for PUNCT).
I think you could read the current feats.json and generate feats2.json for your purposes as you see fit. You could even allow the user to add their own requirements (e.g. that every VERB that has VerbForm=Part must have a non-empty value of the Gender feature). You would just need to make sure that the generation can be re-run every time the source feats.json changes.
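A re-runnable filter along these lines might look as follows. It is a sketch under several assumptions: the record layout (a `byupos` mapping plus an optional `lastchanger`), the function name `filter_features`, and the separate `upos_totals` argument are hypothetical, since the per-UPOS token totals would have to be counted from the treebanks. The 15% threshold is the one proposed earlier in the thread.

```python
def filter_features(features, upos_totals, min_share=0.15):
    """Keep a triple if its record was manually edited, or if its count
    covers at least `min_share` of all tokens with that UPOS."""
    kept = {}
    for feature, record in features.items():
        manual = "lastchanger" in record
        for upos, values in record.get("byupos", {}).items():
            total = upos_totals.get(upos, 0)
            for value, count in values.items():
                frequent = total > 0 and count / total >= min_share
                if count and (manual or frequent):
                    kept.setdefault(feature, {}).setdefault(upos, {})[value] = count
    return kept


# Invented toy numbers for one language.
toy_features = {
    "Gender": {"byupos": {"NOUN": {"Fem": 400, "Masc": 550}}},
    "PronType": {"byupos": {"PUNCT": {"Rel": 2}}},
    "Polite": {"lastchanger": "someone", "byupos": {"PRON": {"Form": 1}}},
}
toy_totals = {"NOUN": 1000, "PUNCT": 500, "PRON": 100}

print(filter_features(toy_features, toy_totals))
```

Because the function only reads its inputs and builds a fresh dict, it can be re-run whenever feats.json or the treebank counts change, and the result can be written out with `json.dump`.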
> The information about what has been set manually is not specifically saved anywhere but there are two sources from which you could deduce it. First, there is the git history. It will tell you that the initialization from data occurred in December 2020. Any later modifications would be either manual edits or pre-generated records when a new language was added to UD (but these would have no data-induced feature counts). The second (and arguably easier to use) source of information is the lastchanger field, which only appears in manually edited records, e.g.:
> "lastchanged": "2022-10-27-19-59-14", "lastchanger": "alexeykosh"
The lastchanged field may indeed help, but even then features can be wrongly assigned (like Degree, a valid feature for PUNCT in English :-). I was looking for something updated automatically on the UD GitHub, and not something I would maintain with ConlluEditor. If I wrote a script producing feats2.json from feats.json and the current UD data, could this run somewhere on the UD GitHub?
I don't think it would make sense to update it automatically on GitHub, also because various people and various applications may prefer different ways of modifying the file.
> The GUM instances that are not ADJ or ADV appear to be errors
Yes, sorry about that - those have all been fixed upstream already and will propagate to the UD repo on the next release.
I wonder whether I correctly understand the definition of which features can go (or not) with which UPOS in data/feats.json. For example, for English there is the following definition for the feature Tense:
Does this mean that tokens with the UPOS NOUN (only value Past) or SCONJ (values Past and Pres) can have a feature Tense? Or is this an error, and this list was created automatically by scanning all English treebanks? There is in fact one token with UPOS NOUN and feature Tense in UD_English-PUD/en_pud-ud-test.conllu (sentence n02034018). Other features like VerbForm also list NOUN as a possible UPOS (or Degree, which can go with PUNCT; one instance in UD_English-GUM/en_gum-ud-train.conllu).
For French, the feature Mood can go with PRON (example in UD_French-FQB/fr_fqb-ud-test.conllu) and Gender with ADP (also in French-FQB). feats.json also allows Tense for NOUN for French, even though there is no instance in any of the French treebanks.
Maybe I have misunderstood the structure of this file (I could not find any documentation either), or are these things to be corrected? The reason I ask is that I would like to exploit this file in ConlluEditor to disallow the annotation of invalid features for a given UPOS.