UniversalDependencies / tools

Various utilities for processing the data.
GNU General Public License v2.0

feats.ud used by conllu-stats.py removed from master #78

Closed (ksteimel closed this issue 3 years ago)

ksteimel commented 3 years ago

Not sure if this is expected or not, but on the master branch the data/feats.ud file is no longer present. It is there in the 2.7 release tag but seems to have been lost since then.

martinpopel commented 3 years ago

I think this was explained in the email sent by @dan-zeman on December 13, 2020 to the UD mailing list ud@stp.lingfil.uu.se:

Dear UDers,

as some of you have noticed, I am reworking the system used by our validator to check language-specific features and relations. There are several reasons for this, the main one being that every feature and relation should be documented, which unfortunately is not always the case.

The language-specific files in the tools repository, data/feat_val.xx and data/deprel.xx, will no longer be used and I have already removed them. Please do not create new ones. Enhanced dependencies are registered the old way for the time being, but data/edeprel.xx will be removed in the future, too.

Language-specific features and relations are now registered using a web-based interface that runs on the same server as our on-line validator, outside GitHub. Changes are automatically propagated to the tools repository, so you can take the validator script offline and it will still be able to check your data. However, do not modify the new data files in the tools repository! Any such modifications will be rewritten without notice.

You can register only features and relations that are properly documented. Every feature and relation must have its own page in the docs repository, and this page must meet certain requirements so the validator can recognize it. You should link the page from the index.md page of your language-specific documentation, but listing a relation without providing its page will not be enough. However, you do not have to provide language-specific documentation of universal features and relations! It is enough that they are documented globally. There are even some features and relations that are not universal (hence, technically they are language-specific) but they have been attested in many languages and their documentation is available globally: for example, nsubj:pass or acl:relcl. You do not have to document these separately for your language, although you can. If the system does not allow you to register a feature, you must document it for your language.

Important links:

Instructions on writing language-specific documentation: https://universaldependencies.org/contributing_language_specific.html

On-line registration of language-specific feature-value pairs: https://quest.ms.mff.cuni.cz/udvalidator/cgi-bin/unidep/langspec/specify_feature.pl

On-line registration of language-specific dependency relation subtypes: https://quest.ms.mff.cuni.cz/udvalidator/cgi-bin/unidep/langspec/specify_deprel.pl

On-line registration of lemmas of auxiliaries and copulas: https://quest.ms.mff.cuni.cz/udvalidator/cgi-bin/unidep/langspec/specify_auxiliary.pl

On-line validation report: http://quest.ms.mff.cuni.cz/udvalidator/cgi-bin/unidep/validation-report.pl

Feature-value pairs are now registered not only for individual languages, but also for individual part-of-speech categories. The validator does not yet report an error if a known feature is used with a wrong UPOS tag, but it will do so in the future. Consequently, even universal features have to be registered in the system, reflecting their usage in the given language. For the languages that are already in UD, I have initially allowed all combinations that occur at least once in any treebank of the language.

Please check the validity status of your treebank and provide any missing documentation. Let me know if you have any questions or if anything does not seem to work as expected.

Thanks!

Dan
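The per-language, per-UPOS registration described in the email can be illustrated with a short sketch. It is not the validator's code: the `REGISTRY` layout, the function names `parse_feats` and `check_token`, and the sample Czech entries are all hypothetical, and the email states that the validator did not yet report the wrong-UPOS case at the time of writing.

```python
# Hypothetical registry: language -> feature -> value -> permitted UPOS tags.
# The real validator data has its own format; this layout is an assumption.
REGISTRY = {
    "cs": {
        "Case": {"Nom": {"NOUN", "ADJ", "PRON"}, "Gen": {"NOUN", "ADJ", "PRON"}},
        "Tense": {"Past": {"VERB", "AUX"}},
    }
}


def parse_feats(feats):
    """Split a CoNLL-U FEATS string into (feature, value) pairs."""
    if feats == "_":
        return []
    pairs = []
    for item in feats.split("|"):
        name, _, values = item.partition("=")
        # A feature may carry several comma-separated values.
        for value in values.split(","):
            pairs.append((name, value))
    return pairs


def check_token(lang, upos, feats, registry=REGISTRY):
    """Return a list of problems found for one token's FEATS column."""
    problems = []
    lang_feats = registry.get(lang, {})
    for name, value in parse_feats(feats):
        allowed_upos = lang_feats.get(name, {}).get(value)
        if allowed_upos is None:
            # The pair is not registered for this language at all.
            problems.append(f"{name}={value} is not registered for '{lang}'")
        elif upos not in allowed_upos:
            # The pair is known, but not with this part-of-speech tag.
            problems.append(f"{name}={value} is not registered with {upos} in '{lang}'")
    return problems
```

With this sample registry, `check_token("cs", "NOUN", "Case=Nom")` returns an empty list, while the same feature on a `VERB` token yields one problem.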

ksteimel commented 3 years ago

Ah I see! My apologies!

dan-zeman commented 3 years ago

Hmm, the reason is indeed the one @martinpopel cites, but I didn't realize that there is another script that uses the file. I will have to think about what to do with it. Can you use conllu-stats.pl instead?

dan-zeman commented 3 years ago

I commented out the code where the file is loaded. The script should now be able to report the stats while treating all feature-value pairs as language-specific. I have not tested it, though. The script does not run on my system at all; I suspect that it requires Python 2.
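For readers who maintain their own copy of the script, an alternative to commenting out the loader is to make the file optional. The following Python 3 sketch is not taken from conllu-stats.py: the function names `load_universal_feats` and `classify` are hypothetical, and it assumes the file lists one `Feature=Value` pair per line. When the file is missing, the set is empty and every pair is treated as language-specific, which matches the behavior described above.

```python
import os


def load_universal_feats(path):
    """Return the feature=value pairs listed in `path` (assumed one per line).

    If the file does not exist, return an empty set so that every pair
    is later classified as language-specific.
    """
    if not os.path.isfile(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def classify(pair, universal):
    """Label a feature=value pair using the loaded universal set."""
    return "universal" if pair in universal else "language-specific"
```

Calling `load_universal_feats` on a path where the file has been removed gives an empty set, so `classify` labels every pair as language-specific without raising an error.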