UniversalDependencies / UD_English-GUM

Hyphens in wikification should be escaped #40

Closed martinpopel closed 2 years ago

martinpopel commented 2 years ago

In UDv2.9, all the GUM files use global.Entity = entity-GRP-infstat-MIN-coref_type-identity, suggesting there will be 6 attributes in each Entity.

First, there is the question of what to do when the identity (aka wikification) is missing. It may be easier for parsing to always require 6 attributes and keep the wikification as an empty string, i.e. end the Entity with a hyphen. But the current practice (there can be just 5 attributes) is acceptable as well.
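To illustrate why both conventions can be handled the same way, here is a minimal sketch of a reader that pads missing trailing attributes with empty strings. The function name `parse_open_mention` is hypothetical, and the sketch assumes a single opening mention with no closing bracket; it is not the parser used by any UD tool.

```python
# Attribute names taken from the global.Entity declaration quoted above.
TEMPLATE = "entity-GRP-infstat-MIN-coref_type-identity".split("-")


def parse_open_mention(mention: str) -> dict:
    """Split one opening mention such as '(abstract-182-new-6-coref'.

    Hypothetical helper: handles only a single opening bracket.
    """
    values = mention.lstrip("(").split("-")
    if len(values) > len(TEMPLATE):
        # More fields than declared usually means an unescaped hyphen.
        raise ValueError(f"too many attributes in {mention!r}")
    # Pad omitted trailing attributes with empty strings.
    values += [""] * (len(TEMPLATE) - len(values))
    return dict(zip(TEMPLATE, values))


five = parse_open_mention("(abstract-182-new-6-coref")
six = parse_open_mention("(abstract-182-new-6-coref-")
```

Under these assumptions, the 5-attribute form and the 6-attribute form with a trailing hyphen yield the same dictionary, with `identity` set to the empty string.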

However, Entity=(abstract-182-new-6-coref-Pearson's_chi-squared_test should be converted to Entity=(abstract-182-new-6-coref-Pearson's_chi%2Dsquared_test, I think.
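A minimal sketch of the escaping I have in mind, assuming reserved characters are percent-encoded by their ASCII code (so a hyphen becomes %2D). The function name `escape_identity` and the default set of reserved characters are assumptions for illustration, not the actual GUM conversion code.

```python
def escape_identity(identity: str, reserved: str = "-()") -> str:
    """Percent-encode reserved characters in an identity (wikification) value.

    The default `reserved` set is an assumption; adjust it to the
    characters the Entity format actually reserves.
    """
    return "".join(
        f"%{ord(ch):02X}" if ch in reserved else ch
        for ch in identity
    )


escaped = escape_identity("Pearson's_chi-squared_test")
entity = "(abstract-182-new-6-coref-" + escaped
fields = entity.lstrip("(").split("-")
```

With the hyphen escaped, splitting the mention on hyphens gives exactly 6 fields, matching the 6 attributes declared in global.Entity.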

amir-zeldes commented 2 years ago

Absolutely, thanks for catching this! The unescaped hyphens are definitely a bug (the option for ignoring unused trailing features is by design, to promote compactness). Oddly, other reserved characters are escaped correctly, such as parentheses:

https://github.com/UniversalDependencies/UD_English-GUM/blob/master/not-to-release/sources/GUM_academic_games.conllu#L41

But somehow hyphens slipped through. I can fix this, but unless @dan-zeman thinks this bug is serious enough to warrant a patch, the fix will stay in the dev branch until the next UD release in May.

dan-zeman commented 2 years ago

There are many bugs and very few are "serious enough" :-) Stay in dev, May is not so far ahead.

amir-zeldes commented 2 years ago

OK, this should be fixed in dev now.