UniversalDependencies / UD_Dutch-Alpino

Dutch data.
License: Creative Commons Attribution Share Alike 4.0 International

Updating UD_Dutch #1

Open · gossebouma opened this issue 6 years ago

gossebouma commented 6 years ago

I have more or less finished an alternative annotation for this treebank, using the annotation of this material as it currently is in our (Groningen, Alpino) treebanks, and using the same conversion script as was used to create UD_Dutch_LassySmall.

Bonuses: consistency between the two Dutch treebanks, no more information loss from working with the original CoNLL material, a known source for the material (the current UD_Dutch corpus is a mix of rather different corpus material), and easier maintenance in the future.

A (non-exhaustive) list of differences between the V2.0 version and this new conversion is here.

However, I am somewhat reluctant to simply upload this rather different version of the material (note that tokenization also differs slightly in places, that sentence segmentation differs in a few rare cases, and that the sources of some sentences could not be located).

Any advice on how to proceed?

jnivre commented 6 years ago

Personally, I am confident that this will be an improvement, so I would say you should just upload it. (Older versions can always be retrieved from old releases.) What do you say, @dan-zeman?

dan-zeman commented 6 years ago

Agreed. The old version had numerous problems, including tokenization, so I would not try to stay compatible with it.

gossebouma commented 6 years ago

I have now uploaded a first version of the updated material. For what it is worth: it passes the checks performed by the udapy script (which I use to re-attach punctuation according to UD guidelines).
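For reference, the punctuation re-attachment step presumably looks something like the following (an assumption on my part: udapi's ud.FixPunct block, with illustrative file names):

udapy -s ud.FixPunct < nl-ud-dev.conllu > nl-ud-dev.fixpunct.conllu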

Some remaining issues:

@dan-zeman suggestions?

After reconsidering (and on the advice of @gertjanvannoord), I have decided to leave the train/dev/test split as it is.

martinpopel commented 6 years ago

http://universaldependencies.org/validation.html still shows validation errors for UD_Dutch, mostly "Mismatch between the text attribute and the FORM field". If the raw sentence in the text comment is correct, some or most of these bugs (e.g. different quotation marks) can be fixed with: udapy -s ud.ComplyWithText < nl-ud-dev.conllu > fixed-dev.conllu

Note also that according to the guidelines, we should use SpaceAfter=No, not SpaceAfter=no (this will be fixed by ud.ComplyWithText as well).
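For illustration, a constructed CoNLL-U fragment (tab-separated, values made up, not taken from the actual data) with the capitalized SpaceAfter=No in the MISC column and Unicode quotation marks in FORM would look like this:

# text = „Ja."
1	„	„	PUNCT	_	_	2	punct	_	SpaceAfter=No
2	Ja	ja	INTJ	_	_	0	root	_	SpaceAfter=No
3	.	.	PUNCT	_	_	2	punct	_	SpaceAfter=No
4	"	"	PUNCT	_	_	2	punct	_	_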

I did https://github.com/UniversalDependencies/tools/commit/400a0521a0aa0c595

gossebouma commented 6 years ago

One big source of mismatches is opening and closing double quotes. For some reason, later versions of this material use different quotation styles. Another frequent difference (in nl-ud-dev) is news stories that start with a location. This location (not really part of the first sentence) has been removed in later versions. Finally, some spelling and grammatical errors have been fixed.

A clean solution is to simply insert the current version of the text as it appears in our treebanks (rather than the original UD 2.0 string) and assume that the editors of that material made wise decisions. It would make maintenance a lot easier, I think.

Maybe @gertjanvannoord has something to add?

The SpaceAfter=No issue is easily fixed.
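If nothing else, the capitalization alone could be patched with a one-liner along these lines (a sketch with illustrative file names; ud.ComplyWithText would handle it as well):

sed -i 's/SpaceAfter=no/SpaceAfter=No/g' nl-ud-*.conllu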

dan-zeman commented 6 years ago

> some sentences could not be located in the Alpino treebank. For those I preserved the UD 2.0 annotation and added a #WARNING. Alternative: remove those altogether.

I did not find this warning in the data. Has it possibly been rephrased as "WARNING no matching treebank file"? I do not know how difficult it is to check their annotation and make it compatible with the rest, but if the annotation has to be taken from UD 2.0, then removing those sentences might actually be the better solution, especially if it is just a few dozen sentences out of more than 13K.
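To gauge how many sentences are affected, something like the following should do (assuming the warning is stored as a comment line containing that string; file names are illustrative):

grep -c 'WARNING no matching treebank file' nl-ud-train.conllu nl-ud-dev.conllu nl-ud-test.conllu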

As for the tokenization mismatches, you can drop the text attribute from UD 2.0 and insert your own. The UD 2.0 text was generated from the FORM values of the individual tokens, detokenized with simple heuristics; I did not have the original detokenized text (any version of it) available. Being able to distinguish opening from closing quotes is actually an improvement in your data (although I guess I would prefer to render them as the actual Unicode code points for the quotes rather than the ASCII-encoded ,, and '').
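A crude sketch of that replacement (assuming the ASCII sequences ,, and '' occur only as quotation marks; a safer fix would touch only the FORM column and the text attribute, e.g. via a small udapi block):

sed -e 's/,,/„/g' -e "s/''/"/g" < nl-ud-dev.conllu > nl-ud-dev.quotes.conllu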