UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Validation script: Mismatch between the text attribute and the FORM field error. #413

Closed: sebschu closed this issue 7 years ago

sebschu commented 7 years ago

The validation script seems to check that the text attribute is equal to the concatenation of the FORM values (respecting the SpaceAfter attribute), but this leads to many validation errors in the English treebank. Our FORM values are based on the output of the PTB tokenizer, but I extracted the sentences from the original untokenized text (which @manning and I strongly believe is the right thing to do), which leads to some differences.

For example, we have "entree" in the form field but "entrée" in the original document, which leads to a mismatch and consequently to a validation error.

Can we update the script so that it won't fail on examples like that? A warning might still be useful but an error doesn't make sense here, in my opinion.
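For illustration, the check being discussed amounts to something like the following sketch (not the actual validation script code; the function names are made up):

```python
# Minimal sketch of the check under discussion: rebuild the sentence from the
# FORM column and the SpaceAfter=No flags, then compare it with the
# sentence-level "# text =" attribute.
def reconstruct_text(tokens):
    """tokens: list of (form, misc) pairs from one CoNLL-U sentence."""
    parts = []
    for form, misc in tokens:
        parts.append(form)
        if "SpaceAfter=No" not in misc.split("|"):
            parts.append(" ")
    return "".join(parts).rstrip()

def check_text_attribute(text_attr, tokens):
    rebuilt = reconstruct_text(tokens)
    if rebuilt != text_attr:
        # With "entrée" in # text but "entree" in FORM, this fires.
        print("Mismatch:\n  # text: %s\n  forms:  %s" % (text_attr, rebuilt))
```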

jnivre commented 7 years ago

Why does the tokenizer remove the accent? Does the PTB tokenizer produce plain ASCII? I can see how it makes sense in the monolingual context to retain compatibility with existing tools, but it is more doubtful in the multilingual context if we want to encourage people to come up with solutions that generalize well across languages.

martinpopel commented 7 years ago

Wouldn't it be better to fix the forms, i.e. change "entree" to "entrée"? This could be done automatically. I can help with this if needed.
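One way such an automatic fix could work, assuming the tokenization itself is unchanged and each FORM covers exactly as many characters of the original text as its own length (true for one-for-one substitutions like entree/entrée, not for two-character quote escapes), is a sketch like this:

```python
# Sketch of the kind of automatic fix suggested above: copy the characters of
# the "# text" value back into FORM, token by token.
# Assumptions: each FORM covers exactly len(form) characters of the text, and
# tokens are separated by a single space unless SpaceAfter=No is set.
def restore_forms(text, tokens):
    """tokens: list of dicts with 'form' and 'misc' keys; edited in place."""
    pos = 0
    for tok in tokens:
        span = text[pos:pos + len(tok["form"])]
        if span != tok["form"]:
            tok["form"] = span              # e.g. "entree" -> "entrée"
        pos += len(span)
        if "SpaceAfter=No" not in tok["misc"].split("|"):
            pos += 1                        # skip the separating space
```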

fcbr commented 7 years ago

We were also hit by this, but mostly due to SpaceAfter and multi-word token issues. In a few cases there was a case discrepancy.

There are a few cases where there are typos in the original text (which we want to preserve, of course), and our tentative decision is to preserve them in the form, but keep the correct lemma.

sebschu commented 7 years ago

Yes, the PTB tokenizer produces plain ASCII.

I'm a bit uncertain what to do about this. From a pure UD-perspective, it would of course be better to update the forms. But given how widespread the use of the PTB tokenizer (or variants of it such as our CoreNLP tokenizer) is, it also seems very problematic to change the tokenized output. And we would also lose the correspondence with the constituent trees by the LDC.

I'll investigate how big of an issue this actually is -- it might just be a handful of tokens anyways.

jnivre commented 7 years ago

In general, I think the approach of doing normalisation at the lemma level is the right one. For example, if you think that "entrée" and "entree" are alternative word forms of the same lemma, and if you think the normalised form should be "entree", you can put this into the lemma. As regards the PTB standard, we have already departed from it by using "(" instead of "-LRB-", and I honestly think that sticking to ASCII in 2017 would be a serious mistake. Do you really want to lose all the diacritics in names, for example?
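For concreteness, the lemma-level normalisation described here would look roughly like this in CoNLL-U (only one token row shown; the tags, head and relation are made-up placeholders, and only FORM and LEMMA are the point):

```
# text = The entrée was cold.
2	entrée	entree	NOUN	NN	Number=Sing	4	nsubj	_	_
```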

sebschu commented 7 years ago

After thinking more about this, I agree, we probably shouldn't cling to a standard from 1993, and should instead update the FORM values and potentially keep the standardized values as lemmata. In the short term, this will make the data useless for most English pipelines but hopefully in the long run, people (including us ;)) will adapt. I'll just wait for @manning to comment on this before going ahead; he has thought a lot about this issue and might have some additional things to say that I haven't considered.

Also, my example was not very representative as words with diacritics are super rare. However, non-ASCII punctuation marks such as some types of quotation marks or apostrophes appear very frequently in the corpus and they are the reason for most of the validation errors.
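The punctuation substitutions in question are the usual PTB-style ones, roughly the following (illustrative, not an exhaustive or official list):

```python
# Typical PTB-style ASCII substitutions behind these mismatches.
PTB_ASCII = {
    "\u201c": "``",   # left double quotation mark
    "\u201d": "''",   # right double quotation mark
    "\u2018": "`",    # left single quotation mark
    "\u2019": "'",    # right single quote / apostrophe
    "\u2013": "--",   # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # horizontal ellipsis
}
```

Note that several of these replacements change the token length, which is why a strictly one-for-one restoration like the sketch above would not cover every case.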

martinpopel commented 7 years ago

"this will make the data useless for most English pipelines"

but a bit more useful for all real-world applications (unless we consider PennTB science a real application).

manning commented 7 years ago

Hey, I'm good with updating the FORM values to what was in the original text, providing @sebschu is game to do the work (he indicated to me that he was)!

Long term, it does make sense to me to move to original, Unicode token FORMs.

The compatibility issues are non-trivial, though. When we started off with the Web Treebank, our decision was to keep the tokens and sentence breaks the same as in the LDC English Web Treebank constituency parse release, thinking that that level of compatibility was useful. And, as noted, the LDC still makes tokens ASCII, complete with "LaTeX-style" double quotes `` and '' in all of their English treebanks. This decision gave good compatibility not only with the constituency treebank but also with 25 years of English NLP tools, which expect that kind of tokenization. We'd now be losing that.

@martinpopel: I think your comment misses where the lossage occurs: At present, pretty much all English NLP tools (whether our own CoreNLP or others like NLTK, Spacy, Charniak parser, OpenNLP, DKPro Core, etc., etc.) contain a tokenizer that converts (more or less accurately) surface English text into the ASCII tokens that the LDC uses, so that you get tokens that work well with any of the tools trained on any of Penn Treebank, OntoNotes, Web Treebank, etc. Such tokens would now be wrong for tools trained on UD English and, if your pipeline assumes a single tokenization, it will be harder to combine tools from different sources (for instance, POS and dependencies from UD English and named entities from OntoNotes).

As @jnivre points out, we've already deviated from that by not escaping ( and ) but that's in a pretty trivial and easily reversible way. This would now be a more significant deviation, which couldn't be fixed with just two regex global substitutions.
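For contrast, the existing bracket deviation really is reversible with two global substitutions, roughly like this (a sketch; plain string replacement would do just as well):

```python
import re

# The bracket deviation mentioned above is trivially reversible in both
# directions with two global substitutions each.
def ptb_to_ud_brackets(text):
    return re.sub(r"-RRB-", ")", re.sub(r"-LRB-", "(", text))

def ud_to_ptb_brackets(text):
    return re.sub(r"\)", "-RRB-", re.sub(r"\(", "-LRB-", text))
```

Restoring the original quotes, dashes and accents, by contrast, needs access to the untokenized text.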

Nevertheless, long term this direction is good. People use emoji a lot now. 😉 It certainly raises the question: if we're not keeping the tokens the same, is there any good reason to keep the sentences the same? There are also several places where the LDC sentence splitting was clearly done wrong. Maybe we should fix those too at some point. But it probably won't happen by Feb 15.

sebschu commented 7 years ago

We now use Unicode characters in the FORM values and (with the exception of punctuation marks) also in the LEMMA values.