Closed tknorr closed 6 years ago
I do not know NeuroCollective (and Google is not of much help here) but if you can fix errors in the data, it is welcome! Please submit the changes as a pull request against the dev branch, not the master branch.
Will do, give me about a week. I can indicate what I would consider tagging defects based on NC knowledge vs. treebank knowledge. I need to figure out how best to translate that list into edits in the treebank. The data cleanup should have been done before the text was processed into the treebank. I suspect there is more such data around, since Google's book scans are full of Fraktur OCR issues and TreeTagger has some mis-categorization issues that probably don't change. You see the same issues in the German N-Grams. For this specific issue you can see the NC as a translation memory database. It is a good independent opinion to set against any algorithm-based POS tagger. Any inconsistencies in the markup are a good indication that something is wrong in one or the other. I have to think about whether it is easier to apply changes (I guess that depends on the number of bugs) or to write code that simply retags the original text with the NC after correcting it.
I was not aware that TreeTagger had been used on the German treebank. I thought it was the result of manual annotation. It would probably be a good idea to run this by @slavpetrov.
It looks like TreeTagger was the original tagger. Some of the UD-Tags are different from the TreeTagger, so it is quite possible that it got reviewed.
I believe that UPOSTAGS, which come from the original annotation by Google, were assigned manually. I later applied TreeTagger to get the features, lemmas, and, as a side-effect, also the XPOSTAGS. Some rule-based heuristics were applied on top of that. But I think I did not touch UPOSTAGS. Definitely not in the TreeTagger part of things.
BTW, pretty much the same process has been applied to UD Spanish. Only the tagger was not TreeTagger there.
One thing I see pretty consistently is that adjectivized proper names are tagged PROPN in the UPOSTAGS but correctly tagged ADJA by TreeTagger. Example: 'Leopoldskron' is a place and deserves a PROPN; a 'Leopoldskroner' could be a noun in the right context, designating a person from 'Leopoldskron' (NOUN). The term 'Leopoldskroner' can also be an ADJ, as in 'Leopoldskroner Schloss', the castle being located in/belonging to/... Leopoldskron. I think that might be relevant for training SyntaxNet. A lot of non-German words have been tagged as PROPN. That is probably just noise and O.K., but for the training part I am thinking of selecting as little text with non-German words in it as possible. Then again, maybe it is good to have that noise in it.
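To get a list of such cases for review, one could scan the CoNLL-U files for tokens where the two tag columns disagree in this particular way. The following is only a rough sketch: it assumes the standard 10-column, tab-separated CoNLL-U layout (UPOS in column 4, XPOS in column 5), and the sample sentence, including its dependency labels, is made up for illustration and not taken from the treebank.

```python
def find_upos_xpos_conflicts(conllu_text, upos="PROPN", xpos="ADJA"):
    """Return (token_id, form) pairs whose UPOS and XPOS match the given pair."""
    conflicts = []
    for line in conllu_text.splitlines():
        # Skip blank sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        tok_id, form = cols[0], cols[1]
        # Skip multiword token ranges (e.g. 3-4) and empty nodes (e.g. 5.1).
        if "-" in tok_id or "." in tok_id:
            continue
        if cols[3] == upos and cols[4] == xpos:
            conflicts.append((tok_id, form))
    return conflicts


# Hypothetical sample, not copied from the treebank.
SAMPLE = "\n".join([
    "# text = Leopoldskroner Schloss",
    "\t".join(["1", "Leopoldskroner", "_", "PROPN", "ADJA", "_", "2", "amod", "_", "_"]),
    "\t".join(["2", "Schloss", "_", "NOUN", "NN", "_", "0", "root", "_", "_"]),
])

print(find_upos_xpos_conflicts(SAMPLE))
```

A hit is only a candidate for review, not proof of an error: either column could be the wrong one.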
If the non-German words are part of named entities (such as Google Maps or Piazza Italiana) then it seems OK to tag them PROPN (even if in a similar German named entity they would not get that tag). Otherwise I would tag them X.
On the other hand, I agree that in Leopoldskroner Schloss, Leopoldskroner should be ADJ. And even worse, in Best Western Hotel des Nordens, all words are tagged PROPN but only the first two really deserve it, and definitely the (genuinely German) article des must be DET. This sort of error appeared in several datasets from the Google UDT legacy. I think we have sort of fixed them in French and Spanish, and we should also fix them in German.
Actually... that is a proper noun. Best Western is the brand and Hotel des Nordens is the name of the hotel - which makes for a composite proper noun, doesn't it? See their website.
UD does not care about multi-word named entities. PROPN is not (should not be) used automatically whenever a word occurs inside of a multi-word named entity. Personally I think that foreign named entities are an exception because it is better to use PROPN than just X. But once it is German inside the German treebank, the normal part of speech has higher priority. Hotel is NOUN, des is DET, Nordens is NOUN.
One exception where I think our usual approach is to favor PROPN over NOUN is when the word is a personal name: in Herr Nord or Herr Hotel :-) I would tag the second word PROPN. But then the word itself becomes a name, it's not "just a part of multi-word named entity". And if it were Herr von Nord, von would be ADP as usual. Just my 2c.
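The most clear-cut part of the above (German function words such as des or von should keep their normal tag even inside a name) could be checked mechanically. This is a minimal sketch only: the word list is a small hypothetical sample rather than a real lexicon, and some of these forms can also be pronouns, so every hit is a candidate for manual review rather than an automatic fix.

```python
# Hypothetical, deliberately tiny closed-class list; a real check needs a fuller lexicon.
CLOSED_CLASS = {
    "der": "DET", "die": "DET", "das": "DET",
    "des": "DET", "dem": "DET", "den": "DET",
    "von": "ADP", "zu": "ADP", "in": "ADP", "an": "ADP", "auf": "ADP",
}


def suspicious_propn(tagged_tokens):
    """Return (form, current_tag, suggested_tag) for closed-class words tagged PROPN."""
    hits = []
    for form, tag in tagged_tokens:
        suggested = CLOSED_CLASS.get(form.lower())
        if tag == "PROPN" and suggested is not None:
            hits.append((form, tag, suggested))
    return hits


# The legacy tagging described in this thread: every word is PROPN.
legacy = [
    ("Best", "PROPN"), ("Western", "PROPN"),
    ("Hotel", "PROPN"), ("des", "PROPN"), ("Nordens", "PROPN"),
]

print(suspicious_propn(legacy))
```

Note that this only catches des. Deciding that Hotel and Nordens should be NOUN needs lexical knowledge that a closed-class list cannot provide.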
I agree with Dan in principle, but I think it has to be qualified by the use case. If you are looking for a syntactic parse, the individual components of the sentence need to be tagged in their roles. It should be possible to tag 'Hotel' as NOUN as well as PROPN, and a later named entity recognition step should eliminate the tag that does not apply. GATE has some JAPE rules for that, but I think a neural network should be able to learn that if both tags are provided but the training set has picked only one alternative.
The crucial point is that UD annotation is syntactic and does not include named entity annotation. Hence, PROPN should only be used for words whose primary use is to function as names, which in particular includes person names and (many) place names. For people who want to do joint syntactic and named entity analysis, the named entity annotation has to be added in a separate layer.
I hate to be such a spoiler but the next problem is the actual text. There are some spelling errors, which I figured do not really matter for a pure POS tagging application e.g. 'würklich' should be 'wirklich', 'Bröchten' should be 'Brötchen' since the context is a bakery. A bigger problem is mis-spelling in articles, which will inevitably lead to markup problems. E.g. ... besonders auf dem Kunden eingeht... is the wrong grammatical case, should be .. auf den Kunden... What is it that we are trying to achieve with this data? I think the best approach to fix this is to rebuild it from scratch. Spell-check/review the text, then re-tag it then maybe diff it to the original and review the differences.
If the "errors" occur in the original text, they should not be "fixed". A very important purpose of collecting and annotating naturalistic data is to be able to train annotators that are robust to spelling mistakes and grammatical errors. The point is that we as humans can easily interpret these texts despite the problems, and we want to be able to build machines that have the same capacity. If you specifically want to study these phenomena, it is possible to add additional annotation on top of the treebank, but we definitely don't want to alter the original texts before annotating them.
Ok, glad I asked.
If anyone is working with this, there are a lot of problems with the German deposit. Some are because of known tagging problems with the TreeTagger (mis-tagging), some are because the input text hyphenation was not removed before the tagging. Looks like it even has some OCR Fraktur mis-spellings. I can run the files against the NeuroCollective which can mark-up and correct some of the issues, then update the files. Check back in a week.