ahmetaa / zemberek-nlp

NLP tools for Turkish.

Weird case of disambiguation: kadarıyla #21

Closed faraday closed 8 years ago

faraday commented 10 years ago

This might point to a more general error. Submit these queries to the disambiguator in this order.

Query: kadarıyla

Sentence  = kadarıyla
Before disambiguation.
Word = kadarıyla
[(kadar:kadar) (Postp)(Noun;A3sg+P3sg:ı+Inst:yla)]

After disambiguation.
Word = kadarıyla
[(kadar:kadar) (Postp)(Noun;A3sg+P3sg:ı+Inst:yla)]

Query: KADARIYLA

Sentence  = KADARIYLA
Before disambiguation.
Word = KADARIYLA
[(kadar:kadar) (Postp)(Noun;A3sg+P3sg:ı+Inst:yla)]
[(Kadarıyla:kadarıyla) (Noun,Prop;A3sg+Pnon+Nom)]

After disambiguation.
Word = KADARIYLA
[(Kadarıyla:kadarıyla) (Noun,Prop;A3sg+Pnon+Nom)]
[(kadar:kadar) (Postp)(Noun;A3sg+P3sg:ı+Inst:yla)]

Query (again): kadarıyla

Sentence  = kadarıyla
Before disambiguation.
Word = kadarıyla
[(Kadarıyla:kadarıyla) (Noun,Prop;A3sg+Pnon+Nom)]
[(kadar:kadar) (Postp)(Noun;A3sg+P3sg:ı+Inst:yla)]

After disambiguation.
Word = kadarıyla
[(Kadarıyla:kadarıyla) (Noun,Prop;A3sg+Pnon+Nom)]
[(kadar:kadar) (Postp)(Noun;A3sg+P3sg:ı+Inst:yla)]

Properly sanitizing the input is a workaround for this case. I didn't include full sentences in this bug report since the response stays the same for this word, demonstrating this behavior.
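One way to read "sanitizing the input" is to lowercase each query with Turkish casing rules before it reaches the disambiguator. The sketch below is only an illustration of that idea: `turkish_lower` and `sanitize_query` are hypothetical helpers, not part of zemberek-nlp, and lowercasing everything also throws away capitalization that a genuine proper noun would carry.

```python
def turkish_lower(text: str) -> str:
    """Lowercase text using Turkish rules for dotted and dotless i."""
    # Python's str.lower() maps 'I' to 'i' and 'İ' to 'i' plus a combining
    # dot, so both Turkish letters are replaced explicitly first.
    return text.replace("I", "ı").replace("İ", "i").lower()


def sanitize_query(text: str) -> str:
    """Hypothetical pre-processing step: trim whitespace, then lowercase."""
    return turkish_lower(text.strip())


print(sanitize_query("KADARIYLA"))
```

With this step in front, the all-caps query from the report reaches the disambiguator as `kadarıyla`, the same form as the first query.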

mdakin commented 10 years ago

[(Kadarıyla:kadarıyla) (Noun,Prop;A3sg+Pnon+Nom)] This looks wrong anyway.

AFAIK the current disambiguators were made as experiments and they are not trained very well (because finding good training data is very difficult). Ahmet is working on disambiguation, but I am not aware of the details yet.

Thanks for the report.

faraday commented 10 years ago

I think that parse is generated by a pragmatic assumption. In regular Turkish text, any word can be converted to a proper noun by capitalizing the first letter (or the whole word).

Another example supports that this is an intended effect:

Word = Xmfkoad
[(Xmfkoad:xmfkoad) (Noun,Prop;A3sg+Pnon+Nom)]

Word = Maximus
[(Maximus:maximus) (Noun,Prop;A3sg+Pnon+Nom)]

I don't think a Turkish reader would discard this meaning so easily in a sentence. The ability to discard it comes directly from knowledge of other probable contexts. Even then, one could argue that this is the name of an imaginary character from a Turkish book.

The effects of this pragmatic strategy could be further demonstrated with the examples below:

Word = şemseddin
[(UNK:şemseddin) (Unk,Unk;Unkown)]
Word = şemseddin'e
[(Şemseddin:şemseddin) (Noun,Prop;A3sg+Pnon+Dat:e)]

Word = hhhh'de
[(Hhhh:hhhh) (Noun,Prop;A3sg+Pnon+Loc:de)]

The first example looks okay. Since şemseddin is not a known word, the parser can use the apostrophe and the suffix -e to deduce that it is a proper noun.

The second example demonstrates where this strategy seems to fail.

A more fundamental question is: As a reader, would we not accept Hhh or HHH as an imaginary name in a Turkish book?
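The apostrophe-based deduction discussed in this comment can be sketched roughly as follows. This is not how zemberek-nlp implements it: `CASE_SUFFIXES` is a hypothetical two-entry table taken only from the examples in this comment, `guess_proper_noun` is an invented name, and the sketch assumes lowercase input. It only shows why `hhhh'de` gets the same treatment as `şemseddin'e`: the rule looks at the apostrophe and the suffix, never at the stem.

```python
from typing import Optional

# Hypothetical, deliberately tiny suffix table (assumption, not zemberek data).
CASE_SUFFIXES = {"e": "Dat", "de": "Loc"}


def turkish_capitalize(word: str) -> str:
    """Capitalize the first letter, mapping 'i' to dotted capital 'İ'."""
    if not word:
        return word
    first = "İ" if word[0] == "i" else word[0].upper()
    return first + word[1:]


def guess_proper_noun(token: str) -> Optional[str]:
    """Return a proper-noun reading if the token looks like stem'suffix."""
    stem, apostrophe, suffix = token.partition("'")
    if not apostrophe or not stem or suffix not in CASE_SUFFIXES:
        return None  # no apostrophe or unknown suffix: make no guess
    case = CASE_SUFFIXES[suffix]
    lemma = turkish_capitalize(stem)
    return f"[({lemma}:{stem}) (Noun,Prop;A3sg+Pnon+{case}:{suffix})]"


print(guess_proper_noun("şemseddin'e"))
print(guess_proper_noun("hhhh'de"))
print(guess_proper_noun("şemseddin"))
```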

mdakin commented 10 years ago

Well, IMHO, we want to be pragmatic and reduce morphological parse candidates, so we can actually eliminate the proper noun parse for words specifically marked as non-proper nouns in the dictionary. So kadarıyla should not have been parsed as a proper noun (because we know it definitely is not?). For others, like Maximus, Şemseddin or HHHH, I agree, a proper noun parse is acceptable.
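A minimal sketch of that pruning rule, under stated assumptions: `Parse`, `NON_PROPER_ROOTS` and `prune_proper_guesses` are hypothetical names made up for illustration, and the set stands in for whatever marking the real dictionary would carry.

```python
from __future__ import annotations

from typing import NamedTuple


class Parse(NamedTuple):
    root: str
    is_proper_noun: bool


# Hypothetical roots that the dictionary marks as definitely not proper nouns.
NON_PROPER_ROOTS = {"kadar"}


def prune_proper_guesses(candidates: list[Parse]) -> list[Parse]:
    """Drop proper-noun readings when a known non-proper root explains the word."""
    has_known_reading = any(
        not c.is_proper_noun and c.root in NON_PROPER_ROOTS for c in candidates
    )
    if not has_known_reading:
        # Nothing in the dictionary explains the word (e.g. Maximus),
        # so the proper-noun guess is kept.
        return list(candidates)
    return [c for c in candidates if not c.is_proper_noun]


kadariyla = [Parse("Kadarıyla", True), Parse("kadar", False)]
print(prune_proper_guesses(kadariyla))
print(prune_proper_guesses([Parse("Maximus", True)]))
```

The rule only fires when a marked root is among the candidates, so unknown words keep their proper noun parse.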

But maybe Ahmet has more comments. I am more of a kibitzer and probably full of crap as well, just observing the project and, shamefully, not doing real work on it for a very long time. Unfortunately Ahmet does not have GitHub or Gmail access from his "super secure awesome" workplace.

ahmetaa commented 10 years ago

@faraday I don't think there is a clear solution to this problem. In general, I am in line with @mdakin on this. I think this behavior is a fairly good compromise for the sake of practical usage.

Without contextual analysis, it is not possible to determine if a word like "Elllma" is actually a proper noun or a word with a typo. A Twitter normalization application may prefer to convert it to the word "Elma", but that may not be the desired action.

So in the end, I am open to suggestions. But there are several things that can be done to improve the current behavior:

1- Currently the dictionary contains a limited number of proper nouns, and those proper nouns are parsed even if they are written in lowercase and without an apostrophe, such as:

ankara
[(Ankara:ankara) (Noun,Prop;A3sg+Pnon+Nom)]
ankaraya
[(Ankara:ankara) (Noun,Prop;A3sg+Pnon+Dat:ya)]

We can add more proper nouns (like Turkish person names) so that it accepts words like şemseddin.

2- Furthermore, if the parser tries to infer the type of words like "Maximus", the user should know that this does not come from the dictionary. Then the user can ignore the deduced type and process the word however they like.
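The second point could look something like the sketch below, where every analysis carries a flag saying whether it was inferred. The `Analysis` class, its `inferred` field and `dictionary_only` are assumptions for illustration only and do not describe the actual zemberek-nlp API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Analysis:
    lemma: str
    tags: str
    inferred: bool  # True when the reading was guessed, not found in the dictionary


def dictionary_only(analyses: List[Analysis]) -> List[Analysis]:
    """Keep only readings that come from the dictionary."""
    return [a for a in analyses if not a.inferred]


results = [
    Analysis("Ankara", "Noun,Prop;A3sg+Pnon+Nom", inferred=False),
    Analysis("Maximus", "Noun,Prop;A3sg+Pnon+Nom", inferred=True),
]
print(dictionary_only(results))
```

A caller who trusts the guess can use the full list; a caller who does not can filter on the flag without the parser having to decide for them.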

I will open an issue about the second improvement and leave this issue open, because KADARIYLA or Kadarıyla should have been disambiguated correctly.

faraday commented 10 years ago

@ahmetaa @mdakin I also think this pragmatic approach offers a good compromise. This proper noun assumption has the potential to become hugely valuable with alternative disambiguator implementations.

As you pointed out in the source code, the following passage is also relevant to the Z3MarkovModelDisambiguator. Source: Hakkani-Tür, Dilek Z., Kemal Oflazer, and Gökhan Tür. "Statistical morphological disambiguation for agglutinative languages." Computers and the Humanities 36.4 (2002): 381-410.

If we consider just the morphological features and ignore any (lexical) semantic features (e.g., the proper noun marking) that we mark in morphology, the accuracy increases a bit further. These stem from two properties of Turkish: Most Turkish root words also have a proper noun reading, when written with the first letter capitalized. We count it as an error if the tagger does not get the correct proper noun marking, for a proper noun. But this is usually impossible especially at the beginning of sentences where the tagger can not exploit capitalization and has to back-off to a lower-order model. In almost all of such cases, all syntactically relevant morphosyntactic features except the proper noun marking are actually correct.

The article does count this parsing result as an error. However, providing this set to the disambiguator:

(Kadarıyla:kadarıyla) (Noun,Prop;A3sg+Pnon+Nom)
(kadar:kadar) (Postp)(Noun;A3sg+P3sg:ı+Inst:yla)

should be completely fine for the lower levels.

You're completely right about (2). Users should decide whether to throw out alternative interpretations when they receive results that do not come from the dictionary.

I'm inclined to think this will become just a documentation issue in further releases.