ahmetaa / zemberek-nlp

NLP tools for Turkish.

How to get better result with Morphological Disambiguator ? #193

Closed gokhanakgul closed 5 years ago

gokhanakgul commented 5 years ago

Hi,

I am still learning how to use zemberek-nlp. I ran some tests with the disambiguate method of the default Turkish morphology, and I ran the same tests with the ITU MorphDisambiguator service. Both results are below.

My question is: how can I improve the quality of the disambiguation output of the zemberek-nlp library? Could you provide some information about it?

Thanks

"Seksen yılı aşkın bir süredir kullanmakta olduğumuz Latin kaynaklı Türk yazısıyla kimi sorunlara karşın artık yazımın gelenekleştiğini söyleyebiliriz."

The zemberek-nlp result is:

[seksen:Num,Card] seksen:Num
[yıl:Noun,Time] yıl:Noun+A3sg+ı:Acc
[aşk:Noun] aşk:Noun+A3sg+ın:Gen
[bir:Det] bir:Det
[süre:Noun] süre:Noun+A3sg|Zero→Verb+Pres+A3sg+dir:Cop
[kullanmak:Verb] kullan:Verb|mak:Inf1→Noun+A3sg+ta:Loc
[olmak:Verb] ol:Verb|duğ:PastPart→Adj+umuz:P1pl
[Latin:Noun,Prop] latin:Noun+A3sg
[kaynak:Noun] kaynak:Noun+A3sg|lı:With→Adj
[Türk:Noun,Prop] türk:Noun+A3sg
[yazı:Noun] yazı:Noun+A3sg+sı:P3sg+yla:Ins
[kimi:Det] kimi:Det
[sorun:Noun] sorun:Noun+lar:A3pl+a:Dat
[karşın:Postp,PCDat] karşın:Postp
[artık:Adv] artık:Adv
[yazım:Noun] yazım:Noun+A3sg+ın:Gen
[gelenek:Noun] gelenek:Noun+A3sg|leş:Become→Verb|tiğ:PastPart→Noun+A3sg+in:P2sg+i:Acc
[söylemek:Verb] söyle:Verb|yebil:Able→Verb+ir:Aor+iz:A1pl
[.:Punc] .:Punc

Output of the ITU MorphDisambiguator (http://tools.nlp.itu.edu.tr/MorphDisambiguator):

Seksen seksen+Adj+Num
yılı yıl+Noun+A3sg+P3sg+Nom
aşkın aşkın+Adj
bir bir+Adj+Num
süredir süre+Noun+A3sg+Pnon+Nom^DB+Verb+Zero+Pres+A3sg+Cop
kullanmakta kullan+Verb+Pos+Prog2+A3sg
olduğumuz ol+Verb+Pos^DB+Adj+PastPart+P1pl
Latin latin+Noun+A3sg+Pnon+Nom
kaynaklı kaynaklı+Adj
Türk Türk+Noun+NAdj+A3sg+Pnon+Nom
yazısıyla yazı+Noun+A3sg+P3sg+Ins
kimi kim+Pron+Ques+A3sg+Pnon+Acc
sorunlara sorun+Noun+A3pl+Pnon+Dat
karşın karşın+Postp+PCDat
artık artık+Adverb
yazımın yaz+Noun+A3sg+P1sg+Gen
gelenekleştiğini gelenekleş+Verb+Pos^DB+Noun+PastPart+A3sg+P3sg+Acc
söyleyebiliriz söyle+Verb+Able+Pos+Aor+A1pl

mdakin commented 5 years ago

As far as I can see, the disambiguation is not really bad for this example. At first glance "aşkın" seems to be wrong for Zemberek and "kimi" is wrong for ITU; I am not sure about "kullanmakta". Ahmet can give more information about it.
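For anyone who wants to spot such differences without reading the two outputs by eye, here is a minimal Python sketch that lines up the lemmas chosen by each tool. It is not part of Zemberek or the ITU service: the function names are made up for illustration, the input is five tokens copied from the outputs above, and it assumes that both tools tokenize the sentence identically and that stripping -mak/-mek from Zemberek's verb entries is enough to make the lemmas comparable.

```python
import re

# Five tokens copied from the Zemberek output in this issue.
ZEMBEREK = (
    "[aşk:Noun] aşk:Noun+A3sg+ın:Gen "
    "[kimi:Det] kimi:Det "
    "[artık:Adv] artık:Adv "
    "[yazım:Noun] yazım:Noun+A3sg+ın:Gen "
    "[söylemek:Verb] söyle:Verb|yebil:Able→Verb+ir:Aor+iz:A1pl"
)

# The same five tokens from the ITU output (surface form, then analysis).
ITU = (
    "aşkın aşkın+Adj "
    "kimi kim+Pron+Ques+A3sg+Pnon+Acc "
    "artık artık+Adverb "
    "yazımın yaz+Noun+A3sg+P1sg+Gen "
    "söyleyebiliriz söyle+Verb+Able+Pos+Aor+A1pl"
)


def zemberek_lemmas(text):
    """Return (lemma, pos) pairs read from the bracketed dictionary items."""
    pattern = r"\[([^:\]]+):([^,\]]+)[^\]]*\]"
    return [(m.group(1), m.group(2)) for m in re.finditer(pattern, text)]


def itu_lemmas(text):
    """Return (surface, lemma) pairs; the lemma is the part before the first '+'."""
    parts = text.split()
    return [(surface, analysis.split("+")[0])
            for surface, analysis in zip(parts[0::2], parts[1::2])]


def normalize(lemma, pos):
    """Assumption: Zemberek lists verbs as infinitives, ITU as bare stems."""
    if pos == "Verb" and lemma.endswith(("mak", "mek")):
        return lemma[:-3]
    return lemma


def lemma_disagreements(zemberek_text, itu_text):
    """List (surface, zemberek_lemma, itu_lemma) where the two tools differ."""
    z_items = zemberek_lemmas(zemberek_text)
    i_items = itu_lemmas(itu_text)
    if len(z_items) != len(i_items):
        raise ValueError("token counts differ; outputs cannot be aligned")
    result = []
    for (z_lemma, z_pos), (surface, i_lemma) in zip(z_items, i_items):
        z_norm = normalize(z_lemma, z_pos)
        if z_norm != i_lemma:
            result.append((surface, z_norm, i_lemma))
    return result


disagreements = lemma_disagreements(ZEMBEREK, ITU)
for surface, z_lemma, i_lemma in disagreements:
    print(f"{surface}: zemberek={z_lemma} itu={i_lemma}")
```

On these five tokens the sketch flags "aşkın", "kimi" and "yazımın". A flagged token only means the tools disagree; deciding which analysis is correct still needs a human reader.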

ahmetaa commented 5 years ago

Improving the disambiguator takes time. It requires lots of annotated training sentences. Attacking highly ambiguous words is an idea we think is worth following, but this will also take time.

I have not prepared a document on how we train the disambiguator; I will try to write a short one, hopefully soon. We use the available data (from Haşim Sak's work) and combine it with data we generate with a primitive rule-based disambiguator. One problem we face is that most data available in the field uses Oflazer's analyzer, and Zemberek is not fully compatible with it, so we lose information when trying to convert it.

There is interesting work from Koç University (https://arxiv.org/pdf/1805.07946.pdf) that uses neural networks trained with unambiguous data. Once trained, the network can perform morphological analysis and disambiguation successfully without any rules. The paper says that more data will be published under the name TrMor2018, but it is not yet available. We could use that data to improve Zemberek's disambiguator, and perhaps later implement that neural system in Zemberek as an alternative to the rule-based morphology and perceptron disambiguator.

Thanks for your feedback; we would like to hear your opinions on the project.

ahmetaa commented 5 years ago

Also, as @mdakin said, the result is not bad for Zemberek. Unfortunately, there is no good way of evaluating this system against the ITU system, since the morpheme and POS data are not fully compatible.
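To illustrate why a direct comparison is unreliable, here is a rough Python sketch that maps the root POS labels from both outputs in this issue onto a shared coarse label set and counts agreements. The two mapping tables are hypothetical and were invented for this example; neither project defines them. The rows were read off the two outputs above, using the bracketed dictionary item for Zemberek and the first tag after the lemma for ITU, and the final punctuation token is left out because only Zemberek reports it.

```python
# Hypothetical coarse label mappings, invented for this illustration only.
ZEMBEREK_TO_COARSE = {
    "Noun": "NOUN", "Verb": "VERB", "Adv": "ADV",
    "Postp": "POSTP", "Det": "DET", "Num": "NUM",
}
ITU_TO_COARSE = {
    "Noun": "NOUN", "Verb": "VERB", "Adverb": "ADV",
    "Postp": "POSTP", "Adj": "ADJ", "Pron": "PRON",
}

# (surface form, Zemberek root POS, ITU root POS) from the outputs in this issue.
ROOT_POS = [
    ("Seksen", "Num", "Adj"),
    ("yılı", "Noun", "Noun"),
    ("aşkın", "Noun", "Adj"),
    ("bir", "Det", "Adj"),
    ("süredir", "Noun", "Noun"),
    ("kullanmakta", "Verb", "Verb"),
    ("olduğumuz", "Verb", "Verb"),
    ("Latin", "Noun", "Noun"),
    ("kaynaklı", "Noun", "Adj"),
    ("Türk", "Noun", "Noun"),
    ("yazısıyla", "Noun", "Noun"),
    ("kimi", "Det", "Pron"),
    ("sorunlara", "Noun", "Noun"),
    ("karşın", "Postp", "Postp"),
    ("artık", "Adv", "Adverb"),
    ("yazımın", "Noun", "Noun"),
    ("gelenekleştiğini", "Noun", "Verb"),
    ("söyleyebiliriz", "Verb", "Verb"),
]


def coarse_agreement(rows):
    """Return (matched, total, mismatched surfaces) under the coarse mapping."""
    mismatches = []
    for surface, z_pos, i_pos in rows:
        if ZEMBEREK_TO_COARSE[z_pos] != ITU_TO_COARSE[i_pos]:
            mismatches.append(surface)
    return len(rows) - len(mismatches), len(rows), mismatches


matched, total, mismatches = coarse_agreement(ROOT_POS)
print(f"{matched}/{total} root POS labels agree")
print("differ:", ", ".join(mismatches))
```

Several of the flagged words, such as "Seksen", "bir", "kaynaklı" and "gelenekleştiğini", differ only because the two systems follow different conventions (for example, whether a derived form is its own dictionary entry), not because either one made a disambiguation error. A raw agreement score like this one therefore understates how close the two systems are.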

gokhanakgul commented 5 years ago

Hi, I am still digging through the code and trying to understand how to use it. I have to say that your effort is very valuable in this field. The area is very broad and covers many detailed topics. I have great respect for your work.

But my curiosity is pushing me to ask more questions. If you could provide some documentation about usage and how to train the disambiguator, I would be very glad.

Best Regards

ahmetaa commented 5 years ago

I am closing this issue for now. Improving disambiguation will hopefully happen gradually.