ahmetaa / zemberek-nlp

NLP tools for Turkish.

Zemberek creates duplicate WordAnalysis results #81

Open ssaltin opened 7 years ago

ssaltin commented 7 years ago

For the input "yoksa", Zemberek generates 6 WordAnalysis results, 2 of which are duplicates (result 2 repeats result 1, and result 5 repeats result 4):

0 = {WordAnalysis@6912} "[(yoksa:yoksa) (Conj)]"
1 = {WordAnalysis@6913} "[(yok:yok) (Adj)(Verb;Cond:sa+A3sg)]"
2 = {WordAnalysis@6914} "[(yok:yok) (Adj)(Verb;Cond:sa+A3sg)]"
3 = {WordAnalysis@6915} "[(yok:yok) (Noun;A3sg+Pnon+Nom)(Verb;Cond:sa+A3sg)]"
4 = {WordAnalysis@6916} "[(yok:yok) (Adj)(Noun;A3sg+Pnon+Nom)(Verb;Cond:sa+A3sg)]"
5 = {WordAnalysis@6917} "[(yok:yok) (Adj)(Noun;A3sg+Pnon+Nom)(Verb;Cond:sa+A3sg)]"


ssaltin commented 7 years ago

As I now realize, their dictionary items are different:

yok [P:Adj; A:NoVoicing]
yok [P:Adj; A:Voicing]

But still, don't they produce the same morphological result?

ahmetaa commented 7 years ago

Thanks, I am aware of this problem. It should hopefully be fixed in the next version.

mdakin commented 6 years ago

0.12 still creates duplicate results. I did not test with 0.13.

Input: yoksa
yoksa
[yoksa:Conj] yoksa:Conj
[yoksamak:Verb] yoksa:Verb+Imp+A2sg
[yok:Adj] yok:Adj|Zero→Verb+sa:Cond+A3sg
[yok:Adj] yok:Adj|Zero→Verb+sa:Cond+A3sg
[Yok:Noun,Prop] yok:Noun+A3sg|Zero→Verb+sa:Cond+A3sg
[yok:Noun] yok:Noun+A3sg|Zero→Verb+sa:Cond+A3sg
Disambiguation result: [yoksa:Conj] yoksa:Conj

ahmetaa commented 6 years ago

0.13.0 also produces duplicate results for this. Because the voicing attribute is optional for "yok", two stem transitions are created for "yok" when the graph is constructed. For inputs like "yoktan" or "yok", the paths passing through both stem transitions terminate successfully, so the same analysis is produced twice.

One possible solution for such words is to use a reference attribute. For example:

yok [P:Adj; A:Voicing]
yok [P:Adj; A:NoVoicing, Ref:yok_Adj] ---> pointing to the first one

After analysis, if the morphemes are equal and the results contain both the referencing item and the item it references, one of the two can be deleted. This can be done as a post-processing step.
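
A rough sketch of how such a post-processing step might look is given below. It does not use Zemberek's actual classes: `Analysis`, its fields, the item id `yok_Adj_NoVoicing`, and the simplified morpheme strings are all hypothetical stand-ins chosen for illustration. The idea is only that a result is dropped when its dictionary item carries a `Ref` and the referenced item already yields a result with the same morphemes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RefDedupSketch {

    // Hypothetical stand-in for one analysis result; NOT Zemberek's WordAnalysis.
    static final class Analysis {
        final String itemId;    // id of the dictionary item, e.g. "yok_Adj"
        final String refId;     // id named by the item's Ref attribute, or null
        final String morphemes; // morpheme sequence rendered as a string

        Analysis(String itemId, String refId, String morphemes) {
            this.itemId = itemId;
            this.refId = refId;
            this.morphemes = morphemes;
        }
    }

    // Drops every result whose item references another item that is also
    // present in the results with exactly the same morphemes.
    static List<Analysis> removeReferencedDuplicates(List<Analysis> results) {
        // Index all (item id, morphemes) pairs that occur in the results.
        Set<List<String>> seen = new HashSet<>();
        for (Analysis a : results) {
            seen.add(Arrays.asList(a.itemId, a.morphemes));
        }
        List<Analysis> kept = new ArrayList<>();
        for (Analysis a : results) {
            boolean duplicate = a.refId != null
                    && seen.contains(Arrays.asList(a.refId, a.morphemes));
            if (!duplicate) {
                kept.add(a);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Simplified morpheme strings, loosely modeled on the "yoksa" output above.
        String adjCond = "yok:Adj|Verb+sa:Cond+A3sg";
        List<Analysis> results = Arrays.asList(
                new Analysis("yoksa_Conj", null, "yoksa:Conj"),
                new Analysis("yok_Adj", null, adjCond),
                new Analysis("yok_Adj_NoVoicing", "yok_Adj", adjCond), // hypothetical id
                new Analysis("yok_Noun", null, "yok:Noun+A3sg|Verb+sa:Cond+A3sg"));

        for (Analysis a : removeReferencedDuplicates(results)) {
            System.out.println("[" + a.itemId + "] " + a.morphemes);
        }
    }
}
```

In this sketch the referencing item is the one that gets removed, so the item without a `Ref` always survives. A referencing result is kept when the referenced item is absent from the results or produces different morphemes, which matches the condition that both items exist and their morphemes are equal.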