barzerman / barzer

barzer engine code
MIT License

beni odd coverage #647

Closed bodritto closed 10 years ago

bodritto commented 10 years ago

http://eu.barzer.net/translate?&key=BHjFDiC0QdoyDF7DBVn1rLWu0LaKRi8QeKiVSSSW&query=%D0%BA%D1%80%D0%B0%D1%81%D0%BD%D1%8B%D0%B9%20%D1%87%D0%B0%D0%B9%D0%BD%D0%B8%D0%BA

data: eu.barzer.net, user 1000106 (ev_all)

0xd34df00d commented 10 years ago

What exactly is odd?

bodritto commented 10 years ago

Радиоприемник SUPRA PAS-6277 красный/черный

query = "красный чайник"

cov = 0.715624

0xd34df00d commented 10 years ago

Around 11 of 14 ngrams match, giving the score pretty close to the observed.

Ный gram matches twice :(

bodritto commented 10 years ago

Why doesn't this one, "Радиоприемник HYUNDAI H-1549 черный/красный", have the same coverage?

0xd34df00d commented 10 years ago

Good question. I'm already on it.

0xd34df00d commented 10 years ago

The following 14 features are generated (never mind the numbers, they are feature IDs): кра:37 рас:38 асн:39 сны:40 ный:33 ый:34 й ч:59 ча:60 чай:61 айн:62 йни:63 ник:11 кр:36 ик:12

The seventh feature, й ч, is only generated for the SUPRA one, because красный/черный → красный черный, which has the й ч gram. черный/красный, on the other hand, doesn't generate it.

In other words it's expected behavior.
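The separator effect described above can be reproduced with a minimal sketch. It assumes, based only on this thread, that "/" is replaced by a space, that text is lowercased, and that the phrase is padded with one space on each side before character 3-grams are taken. `char_trigrams` is a hypothetical helper, not barzer's actual code.

```python
def char_trigrams(text: str) -> set:
    """Character 3-grams of a phrase (hypothetical normalization, see lead-in)."""
    # Assumption: "/" acts as a separator and becomes a space.
    normalized = " ".join(text.lower().replace("/", " ").split())
    # Assumption: one space of padding marks the start and end of the phrase.
    padded = f" {normalized} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


supra = char_trigrams("красный/черный")
hyundai = char_trigrams("черный/красный")

print("й ч" in supra, "й ч" in hyundai)  # True False
print(sorted(supra - hyundai))           # ['й ч']
```

Under these assumptions the two colour strings differ in exactly one gram each way: красный/черный has й ч, черный/красный has й к.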

barzerman commented 10 years ago

This can't be the expected behavior: красный чайник != красный/черный, and the ngram overlap is nowhere near 71%.

0xd34df00d commented 10 years ago

How did you compute it as nowhere near 71%?

Also, there is a single ngram matching there; you can't avoid it without losing / as a separator.

barzerman commented 10 years ago

красный чайник != красный черный either. You shouldn't be double-counting "ный", obviously.

0xd34df00d commented 10 years ago

It isn't double-counted.

0xd34df00d commented 10 years ago

Also, I don't argue they're equal.

barzerman commented 10 years ago

красный чайник: кра,рас,асн,ный,ый ,й ч,чай,айн,йни,ник,ик ; 11 3-grams

черный/красный: чер,ерн,рны,ный,ый ,й к,кра,рас,асн,ный,ый ; 11 3-grams

overlapping 5: кра,рас,асн,ный,ый

so 5 ngrams out of 11 overlap and the score is 71% - how is that expected behavior? the expectations must be wrong
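The hand count above can be checked mechanically. This sketch takes the two gram lists exactly as written in the comment (it does not regenerate them, so any grams missing from the lists stay missing) and computes the share of query grams that also occur in the name. The variable names are illustrative only.

```python
# Gram lists copied verbatim from the comment; trailing spaces are significant.
query_grams = ["кра", "рас", "асн", "ный", "ый ", "й ч",
               "чай", "айн", "йни", "ник", "ик "]
name_grams = ["чер", "ерн", "рны", "ный", "ый ", "й к",
              "кра", "рас", "асн", "ный", "ый "]

name_set = set(name_grams)
shared = [gram for gram in query_grams if gram in name_set]
ratio = len(shared) / len(query_grams)

print(len(shared), len(query_grams))  # 5 11
print(round(ratio, 3))                # 0.455
```

With these lists the overlap is 5 of 11, about 45%, which is the figure being contrasted with the reported 71%.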

0xd34df00d commented 10 years ago

Please reread the source result. Also, you've missed a lot of ngrams. (Sent by email on 19.11.2013 23:15, in reply to barzerman's comment above.)

barzerman commented 10 years ago

we have an obviously shitty result in this specific case. even if it checks with the current algorithm it only means that the current algorithm is shitty in this specific case and needs to be changed so that this shitty case isn't shitty anymore. it is a real problem which needs to be solved

0xd34df00d commented 10 years ago

Then we'll probably break a lot of other cases, and we don't have a proper way of testing this (and we can hardly have one, since it's a purely expert-based thing).

I'm strongly against fitting our algorithm to such corner cases.

barzerman commented 10 years ago

@pltr please look ASAP at the translator

0xd34df00d commented 10 years ago

Closing in favor of #648.

bodritto commented 10 years ago

The query "красный чайник" (red kettle) does not find "Чайник Lamark LK-1006 красный" (a red kettle), but it does find "Радиоприемник SUPRA PAS-6277 красный/черный" (a red/black radio).

красный чайник -> a radio: http://eu.barzer.net/bjson?&key=BHjFDiC0QdoyDF7DBVn1rLWu0LaKRi8QeKiVSSSW&query=%D0%BA%D1%80%D0%B0%D1%81%D0%BD%D1%8B%D0%B9%20%D1%87%D0%B0%D0%B9%D0%BD%D0%B8%D0%BA

красТный чайник (with an extra Т) -> red kettles: http://eu.barzer.net/translate?&key=BHjFDiC0QdoyDF7DBVn1rLWu0LaKRi8QeKiVSSSW&query=%D0%BA%D1%80%D0%B0%D1%81%D1%82%D0%BD%D1%8B%D0%B9%20%D1%87%D0%B0%D0%B9%D0%BD%D0%B8%D0%BA

barzerman commented 10 years ago

Your list of ngrams has an obvious problem: for красный чайник you have кра рас ... ча чай. You treat a space differently from the start of the phrase. @0xd34df00d

0xd34df00d commented 10 years ago

After talking with @inggris I finally got the issue. I agree some shit is going on.

barzerman commented 10 years ago

Problem case: for the query красный чайник, id 44404, name "Чайник Lamark LK-1006 красный", is NOT FOUND, but id 63595, name "Радиоприемник SUPRA PAS-6277 красный/черный", is found.

a) Two questions need answering:
1) What is the coverage of "Чайник Lamark LK-1006 красный"?
2) Why doesn't it make it into the results at all, and why is its coverage so much lower than that of "Радиоприемник SUPRA PAS-6277 красный/черный"?

b) Fix it.
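As a rough way to approach question 1, here is a sketch that estimates coverage as the fraction of the query's character 3-grams that also occur in the item name. This is not barzer's implementation: the normalization (lowercasing, "/" turned into a space, one space of padding on each side) and the names `char_trigrams` and `coverage` are assumptions made for illustration.

```python
def char_trigrams(text: str) -> set:
    """Character 3-grams under the assumed normalization (see lead-in)."""
    normalized = " ".join(text.lower().replace("/", " ").split())
    padded = f" {normalized} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def coverage(query: str, name: str) -> float:
    """Share of the query's 3-grams that are present in the name."""
    query_grams = char_trigrams(query)
    return len(query_grams & char_trigrams(name)) / len(query_grams)


QUERY = "красный чайник"
NAMES = [
    "Чайник Lamark LK-1006 красный",
    "Радиоприемник SUPRA PAS-6277 красный/черный",
    "Радиоприемник HYUNDAI H-1549 черный/красный",
]

scores = {name: coverage(QUERY, name) for name in NAMES}
for name, score in scores.items():
    print(f"{score:.3f}  {name}")
# 0.929  Чайник Lamark LK-1006 красный
# 0.714  Радиоприемник SUPRA PAS-6277 красный/черный
# 0.643  Радиоприемник HYUNDAI H-1549 черный/красный
```

Under these assumptions the SUPRA item scores 10 of 14, close to the reported cov = 0.715624, while the kettle scores 13 of 14. If the real feature generation is similar, 3-gram overlap alone would not explain why the kettle is missing from the results.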

barzerman commented 10 years ago

fix deployed to venik and eu machines. fixed