barzerman / barzer

barzer engine code
MIT License

beni cov problem #539

Closed nchepanov closed 11 years ago

nchepanov commented 11 years ago

talk to yanis for details

http://eu.barzer.net/~yanis/evtest/#6007
http://eu.barzer.net/~yanis/evtest/#RMC-M11
http://eu.barzer.net/~yanis/evtest/#Gorenje%20EC
http://eu.barzer.net/~yanis/evtest/#320
http://eu.barzer.net/~yanis/evtest/#M70
...

and many more short queries with cov=1 results

nchepanov commented 11 years ago

one more example: 630 720

nchepanov commented 11 years ago

also: beni is not even activated for the query 37. The query is probably too short, but it's a real one.

0xd34df00d commented 11 years ago

We have penalty code commented out in the repo, and I think that preferring shorter results to longer ones given the same coverage is exactly what's needed in this issue.

nchepanov commented 11 years ago

@pltr, according to the issue, results with the same cov must be sorted from short to long. Should I create an issue about that?

barzerman commented 11 years ago

what is the status of this?

pltr commented 11 years ago

well, the translator currently sorts beni results by (1 - cover, len(name)) on my side
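For readers following along, the sort key mentioned above can be sketched as follows. This is only an illustration of the `(1 - cover, len(name))` ordering, not the translator's actual code: the `BeniResult` type, the `sort_beni_results` helper, and the sample names and coverage values are all hypothetical.

```python
from typing import List, NamedTuple


class BeniResult(NamedTuple):
    """Hypothetical stand-in for a beni result: a name and its coverage."""
    name: str
    cover: float


def sort_beni_results(results: List[BeniResult]) -> List[BeniResult]:
    # Higher coverage first (smaller 1 - cover), then shorter names first.
    return sorted(results, key=lambda r: (1 - r.cover, len(r.name)))


# Made-up sample data, for illustration only.
results = [
    BeniResult("Rowenta CF9220 hair dryer", 1.0),
    BeniResult("Karcher SC 1020", 0.8),
    BeniResult("CF9220", 1.0),
]
ordered = sort_beni_results(results)
for r in ordered:
    print(r.cover, r.name)
```

With this key, two results at the same coverage are ordered from short to long, which is the tie-break requested earlier in the thread.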

nchepanov commented 11 years ago

well, in most cases this solves the problem, but in some cases it doesn't: http://eu.barzer.net/~yanis/evtest/#9220 http://eu.barzer.net/~yanis/evtest/#karcher%20sc

barzerman commented 11 years ago

this problem has nothing to do with sorting by length. the penalty should be on INCOMPLETE WORDS, not on length

0xd34df00d commented 11 years ago

I re-enabled side ngrams generation that was disabled looooooong ago, and...

A penalty on INCOMPLETE WORDS would kill perfectly legitimate queries. Thanks to query compaction, фен ровента (Russian for "Rowenta hair dryer") from the first link turns into фен rowenta cf9220, where 9220 is only part of a word rather than a whole word, so it would be penalized.

I think it's a dilemma: either we normalize cf 9220 to cf9220 and lose the whole-word penalty, or we don't normalize and can therefore tell whether words are truly complete.

Otherwise this fucking rowenta thing will get a 0.75 coverage instead of 1 for query 9220.
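A rough sketch of the dilemma, assuming a naive whitespace tokenizer. The two helper functions are hypothetical and are not barzer's matching code; they only show why the token 9220 counts as a whole word before compaction and as a word fragment after it.

```python
def is_whole_word_match(query_token: str, text: str) -> bool:
    """True if query_token equals one of the whitespace-separated words."""
    return query_token.lower() in text.lower().split()


def is_partial_word_match(query_token: str, text: str) -> bool:
    """True if query_token is a proper substring of some word in text."""
    q = query_token.lower()
    return any(q in w and q != w for w in text.lower().split())


# The compacted form quoted above, and an assumed un-compacted variant.
compacted = "фен rowenta cf9220"
uncompacted = "фен rowenta cf 9220"

for label, text in (("compacted", compacted), ("uncompacted", uncompacted)):
    print(label,
          "whole:", is_whole_word_match("9220", text),
          "partial:", is_partial_word_match("9220", text))
```

Under these assumptions, an incomplete-word penalty would fire on the compacted string even though 9220 is a legitimate query, while the un-compacted string keeps the word boundary needed to tell complete words from fragments.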

0xd34df00d commented 11 years ago

Is this still an issue? Please triage.