barzerman / barzer

barzer engine code
MIT License

beni wrong cov - 3D телевизор #514

Closed nchepanov closed 11 years ago

nchepanov commented 11 years ago

3D телевизор (3D television)

<beni>
    <entity id="26415" n="DVD BBK DVP752HD темно-серый" r="0" class="1000206" subclass="7" lev="27"
cov="0.2"/>
    <entity id="25106" n="3D очки для плазменных телевизоров GRUNDIG AS-3D-G" r="0" class="1000206" subclass="7" lev="56"
cov="0.9"/>
    <entity id="19189" n="3D очки для плазменных телевизоров PANASONIC TY-EW3D10E" r="0" class="1000206" subclass="7" lev="62"
cov="0.9"/>
    <entity id="290" n="Подставка для телевизора PANASONIC TYST50D2" r="0" class="1000206" subclass="7" lev="46"
cov="0.8"/>
    <entity id="11893" n="Комплект телескопических направляющих KUPPERSBERG Комплект телескопических направляющих" r="0" class="1000206" subclass="7" lev="0"
cov="0.3"/>
....
  </beni>

Two full matches from different ends of the phrase should not produce cov = 0.9.

nchepanov commented 11 years ago

meanwhile: ECMPCV40

<beni>
    <entity id="33683" n="Микрофон SONY ECM-PCV40" r="0" class="1000206" subclass="7" lev="15" 
cov="0.666667"/>
    <entity id="38137" n="Мультиварка SMILE MPC-1140" r="0" class="1000206" subclass="7" lev="17" 
cov="0.166667"/>
...

The disambiguator filters out BENI results with cov < 0.7; in this case the cov is too low.

0xd34df00d commented 11 years ago

By definition, coverage is how many ngrams from the source query are found in the result; this way we can have something that's normalized and sane. Meanwhile, the dash (in the first comment) kills a considerable number of ngrams from the query, hence the low result.

I think it's better to normalize the strings instead.
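For reference, a minimal sketch of that definition, assuming character trigrams and plain set membership (an assumption for illustration only; the actual BENI tokenization and weighting may differ). Under these assumptions the dash in ECM-PCV40 kills two of the six query trigrams, which lines up with the 0.666667 reported above.

#include <iostream>
#include <set>
#include <string>

// Build the set of character n-grams of a string (trigrams by default).
std::set<std::string> ngrams(const std::string& s, size_t n = 3) {
    std::set<std::string> out;
    for (size_t i = 0; i + n <= s.size(); ++i)
        out.insert(s.substr(i, n));
    return out;
}

// Coverage = fraction of the query's n-grams that also occur in the candidate.
double coverage(const std::string& query, const std::string& candidate) {
    const std::set<std::string> q = ngrams(query);
    if (q.empty()) return 0.0;
    const std::set<std::string> c = ngrams(candidate);
    size_t matched = 0;
    for (const std::string& g : q)
        if (c.count(g)) ++matched;
    return double(matched) / q.size();
}

int main() {
    // Query trigrams: ECM CMP MPC PCV CV4 V40; only 4 of 6 survive the dash.
    std::cout << coverage("ECMPCV40", "ECM-PCV40") << "\n"; // prints 0.666667
}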

barzerman commented 11 years ago

Gosha, strings should already be normalized, pls investigate ASAP

nchepanov commented 11 years ago

Is it possible to take the relative positions of ngrams into consideration? In your case, with query "a b c", cov(a x y z q u b o p r s t c) = 3 and cov(x y z q u a b c) = 3 as well.

I think this is wrong
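To make the concern concrete, a tiny sketch using word unigrams (an assumption chosen only to mirror the "a b c" example; real BENI ngrams are longer): both candidates get the same hit count of 3, even though only the second one has the query words adjacent and in order.

#include <iostream>
#include <set>
#include <sstream>
#include <string>

// Count how many distinct query words occur anywhere in the text,
// regardless of their positions.
int hits(const std::string& query, const std::string& text) {
    std::set<std::string> words;
    std::istringstream ts(text);
    for (std::string w; ts >> w; ) words.insert(w);
    int n = 0;
    std::istringstream qs(query);
    for (std::string w; qs >> w; ) n += words.count(w);
    return n;
}

int main() {
    std::cout << hits("a b c", "a x y z q u b o p r s t c") << "\n"; // 3
    std::cout << hits("a b c", "x y z q u a b c") << "\n";           // 3
}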

0xd34df00d commented 11 years ago

This is covered by longer ngrams, but with longer ngrams we lose more coverage in the case of typos (like in the example with the dash).

Positions by themselves make ngrams unstable to, well, relative positions and shifts.

0xd34df00d commented 11 years ago

It turned out longer ngrams suck as well, effectively lowering the search quality.

Your example is perfectly valid, but imagine you search for abcde in a string like " ab zz bc zz cd zz abc zz de " (where ngrams occur several times, and relative positions may vary depending on the subset you take). In this case it's a combinatorial task that can't be solved in reasonable time without some simplifying assumptions that will surely be broken in some queries.

barzerman commented 11 years ago

We need to depress coverage for cases when matching ngrams have "holes" between them. In other words, if the input sequence is abcd, then a xxxx b xxxx c xxxx d should produce a significantly smaller coverage than abcd.
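One possible shape of such a penalty, sketched with token-level greedy in-order matching (this is not the existing barzer code, just an illustration): scale raw coverage by how densely the matched tokens sit inside the span they occupy in the candidate, so the total length of the holes directly depresses the score.

#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Split a string into whitespace-separated tokens.
std::vector<std::string> tokens(const std::string& s) {
    std::vector<std::string> out;
    std::istringstream in(s);
    for (std::string w; in >> w; ) out.push_back(w);
    return out;
}

// Raw coverage (matched query tokens / query length) scaled by match density:
// matched tokens divided by the span they occupy in the candidate, so every
// "hole" between matches drags the score down.
double coverageWithHolePenalty(const std::string& query, const std::string& cand) {
    const std::vector<std::string> q = tokens(query), c = tokens(cand);
    if (q.empty()) return 0.0;
    std::vector<size_t> pos;   // positions of matched query tokens in the candidate
    size_t from = 0;
    for (const std::string& w : q)
        for (size_t i = from; i < c.size(); ++i)
            if (c[i] == w) { pos.push_back(i); from = i + 1; break; }
    if (pos.empty()) return 0.0;
    const double raw     = double(pos.size()) / q.size();
    const double span    = double(pos.back() - pos.front() + 1);
    const double density = pos.size() / span;  // 1.0 when there are no holes
    return raw * density;
}

int main() {
    std::cout << coverageWithHolePenalty("a b c d", "a b c d") << "\n";                // 1
    std::cout << coverageWithHolePenalty("a b c d", "a xxxx b xxxx c xxxx d") << "\n"; // ~0.571429
}

With these assumptions abcd keeps coverage 1.0 against itself, while a xxxx b xxxx c xxxx d drops to roughly 0.57; the density factor is one concrete way of tying the score to the relative length of the holes.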

0xd34df00d commented 11 years ago

That's what I'm talking about — it's easy to determine if there are holes, but it's hard to determine if holes are significant.

barzerman commented 11 years ago

I think what's important is the relative length of all holes to the length of the searched sequence: if I'm searching for a 10-token sequence and there are many holes, the result is shitty.

barzerman commented 11 years ago

I'm not sure this is a serious bug, actually. I need an example of BENI results being preferred due to abnormally high coverage.

0xd34df00d commented 11 years ago

We have this penalty already.