nchepanov closed this issue 11 years ago
meanwhile: ECMPCV40
<beni>
<entity id="33683" n="Микрофон SONY ECM-PCV40" r="0" class="1000206" subclass="7" lev="15"
cov="0.666667"/>
<entity id="38137" n="Мультиварка SMILE MPC-1140" r="0" class="1000206" subclass="7" lev="17"
cov="0.166667"/>
...
The disambiguator filters out BENI results with cov < 0.7; in this case cov is too low.
By definition, coverage is the fraction of ngrams from the source query that are found in the result; this way we get something normalized and sane. Meanwhile the dash (in the first comment) kills a considerable number of ngrams from the query, hence the low score.
I think it's better to normalize the strings instead.
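To make the failure mode concrete, here is a minimal sketch of ngram coverage as defined above, assuming character trigrams and case-insensitive matching (the actual BENI implementation may differ in ngram size and tokenization):

```python
# Minimal sketch: coverage = fraction of query ngrams found in the result.
# Character trigrams are an assumption for illustration.

def ngrams(s, n=3):
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def coverage(query, result, n=3):
    q = ngrams(query, n)
    if not q:
        return 0.0
    return len(q & ngrams(result, n)) / len(q)

# The dash in the query produces trigrams like "m-p" and "-pc" that never
# occur in the stored name, so coverage drops even for a correct match:
# only 4 of 7 query trigrams survive, about 0.57, below a 0.7 cutoff.
print(coverage("ECM-PCV40", "ECM PCV40"))
```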
Gosha, the strings should already be normalized. Please investigate ASAP.
On Mar 26, 2013, at 6:42 AM, Georg Rudoy notifications@github.com wrote:
Is it possible to take the relative positions of ngrams into consideration? In your case, with query "a b c", cov(a x y z q u b o p r s t c) = 3 and cov(x y z q u a b c) = 3 as well.
I think this is wrong.
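The position blindness described here is easy to reproduce. A sketch, assuming coverage simply counts how many query tokens occur anywhere in the result (function names are illustrative, not from the codebase):

```python
# Position-blind token coverage: both results score identically even though
# only the second contains the query tokens as an adjacent run.

def token_coverage(query, result):
    found = set(result.split())
    return sum(1 for t in query.split() if t in found)

scattered = "a x y z q u b o p r s t c"
adjacent  = "x y z q u a b c"
print(token_coverage("a b c", scattered))  # 3
print(token_coverage("a b c", adjacent))   # 3, same score despite adjacency
```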
This is covered by longer ngrams, but with longer ngrams we lose more coverage in the case of typos (like in the example with the dash).
Positions by themselves make ngrams unstable to, well, relative positions and shifts.
Turned out longer ngrams suck as well, effectively lowering the search quality.
Your example is perfectly valid, but imagine you search for abcde in a string like " ab zz bc zz cd zz abc zz de " (where ngrams occur several times, and relative positions may vary depending on the subset you take). In this case it's a combinatorial task that can't be solved in reasonable time without some simplifying assumptions that will surely be broken in some queries.
We need to depress coverage when matching ngrams have "holes" between them. In other words, if the input sequence is abcd, then "a xxxx b xxxx c xxxx d" should produce significantly smaller coverage than "abcd".
On Tue, Mar 26, 2013 at 10:59 AM, Georg Rudoy notifications@github.comwrote:
That's what I'm talking about — it's easy to determine if there are holes, but it's hard to determine if holes are significant.
I think what's important is the total length of all holes relative to the length of the searched sequence. If I'm searching for a 10-token sequence and there are many holes, the result is poor.
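The penalty proposed here can be sketched as follows. This is an illustration of the idea from the thread, not the shipped implementation: shrink coverage by the total number of non-matching tokens sitting between the first and last match, relative to the query length.

```python
# Illustrative hole-penalized coverage over tokens (assumed scheme).

def coverage_with_holes(query, result):
    q_tokens = set(query.split())
    r_tokens = result.split()
    positions = [i for i, t in enumerate(r_tokens) if t in q_tokens]
    if not positions:
        return 0.0
    base = len({r_tokens[i] for i in positions} & q_tokens) / len(q_tokens)
    # total non-matching tokens between the first and last matched token
    holes = (positions[-1] - positions[0] + 1) - len(positions)
    return base / (1.0 + holes / len(q_tokens))

print(coverage_with_holes("a b c d", "a b c d"))                          # 1.0
print(coverage_with_holes("a b c d", "a x x x x b x x x x c x x x x d"))  # 0.25
```

With this scheme the scattered match scores 0.25 instead of 1.0, which is the "significantly smaller coverage" behavior asked for above.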
On Tue, Mar 26, 2013 at 11:03 AM, Georg Rudoy notifications@github.comwrote:
I'm not sure this is a serious bug, actually. I need an example of BENI results being preferred due to abnormally high coverage.
We have this penalty already.
3D телевизор (3D TV)
Two full matches from different ends of the phrase should not produce cov = 0.9.