barzerman / barzer

barzer engine code
MIT License

stemming bug #602

bodritto closed 11 years ago

bodritto commented 11 years ago

There is a pattern "ипотека", but "ипотеки" does not match for some reason. The same goes for "ипотечная программа" and "ипотечной программы".

user: sveta/rrbank

barzerman commented 11 years ago

I'm looking at it as well. This is bad. We should get these stemming issues fixed once and for all.

bodritto commented 11 years ago

"дебетовых" - "дебетовая"

barzerman commented 11 years ago

cannot reproduce in a small ruleset

barzerman commented 11 years ago

The bug disappears when extrawords (spell/extra tag) are turned off. @0xd34df00d please investigate why the extrawords thing fucks it up. It may be related to the way these words are loaded (not the same way as tokens).

bodritto commented 11 years ago

'мобильного банка' ---X--> 'мобильный банк'

0xd34df00d commented 11 years ago

Seems like stuff gets corrected to that stuff (initial guess from blindly looking at the code).

Could you please provide a minimal reproducing example?

barzerman commented 11 years ago

It reproduces if the word in question is in the dictionary. @inggris has purged it on sveta, but here's a copy of the dictionary which caused it to crash: /home/yanis/public_html/rrbank_extrawords.txt
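To make the reported condition concrete, here is a toy sketch of one way a dictionary hit could suppress stemming. It is only an illustration of the hypothesis discussed in this thread, not barzer's actual code: `toy_stem`, `normalize`, `matches`, the ending list, and the "dictionary hit skips stemming" rule are all assumptions made up for the example.

```python
# Toy model only: NOT barzer's real stemmer or lookup code.
# Assumption: a crude suffix stripper stands in for the real Russian stemmer.
ENDINGS = sorted(["ого", "ая", "ой", "ый", "ых", "а", "и", "ы"],
                 key=len, reverse=True)


def toy_stem(word):
    """Strip the longest matching ending, keeping at least 3 letters."""
    for ending in ENDINGS:
        if word.endswith(ending) and len(word) > len(ending) + 2:
            return word[:-len(ending)]
    return word


def normalize(token, extrawords, use_extrawords):
    # Hypothetical failure mode: a token found in the extrawords
    # dictionary is treated as a known word and is never stemmed.
    if use_extrawords and token in extrawords:
        return token
    return toy_stem(token)


def matches(pattern, query, extrawords=frozenset(), use_extrawords=False):
    """Compare stemmed pattern tokens with normalized query tokens."""
    pattern_stems = [toy_stem(t) for t in pattern.split()]
    query_norm = [normalize(t, extrawords, use_extrawords)
                  for t in query.split()]
    return pattern_stems == query_norm
```

In this model, `matches("ипотека", "ипотеки")` succeeds with extrawords off and fails once "ипотеки" is placed in the dictionary, which mirrors the symptom described above. Whether the real engine fails for this reason is exactly what the thread is trying to establish.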

0xd34df00d commented 11 years ago

Well, after some playing around with rules file and code base I can't reproduce it anymore even from scratch.

barzerman commented 11 years ago

do you have extrawords turned on?

On Wed, Jul 31, 2013 at 1:54 PM, Georg Rudoy notifications@github.com wrote:

Well, after some playing around with rules file and code base I can't reproduce it anymore even from scratch.

www.barzer.net

0xd34df00d commented 11 years ago

Sure.

It'd be much easier if you just put the offending config somewhere.

barzerman commented 11 years ago

Take the config from production and put the right file there.

0xd34df00d commented 11 years ago

Took the config from production and replaced the rules with a single pattern, <t>хуй</t><t>ипотека</t>. The query "хуй ипотеки" still matches. ипотека is present in extrawords.

barzerman commented 11 years ago

How about the whole 1000200 original set?

0xd34df00d commented 11 years ago

That doesn't seem like a minimal reproducing example. Moreover, I'm afraid that getting different results now is a sign of a bigger hidden problem, anything from a local heisenbug to a misunderstanding of the bug description. So IMO the best solution here is to sync our results on a smaller and saner dataset.

Though I'll use the whole dataset if sending me the (presumably) already existing data is that troublesome.

barzerman commented 11 years ago

is this ready to be merged?

0xd34df00d commented 11 years ago

Yep, if you find the changes are OK.