Closed fazalmajid closed 9 years ago
Temboz should also have a report giving stats on the top filters in the last 2 weeks
To identify the rules that are running amok, the following query can help:
select * from (
select rule_text, sum(case when item_loaded < julianday('2015-06-22') then 1 else 0 end) as before,
sum(case when item_loaded > julianday('2015-06-22') then 1 else 0 end) as after
from fm_rules join fm_items on item_rule_uid=rule_uid
where rule_type like '%_word' group by 1
) order by after/(before + 1) desc limit 20;
where 2015-06-22 should be replaced by whenever you deployed the Porter2 changes.
Better yet:
select * from (
select rule_uid, rule_type, rule_text,
sum(case when item_loaded < julianday('2015-06-22') then 1 else 0 end)
as before,
sum(case when item_loaded > julianday('2015-06-22') then 1 else 0 end)
as after
from fm_rules
join fm_items on item_rule_uid=rule_uid
where rule_type like '%_word' and rule_type not like '%exactword'
group by 1,2,3
)
order by after/(before + 1) desc
limit 20;
Mitigation involves:
update fm_rules set rule_type=replace(rule_type, '_word', '_exactword')
where rule_uid in (...,...,...);
Or:
update fm_rules set rule_type=replace(rule_type, '_word', '_exactword')
where rule_uid in (
select rule_uid from (
select rule_uid, rule_type, rule_text,
sum(case when item_loaded < julianday('2015-06-22') then 1 else 0 end)
as before,
sum(case when item_loaded > julianday('2015-06-22') then 1 else 0 end)
as after
from fm_rules
join fm_items on item_rule_uid=rule_uid
where rule_type like '%_word' and rule_type not like '%exactword'
group by 1,2,3
)
order by after/(before + 1) desc
limit 20
);
The Porter2 stemming algorithm introduced in 6488fa8e1bf0a4de900a5daaac32fd13091f009b has an unfortunate side-effect of increased false positives. One example: the stem for "wellness" is "well", which catches way too much (it's not a stop word, but close).
To address this: