fazalmajid / temboz

The Temboz RSS/Atom feed reader

Stemming causes false positives in filters #114

Closed fazalmajid closed 9 years ago

fazalmajid commented 9 years ago

The Porter2 stemming algorithm introduced in 6488fa8e1bf0a4de900a5daaac32fd13091f009b has the unfortunate side effect of increasing false positives. One example: the stem for "wellness" is "well", which matches far too much (it's not a stop word, but it is close to one).

To address this:

  1. If the user chooses word-based rules, the text box should reflect the stemmed version.
  2. There should be new options title_word_exact, content_word_exact and union_word_exact that implement the old pre-Porter2 algorithm.
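
To illustrate the difference between stemmed and exact word rules, here is a minimal sketch. It is not Temboz's code: `toy_stem`, `tokenize` and `rule_matches` are hypothetical names, `toy_stem` is a stand-in that only reproduces the "wellness" to "well" case rather than the real Porter2 algorithm, and the pre-Porter2 behaviour is assumed to be a literal whole-word match.

```python
import re


def toy_stem(word):
    """Hypothetical stand-in for a Porter2 stemmer (NOT the real algorithm).

    It only strips a trailing "ness", which is enough to reproduce the
    "wellness" -> "well" case from this issue.
    """
    if word.endswith("ness") and len(word) > 6:
        return word[:-4]
    return word


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def rule_matches(rule_word, text, exact):
    """Return True if a single-word rule fires on the text."""
    tokens = tokenize(text)
    if exact:
        # Assumed pre-stemming behaviour: the literal word must appear.
        return rule_word.lower() in tokens
    # Stemmed behaviour: compare stems on both sides.
    return toy_stem(rule_word.lower()) in {toy_stem(t) for t in tokens}


title = "How well does the new battery hold up?"
stemmed_hit = rule_matches("wellness", title, exact=False)
exact_hit = rule_matches("wellness", title, exact=True)
print(stemmed_hit, exact_hit)  # True False
```

With stemming, a rule on "wellness" fires on any title containing "well"; the exact variant only fires when "wellness" itself appears.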

fazalmajid commented 9 years ago

Temboz should also have a report giving stats on the top filters over the last 2 weeks.
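
Such a report could be built on a query like the one sketched below. This is only an illustration under assumptions: the table and column names (fm_rules, fm_items, rule_uid, item_rule_uid, item_loaded as a Julian day) are taken from the queries in this thread, the item_uid column and the sample rows are made up, and the script runs against an in-memory database rather than a real Temboz one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal schema plus made-up sample data.
conn.executescript("""
create table fm_rules (rule_uid integer primary key, rule_type text, rule_text text);
create table fm_items (item_uid integer primary key, item_rule_uid integer, item_loaded real);
insert into fm_rules values
  (1, 'title_word', 'wellness'),
  (2, 'content_word', 'sponsored');
insert into fm_items (item_rule_uid, item_loaded) values
  (1, julianday('now') - 1),
  (1, julianday('now') - 3),
  (1, julianday('now') - 10),
  (2, julianday('now') - 2),
  (2, julianday('now') - 30);
""")

# Count filtered items per rule over the last 14 days, busiest rules first.
top_filters = conn.execute("""
select rule_uid, rule_type, rule_text, count(*) as hits
from fm_rules
join fm_items on item_rule_uid = rule_uid
where item_loaded > julianday('now') - 14
group by rule_uid, rule_type, rule_text
order by hits desc
limit 20
""").fetchall()

for row in top_filters:
    print(row)
```

On the sample data the "wellness" rule comes first with 3 hits, and the "sponsored" rule follows with 1 because its 30-day-old item falls outside the window.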

fazalmajid commented 8 years ago

To identify the rules that are running amok, the following query can help:

select * from (
  select rule_text, sum(case when item_loaded < julianday('2015-06-22') then 1 else 0 end) as before,
  sum(case when item_loaded > julianday('2015-06-22') then 1 else 0 end) as after
  from fm_rules join fm_items on item_rule_uid=rule_uid
  where rule_type like '%_word' group by 1
) order by after * 1.0 / (before + 1) desc limit 20;

Replace 2015-06-22 with the date you deployed the Porter2 changes.
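
To see what the ranking looks like, the query can be tried on an in-memory database. The sketch below assumes a minimal schema inferred from the query (the item_uid column and all sample rows are made up), and it multiplies by 1.0 in the order by clause so that SQLite does not truncate the ratio with integer division.

```python
import sqlite3

DEPLOYED = "2015-06-22"  # placeholder deployment date from the comment above

conn = sqlite3.connect(":memory:")
# Assumed minimal schema; the rules below are made-up examples.
conn.executescript("""
create table fm_rules (rule_uid integer primary key, rule_type text, rule_text text);
create table fm_items (item_uid integer primary key, item_rule_uid integer, item_loaded real);
insert into fm_rules values
  (1, 'title_word', 'wellness'),
  (2, 'title_word', 'sponsored'),
  (3, 'title_phrase', 'breaking news');
""")


def add_hits(rule_uid, day, count):
    """Insert `count` filtered items for a rule, all loaded on `day`."""
    conn.executemany(
        "insert into fm_items (item_rule_uid, item_loaded) values (?, julianday(?))",
        [(rule_uid, day)] * count,
    )


add_hits(1, "2015-06-01", 1)  # "wellness": 1 hit before, 5 after
add_hits(1, "2015-07-01", 5)
add_hits(2, "2015-06-01", 4)  # "sponsored": 4 hits before, 4 after
add_hits(2, "2015-07-01", 4)
add_hits(3, "2015-07-01", 2)  # phrase rule, excluded by the rule_type filter

rows = conn.execute("""
select * from (
  select rule_text,
    sum(case when item_loaded < julianday(:d) then 1 else 0 end) as before,
    sum(case when item_loaded > julianday(:d) then 1 else 0 end) as after
  from fm_rules join fm_items on item_rule_uid = rule_uid
  where rule_type like '%_word' group by 1
) order by after * 1.0 / (before + 1) desc limit 20
""", {"d": DEPLOYED}).fetchall()
print(rows)  # [('wellness', 1, 5), ('sponsored', 4, 4)]
```

The "wellness" rule ranks first because its hits jumped after the deployment date, while the "sponsored" rule stayed flat.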

fazalmajid commented 8 years ago

Better yet, include the rule_uid and skip rules that are already exact-word:

select * from (
  select rule_uid, rule_type, rule_text,
    sum(case when item_loaded < julianday('2015-06-22') then 1 else 0 end)
      as before,
    sum(case when item_loaded > julianday('2015-06-22') then 1 else 0 end)
      as after
  from fm_rules
  join fm_items on item_rule_uid=rule_uid
  where rule_type like '%_word' and rule_type not like '%exactword'
  group by 1,2,3
)
order by after * 1.0 / (before + 1) desc
limit 20;

fazalmajid commented 8 years ago

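Mitigation involves switching the offending rules to exact-word matching, using the rule_uid values from the query above: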

update fm_rules set rule_type=replace(rule_type, '_word', '_exactword')
where rule_uid in (...,...,...);

fazalmajid commented 8 years ago

Or, to convert the top 20 offenders in one step:

update fm_rules set rule_type=replace(rule_type, '_word', '_exactword')
where rule_uid in (
  select rule_uid from (
    select rule_uid, rule_type, rule_text,
      sum(case when item_loaded < julianday('2015-06-22') then 1 else 0 end)
        as before,
      sum(case when item_loaded > julianday('2015-06-22') then 1 else 0 end)
        as after
    from fm_rules
    join fm_items on item_rule_uid=rule_uid
    where rule_type like '%_word' and rule_type not like '%exactword'
    group by 1,2,3
  )
  order by after * 1.0 / (before + 1) desc
  limit 20
);
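
Before running either update against a live database, it may be worth previewing the change on a copy. The following sketch shows one way to do that from Python; the schema is an assumed minimal one, the sample rules and the `suspects` list are hypothetical, and an in-memory database stands in for the real file.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal schema with made-up rules.
conn.executescript("""
create table fm_rules (rule_uid integer primary key, rule_type text, rule_text text);
insert into fm_rules values
  (1, 'title_word', 'wellness'),
  (2, 'content_word', 'sponsored'),
  (3, 'union_word', 'giveaway');
""")

suspects = [1, 3]  # hypothetical rule_uids picked from the ranking query
placeholders = ",".join("?" for _ in suspects)

# Preview: show the current and the new rule_type without changing anything.
preview = conn.execute(
    "select rule_uid, rule_type, replace(rule_type, '_word', '_exactword') "
    f"from fm_rules where rule_uid in ({placeholders}) order by rule_uid",
    suspects,
).fetchall()
print(preview)

# Apply the change; the connection context manager commits on success
# and rolls back if the statement raises.
with conn:
    conn.execute(
        "update fm_rules set rule_type = replace(rule_type, '_word', '_exactword') "
        f"where rule_uid in ({placeholders})",
        suspects,
    )

result = conn.execute(
    "select rule_uid, rule_type from fm_rules order by rule_uid"
).fetchall()
print(result)
```

Only the listed rules are converted; rule 2 keeps its stemmed content_word type.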