ScottPeterJohnson / purelymail-issues

Issues repository for the Purelymail email service.

SpamAssassin per-user database training #75

Open ScottPeterJohnson opened 2 years ago

ScottPeterJohnson commented 2 years ago

(This issue was imported from Gitea) stephan on July 15, 2021: Hi, I’m new to Purelymail - and overall very happy with the service. Thanks, Scott :-)

I have a question/feature suggestion regarding the training of the spam filter.

Every day, a few spam mails get through to my Inbox, which I then mark as spam (move to the "Junk" folder). That should train SpamAssassin for "spam".

Per the spam filtering FAQ, for the per-user database to kick in, I need to train SpamAssassin with 200 non-spam messages as well. Since no non-spam messages ever end up in the Junk folder (no false positives, good), I don’t have any “good” messages to move out of the "Junk" folder for training SpamAssassin.

So, my question is this: Is there another way to "show" "good" messages to SpamAssassin (e.g. all the messages in my "Archive" folder) for training?

ScottPeterJohnson commented 2 years ago

Comment by Scott on July 17, 2021: It looks like there isn't at the moment, which does somewhat limit the usefulness of SpamAssassin's Bayes classifier. At one point all mail added by a user to the inbox was learned as ham, but that tended to bloat the spam DB and cause issues with large imports.

At some point I plan to replace SpamAssassin's Bayes classifier with custom learned neural nets, but until then I can add a page for viewing SA status and manually adding mails.
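For readers running their own SpamAssassin, the "SA status" mentioned here roughly corresponds to the counters printed by `sa-learn --dump magic`. The following is a minimal sketch, not Purelymail's implementation. It assumes the usual column layout of that output (the learned count in the third column, followed by a `non-token data:` label) and SpamAssassin's default thresholds of 200 for `bayes_min_ham_num` and `bayes_min_spam_num`. The helper names `parse_magic` and `bayes_ready` are hypothetical.

```python
# Assumed defaults for SpamAssassin's bayes_min_ham_num / bayes_min_spam_num.
MIN_HAM = 200
MIN_SPAM = 200


def parse_magic(dump_text):
    """Extract the nspam/nham counters from `sa-learn --dump magic` style text.

    Assumes each counter line looks like:
        0.000          0       1234          0  non-token data: nspam
    where the third column holds the value.
    """
    counts = {}
    for line in dump_text.splitlines():
        head, sep, name = line.partition("non-token data:")
        if not sep:
            continue  # not a counter line
        fields = head.split()
        name = name.strip()
        if name in ("nspam", "nham") and len(fields) >= 3:
            counts[name] = int(fields[2])
    return counts


def bayes_ready(counts, min_ham=MIN_HAM, min_spam=MIN_SPAM):
    """Return True once both ham and spam counts reach their thresholds."""
    return counts.get("nham", 0) >= min_ham and counts.get("nspam", 0) >= min_spam
```

In practice the text would come from something like `subprocess.run(["sa-learn", "--dump", "magic"], capture_output=True, text=True).stdout`. A user with plenty of learned spam but few learned ham messages, as described in this issue, would see `bayes_ready` return `False`.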

(Sorry for the late response, on holiday of sorts)

ScottPeterJohnson commented 2 years ago

Comment by stephan on July 17, 2021: (Sorry for closing the issue. I didn't mean to. I opened the issue to leave a comment a few hours ago, then left. When I returned to that tab, a full quote reply was submitted, which apparently also closed the issue. I was able to delete the full quote reply.)

> [...] but until then I can add a page for viewing SA status and manually adding mails.

I think that would be great.

Just a thought: I'm not sure how many users have (and use) an "Archive" folder. I would assume most, as it's an option in the Roundcube webmail. Maybe - to keep it simple - one could just use those mails (or a sample thereof) as "good" to train SpamAssassin.
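On a self-hosted SpamAssassin setup, the "sample thereof" idea could be sketched as follows. This is only an illustration under stated assumptions: that the Archive folder is stored in Maildir format (one file per message under `cur/`), which may not match how Purelymail stores mail, and that a capped sample of 200 (the threshold from the FAQ) is enough to avoid the database bloat mentioned earlier in the thread. `list_maildir`, `sample_messages`, and `ham_command` are hypothetical helper names. `sa-learn --ham` is the real SpamAssassin command for learning messages as non-spam.

```python
import random
from pathlib import Path


def list_maildir(folder):
    """List message files in a Maildir-style folder (assumed layout: <folder>/cur/*)."""
    return sorted(p for p in (Path(folder) / "cur").glob("*") if p.is_file())


def sample_messages(paths, limit=200, seed=0):
    """Pick at most `limit` messages so a large archive cannot bloat the Bayes DB."""
    paths = list(paths)
    if len(paths) <= limit:
        return paths
    # A seeded generator keeps the sample reproducible between runs.
    return random.Random(seed).sample(paths, limit)


def ham_command(paths):
    """Build the sa-learn invocation that learns the given files as ham."""
    return ["sa-learn", "--ham", *(str(p) for p in paths)]
```

The resulting list can be passed to `subprocess.run(..., check=True)`. The path given to `list_maildir` depends entirely on the local mail store and is not specified anywhere in this thread.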

In any case, thank you very much and please enjoy your holiday, Scott :-)

ScottPeterJohnson commented 2 years ago

Comment by rnkn on July 26, 2021: As an aside @Scott I know a while back you had an instance failure due to SpamAssassin eating up memory; I happened to have read about someone switching to rspamd recently and so thought I might share: https://dataswamp.org/~solene/2021-07-13-smtpd-rspamd.html

ScottPeterJohnson commented 2 years ago

Comment by seth on February 10, 2022: @Scott - I have 4 years' worth of known spam (about 20k messages) that you can have if you want to train a classifier. Just let me know.