RFC: Message Anti-Spam Module

lasley commented 7 years ago

We are currently planning a new module to combat spam messages in Odoo. The plan at the moment is to implement a basic backpropagation algorithm similar to this.

The algorithm will be trained initially using some open source datasets like Enron, SpamAssassin, LingSpam, etc.

It will create a private Junk channel for each user, automatically filtering identified spam to it. Any item moved out of junk would be marked as a false positive, and anything moved into junk would be considered a false negative. TBD how this will play into the learning algorithm.
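As a rough illustration of the feedback idea above, the sketch below turns a user's move into or out of Junk into a corrected training label. The names `Feedback`, `feedback_from_move`, and the `SPAM`/`HAM` constants are hypothetical, not part of Odoo or of any planned module API; how these labels would feed the learner is still TBD.

```python
from dataclasses import dataclass

# Hypothetical label convention: +1 for spam, -1 for legitimate mail ("ham").
SPAM, HAM = 1, -1


@dataclass
class Feedback:
    message_id: int
    label: int   # corrected label that could be fed back to the learner
    kind: str    # "false_positive" or "false_negative"


def feedback_from_move(message_id, moved_into_junk):
    """Translate a user's manual move into a corrected training label."""
    if moved_into_junk:
        # The filter let the message through, but the user says it is spam.
        return Feedback(message_id, SPAM, "false_negative")
    # The filter junked the message, but the user rescued it.
    return Feedback(message_id, HAM, "false_positive")
```

Collecting these records per user would also give a simple way to measure the filter's error rate over time.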

Messages would be evaluated in the create method of mail.message.

Spam scanning will need to be strictly opt-in. I'm thinking it makes sense to put this at the model level; something similar to the ir.model.website_form_access used for forms, with the mail.message.res_model dictating the model.

Challenges I currently foresee:

- The obvious - training the algorithm
- Support of multiple languages
- Handling of spam not directed to specific users (projects, help desk, blog, crm)
- Storage and subsequent search of data. Pickle looks like it may work, but upgrades will be a challenge - we will need to add versioning, re-training, and A/B testing.
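One way the versioning concern could be handled is to store a format version next to the pickled model and refuse to load anything stale, so the caller knows to re-train. This is only a sketch: `MODEL_FORMAT_VERSION`, `StaleModelError`, `dump_model`, and `load_model` are hypothetical names, and where the blob actually lives (attachment, binary field, file) is left open.

```python
import pickle

# Hypothetical version number; bump it whenever the model layout changes.
MODEL_FORMAT_VERSION = 2


class StaleModelError(Exception):
    """Raised when a stored model predates the current format."""


def dump_model(model, version=MODEL_FORMAT_VERSION):
    # Store the version alongside the model so it can be checked on load.
    return pickle.dumps({"version": version, "model": model})


def load_model(blob, expected_version=MODEL_FORMAT_VERSION):
    # Only unpickle data the server wrote itself; pickle is unsafe on
    # untrusted input.
    payload = pickle.loads(blob)
    found = payload.get("version")
    if found != expected_version:
        raise StaleModelError(
            "stored model is version %r, expected %r" % (found, expected_version)
        )
    return payload["model"]
```

A caller would catch `StaleModelError` after a module upgrade and schedule a re-training job instead of scoring messages with an incompatible model.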

Nice to haves:

- Model training on a user level (Cindy may like spam from CompanyA, but John might not) - this will require training the SVM in an incremental manner, instead of in batch.
  - This is not so easy in sklearn, and we will be limited to the linear case using Stochastic Gradient Descent (SGD). We would need to use something like Pegasos instead.
- Client/Server model, allowing for centralized & offloaded training
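For a feel of what incremental, per-user training could look like, here is a minimal pure-Python sketch of a linear SVM updated one message at a time with Pegasos-style sub-gradient steps (the optional projection step is omitted). The class name, the hashed bag-of-words features, and the values of `n_features` and `lam` are all assumptions for illustration, not a proposed design.

```python
import re
import zlib


class PegasosSpamFilter:
    """Linear SVM trained one message at a time (Pegasos-style updates)."""

    def __init__(self, n_features=2 ** 16, lam=0.01):
        self.n_features = n_features  # size of the hashed feature space
        self.lam = lam                # regularisation strength (assumed value)
        self.weights = {}             # sparse weights: feature index -> weight
        self.t = 0                    # number of updates seen so far

    def _features(self, text):
        # Hashed bag of words; crc32 is stable across processes, unlike hash().
        counts = {}
        for token in re.findall(r"[a-z0-9']+", text.lower()):
            idx = zlib.crc32(token.encode("utf-8")) % self.n_features
            counts[idx] = counts.get(idx, 0.0) + 1.0
        return counts

    def score(self, text):
        feats = self._features(text)
        return sum(self.weights.get(i, 0.0) * v for i, v in feats.items())

    def is_spam(self, text):
        return self.score(text) > 0

    def learn(self, text, label):
        """Apply one update; label is +1 for spam, -1 for ham."""
        feats = self._features(text)
        self.t += 1
        eta = 1.0 / (self.lam * self.t)  # decaying step size
        margin = label * sum(
            self.weights.get(i, 0.0) * v for i, v in feats.items()
        )
        # Regularisation step: shrink every existing weight.
        shrink = 1.0 - eta * self.lam
        for i in list(self.weights):
            self.weights[i] *= shrink
        # Hinge-loss step: only move towards the example if it violates
        # the margin.
        if margin < 1:
            for i, v in feats.items():
                self.weights[i] = self.weights.get(i, 0.0) + eta * label * v
```

Keeping one such instance per user would let each Junk move call `learn()` once, with no batch re-training.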

Additional reading for anyone interested:

rafaelbn commented 7 years ago

😲 That could be great!

tarteo commented 7 years ago

@rafaelbn agree!

hbrunn commented 7 years ago

Sounds technically interesting, but what's your reasoning for not letting the mail server do this? And maybe add a module that reacts to the headers that SpamAssassin (or whichever filter is in use) sets?
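A header-based module along these lines could be quite small. The sketch below reads the `X-Spam-Flag` and `X-Spam-Status` headers that SpamAssassin adds by default; the function name `spamassassin_verdict` is hypothetical, and a mail admin may have renamed or disabled these headers, so treat the header names as an assumption about the deployment.

```python
from email import message_from_string


def spamassassin_verdict(raw_message):
    """Return True (spam), False (not spam), or None (no verdict headers)."""
    msg = message_from_string(raw_message)
    flag = (msg.get("X-Spam-Flag") or "").strip().upper()
    status = (msg.get("X-Spam-Status") or "").strip().lower()
    # SpamAssassin typically writes "X-Spam-Flag: YES" and a status line
    # starting with "Yes, score=..." for messages it considers spam.
    if flag == "YES" or status.startswith("yes"):
        return True
    if flag == "NO" or status.startswith("no"):
        return False
    # The message was not scanned, or the headers were stripped.
    return None
```

Returning `None` for unscanned messages keeps "not spam" distinct from "nobody checked", which matters for messages that never pass through the mail server.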

yajo commented 7 years ago

On one hand, it's an amazing project. Some ideas that came to mind:

- A good job for TensorFlow? That's what Google uses AFAIK...
- Maybe you could benefit from magma project.

On the other hand, @hbrunn's idea might be much easier to implement, with quite similar results.

lasley commented 7 years ago

@hbrunn - My main intent behind this is not necessarily to combat messages coming through email, but instead messages coming through the web interface.

One of the current blockers for me getting rid of Wordpress is the ability to have a blog that isn't immediately spammed to hell. I currently use Akismet, but it's non-free so I don't want to bother making an Odoo plugin.

@Yajo - TensorFlow looks much easier than sklearn, thanks for the pointer. But what is this magma project you speak of?

yajo commented 7 years ago

OK, I made a couple of mistakes there. I didn't remember the name of the man who was in charge of the project (I found him now: @OSevangelist :blush:), and the project is Mackma Project, not Magma :laughing:. He spoke about it at the last OCA sprint. They are planning to implement big data management inside Odoo by adding a Hadoop backend to its ORM, if I'm not mistaken; I hope he can tell us whether this issue would fit in their project (or vice-versa).

yajo commented 7 years ago

BTW, this might give you some ideas too @lasley: https://github.com/OCA/website/tree/10.0/website_crm_recaptcha

lasley commented 7 years ago

Oh sweet. Yeah, as I've been learning the ML required for this, I've realized it can be used for a lot of other things. Mackma Project sounds interesting; I was unaware that it was being abstracted for base use. I'm not really able to find much information on it though, so maybe @OSevangelist can enlighten us a bit more?

Hah I made that ReCaptcha plugin @Yajo - it's my spam stopgap for CRM for the moment, but I don't want a captcha on my blog comments. Some spammers are just incredibly low wage humans too, which isn't stopped by the ReCaptcha unfortunately.

lasley commented 7 years ago

Alright so the machine learning on this ended up being an insane rabbit hole. I learned a lot, but I think anything I implement will still be sub-par.

That said, I recently found PyZor, which could serve our needs. It's basically just a crowd-sourced message signature checker, which works pretty damned well from my initial tests.
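To illustrate the general idea of a signature checker (this is not PyZor's actual digest algorithm or client API), the sketch below hashes a normalised message body and looks it up in a store of reported signatures. `message_digest`, `SignatureStore`, and `is_spam` are hypothetical names, and the in-memory dict stands in for the crowd-sourced server.

```python
import hashlib
import re


def message_digest(body):
    """Hash a normalised body so trivially mutated copies still match."""
    text = body.lower()
    text = re.sub(r"https?://\S+", "", text)  # drop URLs (often randomised)
    text = re.sub(r"\S+@\S+", "", text)       # drop email addresses
    text = re.sub(r"[\d\s]+", "", text)       # drop digits and whitespace
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


class SignatureStore:
    """In-memory stand-in for a shared database of reported signatures."""

    def __init__(self):
        self.reports = {}

    def report(self, body):
        digest = message_digest(body)
        self.reports[digest] = self.reports.get(digest, 0) + 1

    def count(self, body):
        return self.reports.get(message_digest(body), 0)


def is_spam(store, body, threshold=1):
    # Treat a message as spam once enough people have reported its signature.
    return store.count(body) >= threshold
```

Because only digests are compared, nothing needs to be trained locally, which is the main appeal over the machine learning route.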

Has anyone used PyZor before? The authors seem to have some other interesting spam-related projects, such as an open-source drop-in replacement for SpamAssassin, so it seems they know what they're doing.

yajo commented 7 years ago

No experience with that at all, but it seems a good choice. Honestly, I don't feel able to judge a machine learning PR, but a PR that uses a library can be easily evaluated. I feel we are not the anti-spam design crew.

liebana commented 7 years ago

We are going to take a look at PyZor. Does anyone have any news to share?

lasley commented 7 years ago

I've played with it now and think it's a good fit for this project. At this point, it's successfully identified all spam that has come through on an unsecured form honeytrap.

We're a bit overloaded at the moment though & this is somewhat of a side thing for me, so I haven't been able to allocate any dev time to create the actual module. IMO the hardest part is going to be the workflow - PyZor was a few lines.

lasley commented 7 years ago

Closing to track in #193

OCA / social

RFC: Message Anti-Spam Module #121