Closed by lasley 7 years ago
😲 That could be great!
@rafaelbn agree!
sounds technically interesting, but what's your reasoning not to let the mailserver do this? And maybe add a module that reacts to the headers spamassassin or whoever sets?
On one side, it's an amazing project, some ideas that came to my mind:
On the other side, @hbrunn's idea might be much easier to implement, with quite close results.
@hbrunn - My main intent behind this is not necessarily to combat messages coming through email, but instead messages coming through the web interface.
One of the current blockers for me getting rid of Wordpress is the ability to have a blog that isn't immediately spammed to hell. I currently use Akismet, but it's non-free so I don't want to bother making an Odoo plugin.
@Yajo - TensorFlow looks much easier than sklearn, thanks for the pointer. But what is this magma project you speak of?
OK I had a couple of mistakes there. I didn't remember the name of the man who was in charge of the project (I found him now: @OSevangelist :blush:), and the project is Mackma Project, not Magma :laughing:. He spoke about it at the last OCA sprint. They are planning to implement big data management inside Odoo by adding a Hadoop backend to its ORM, if I'm not wrong; I hope he can tell us whether this issue would fit in their project (or vice-versa).
BTW, this might give you some ideas too @lasley: https://github.com/OCA/website/tree/10.0/website_crm_recaptcha
Oh sweet yeah as I've been learning the ML required for this, I've realized it can be used for a lot of other things. Mackma Project sounds interesting, I was unaware that it was being abstracted for base use. I'm not really able to find much information on it though, so maybe @OSevangelist can enlighten a bit more?
Hah I made that ReCaptcha plugin @Yajo - it's my spam stopgap for CRM for the moment, but I don't want a captcha on my blog comments. Some spammers are just incredibly low wage humans too, which isn't stopped by the ReCaptcha unfortunately.
Alright so the machine learning on this ended up being an insane rabbit hole. I learned a lot, but I think anything I implement will still be sub-par.
That said, I recently found PyZor, which could serve our needs. It's basically just a crowd-sourced message signature checker, which works pretty damned well from my initial tests.
Has anyone used PyZor before? The authors seem to have some other interesting Spam things such as an OS drop-in replacement for SpamAssassin, so it seems they know what they're doing.
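To make the "crowd-sourced message signature checker" idea concrete, here is a minimal sketch of the general technique. It does not use PyZor and does not reproduce PyZor's digest algorithm or network protocol; `message_signature` and `SignatureStore` are hypothetical names, and the normalization rules are assumptions chosen only for illustration.

```python
import hashlib
import re


def message_signature(text):
    """Return a digest of the message with volatile details normalized away.

    Assumption: lowercasing and stripping URLs, email addresses and whitespace
    is enough to make near-identical spam collide. Real checkers are smarter.
    """
    normalized = text.lower()
    normalized = re.sub(r"https?://\S+", "", normalized)  # drop URLs
    normalized = re.sub(r"\S+@\S+", "", normalized)       # drop email addresses
    normalized = re.sub(r"\s+", "", normalized)           # drop all whitespace
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class SignatureStore:
    """In-memory stand-in for a shared database of reported signatures."""

    def __init__(self):
        self._reports = {}  # signature -> number of reports

    def report(self, text):
        """Record one spam report for this message's signature."""
        sig = message_signature(text)
        self._reports[sig] = self._reports.get(sig, 0) + 1

    def is_spam(self, text, threshold=1):
        """A message counts as spam once enough users have reported it."""
        return self._reports.get(message_signature(text), 0) >= threshold


store = SignatureStore()
store.report("Buy CHEAP pills at http://a.example now")
print(store.is_spam("buy cheap   pills at http://b.example NOW"))
print(store.is_spam("Hello, great post"))
```

The point of the design is that no model has to be trained locally: a message is flagged only because other people have already reported something with the same signature.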
No experience with that at all, but it seems a good choice. Honestly, I don't feel able to judge a machine learning PR, but a PR that uses a library can be easily evaluated. I feel we are not the anti-spam design crew.
We are going to take a look at PyZor, does anyone have any news to share?
I've played with it now and think it's a good fit for this project. At this point, it's successfully identified all spam that has come through on an unsecured form honeytrap.
We're a bit overloaded at the moment though & this is somewhat of a side thing for me, so I haven't been able to allocate any dev time to create the actual module. IMO the hardest part is going to be the workflow - PyZor was a few lines.
Closing to track in #193
We are currently planning a new module to combat spam messages in Odoo. The plan at the moment is to implement a basic backpropagation algorithm similar to this.
The algorithm will be trained initially using some open source datasets like Enron, SpamAssassin, LingSpam, etc.
It will create a private Junk channel for each user, automatically filtering identified spam to it. Any item moved out of junk would be marked as a false positive, and anything moved into junk would be considered a false negative. TBD how this will play into the learning algorithm.
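The Junk channel feedback rule described above could be sketched roughly as follows. This is plain Python rather than Odoo code; the `Feedback` record, the `JUNK` constant and the `feedback_for_move` helper are hypothetical names, and how the resulting labels feed back into training is left open, as in the proposal.

```python
from dataclasses import dataclass

JUNK = "junk"  # assumed identifier for a user's private Junk channel


@dataclass
class Feedback:
    message_id: int
    label: str  # corrected label: "spam" or "ham"
    kind: str   # "false_positive" or "false_negative"


def feedback_for_move(message_id, from_channel, to_channel):
    """Translate a user moving a message into a corrected training label.

    Moving a message out of Junk means the filter was wrong to flag it
    (false positive); moving one into Junk means the filter missed it
    (false negative). Any other move carries no spam feedback.
    """
    if from_channel == JUNK and to_channel != JUNK:
        return Feedback(message_id, "ham", "false_positive")
    if from_channel != JUNK and to_channel == JUNK:
        return Feedback(message_id, "spam", "false_negative")
    return None


print(feedback_for_move(1, "junk", "inbox"))
print(feedback_for_move(2, "inbox", "junk"))
```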
Messages would be evaluated in the create method of mail.message.
Spam scanning will need to be strictly opt-in. I'm thinking it makes sense to put this at the model level; something similar to the ir.model.website_form_access used for forms, with the mail.message.res_model dictating the model.
Challenges I currently foresee:
Nice to haves:
CompanyA, but John might not) - this will require training the SVM in an incremental manner, instead of batch.
sklearn, and we will be limited to linear case using Stochastic Gradient Descent (SGD). Would need to use something like pagasos instead
Additional reading for anyone interested: