fmbla / spamassassin-concepts


Is it just me or are a lot of concepts matching on 1 single word? #2

Open ovizii opened 8 years ago

ovizii commented 8 years ago

An email I received in reply to a technical support ticket I had opened was tagged with the following concepts: hello, https, newsletter, re, fwd, satisfaction, ticket, good-day, stranger, time-ref, news, regards, thankyou, great

At first glance, I'd expect hello, newsletter, time-ref, news, regards and thankyou to match ham as often as spam. Isn't this diluting my Bayes DB? Shouldn't the concepts be matched more strictly?

Maybe I'm completely wrong here, but do you have any figures or stats you can share so we can get a better idea of how this is working out for you?

steadramon commented 8 years ago

As mentioned in the previous issue (#1), the Bayes tokens created by the Concepts plugin are distinct from the plain words appearing in the body: the token for the concept "hello" is not the same as the body token "hello".
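To illustrate the distinction, here is a minimal Python sketch of how a concept token and a body token for the same word can be counted separately. The `concept:` prefix and both helper functions are hypothetical names chosen for this example; the actual token format the plugin uses is not shown in this thread.

```python
from collections import Counter

# Hypothetical namespace prefix, used only for illustration.
CONCEPT_PREFIX = "concept:"


def body_token(word: str) -> str:
    """Token for a plain word found in the message body."""
    return word.lower()


def concept_token(name: str) -> str:
    """Token for a matched concept, kept apart from body words by a prefix."""
    return CONCEPT_PREFIX + name.lower()


# The word "hello" appears twice in the body; the "hello" concept matched once.
counts = Counter()
for word in ["Hello", "hello"]:
    counts[body_token(word)] += 1
counts[concept_token("hello")] += 1

# The two tokens share a name but accumulate statistics independently.
print(counts[body_token("hello")], counts[concept_token("hello")])
```

Because the two keys never collide, learning a message as ham or spam updates the concept's statistics without touching those of the body word.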

The project was done as a proof of concept. My initial testing showed that the concepts idea worked in principle, but that it would take considerable effort to keep the matching concepts up to date with trends. Overall, I found that common concepts did indeed have a 50/50 ham/spam ratio, but some more specific concepts and combinations were good indicators. There just need to be more of them.
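A rough way to see why a 50/50 concept adds little is to compute a per-token spam probability from learned counts. The sketch below uses a simplified frequency ratio, not SpamAssassin's actual Bayes computation, and the function names and example counts are made up for illustration.

```python
def token_spam_probability(spam_hits: int, ham_hits: int,
                           total_spam: int, total_ham: int) -> float:
    """Simplified estimate of how strongly a token points to spam.

    Compares the fraction of spam messages containing the token with the
    fraction of ham messages containing it. 0.5 means "no information".
    """
    if total_spam <= 0 or total_ham <= 0:
        raise ValueError("need at least one learned spam and one learned ham")
    spam_rate = spam_hits / total_spam
    ham_rate = ham_hits / total_ham
    if spam_rate + ham_rate == 0:
        return 0.5  # token never seen: treat as neutral
    return spam_rate / (spam_rate + ham_rate)


def informativeness(probability: float) -> float:
    """Distance from the neutral point 0.5; larger means more useful."""
    return abs(probability - 0.5)


# Invented counts: a common concept seen equally in both classes,
# and a more specific concept seen mostly in spam.
common = token_spam_probability(50, 50, 100, 100)
specific = token_spam_probability(90, 10, 100, 100)
print(common, specific)
```

Under this simplification the common concept sits at the neutral point, while the specific one moves well away from it, which matches the observation that only the more specific concepts were good indicators.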

ovizii commented 8 years ago

What would be the right way to go about this?

a) Create concepts which match either bad or good emails?
b) Create concepts which match both good and bad emails, and rely on Bayes learning which combinations of concepts are good and which are bad?

ovizii commented 8 years ago

Forgot to ask: how did you measure the impact the concepts have? You mentioned a 50/50 ratio...

steadramon commented 8 years ago

B is the more likely scenario, though I can see where A would come into play.

With a bit of fudging it is possible to get the tokens and their stats back out of the Bayes DB. The regex option of the sa-learn CLI tool doesn't work as I'd expect, so I still have to look at getting the stats out properly.
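One possible approach is to dump the database with `sa-learn --dump data` and filter the output in a script instead of relying on the tool's regex option. The Python sketch below assumes each dump line has five whitespace-separated columns (probability, spam count, ham count, last-access time, token); that column layout is an assumption to check against your own output, and `TokenStat` and `parse_dump` are names invented for this example.

```python
from typing import Iterable, Iterator, NamedTuple


class TokenStat(NamedTuple):
    probability: float
    spam_count: int
    ham_count: int
    atime: int
    token: str


def parse_dump(lines: Iterable[str]) -> Iterator[TokenStat]:
    """Yield one TokenStat per token line, skipping anything unparseable.

    Assumed column order: probability, spam count, ham count, atime, token.
    """
    for line in lines:
        parts = line.split(None, 4)
        if len(parts) != 5:
            continue
        prob, spam, ham, atime, token = parts
        token = token.strip()
        if token.startswith("non-token data"):
            continue  # database metadata rows, not tokens
        try:
            yield TokenStat(float(prob), int(spam), int(ham), int(atime), token)
        except ValueError:
            continue  # line did not match the assumed layout


# Made-up sample lines in the assumed layout.
sample = [
    "0.000          0          3          0  non-token data: bayes db version",
    "0.987         12          1 1500000000  1a2b3c4d5e",
    "0.500          7          7 1500000001  9f8e7d6c5b",
]
stats = list(parse_dump(sample))
neutral = [s.token for s in stats if abs(s.probability - 0.5) < 0.1]
print(len(stats), neutral)
```

In practice the lines could come from `subprocess.run(["sa-learn", "--dump", "data"], capture_output=True, text=True).stdout.splitlines()`. If the database stores tokens in hashed form, mapping a dumped token back to a concept name would need an extra step that is not covered here.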