ovizii opened this issue 8 years ago
As mentioned in the previous issue #1, the Bayes tokens created by the Concepts plugin differ from the plain words appearing in the body: the token for the concept "hello" differs from the body token "hello".
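A minimal sketch of why the two "hello" tokens stay separate. This is NOT the plugin's actual code; the `CONCEPT:` prefix is a hypothetical namespacing scheme chosen for illustration, but the principle is the same: concept tokens live in their own namespace, so Bayes counts them independently of the identical body word.

```python
# Sketch (NOT the Concepts plugin's real implementation): concept tokens
# get a distinct prefix, so the concept "hello" and the body word
# "hello" are counted as two separate tokens by the Bayes store.

def body_token(word: str) -> str:
    """Token for a plain word seen in the message body."""
    return word.lower()

def concept_token(concept: str) -> str:
    """Token for a matched concept; the prefix is a hypothetical choice."""
    return "CONCEPT:" + concept.lower()

tokens = [body_token("hello"), concept_token("hello")]
assert tokens[0] != tokens[1]  # the two "hello" tokens never collide
print(tokens)  # ['hello', 'CONCEPT:hello']
```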
The project was done as a proof of concept. My initial testing showed that the concepts idea worked in principle, but that it would take considerable effort to keep up with trends and to maintain matching concepts. Overall I found that common concepts did indeed end up with a 50/50 spam/ham ratio; some of the more specific concepts and combinations, however, were good indicators. We just need more of them.
What would be the right way to go about this:
a) create concepts which match either bad or good emails, or
b) create concepts which match both good and bad emails and rely on Bayes to learn which combinations of concepts are good and which are bad?
Forgot to ask: how did you measure the impact the concepts have? You mentioned a 50/50 ratio...
B is the preferable scenario, though I can see where A would come into play.
With a bit of fudging it is possible to get the tokens and their stats back out of the Bayes DB. The sa-learn CLI tool's regex option doesn't work as I'd expect, so I need to look at getting the stats out properly.
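One way around the CLI regex issue is to dump everything and filter the listing yourself. A hedged sketch, assuming `sa-learn --dump data` emits one token per line in the column order probability, spam count, ham count, atime, token; that layout varies between versions, so check your own output before relying on it:

```python
# Sketch: filter a captured `sa-learn --dump data` listing with a plain
# Python regex instead of sa-learn's own filtering.
# ASSUMPTION: columns are (probability, nspam, nham, atime, token);
# verify against your sa-learn version's actual dump format.
import re

def parse_dump(dump_text: str, pattern: str):
    """Yield (token, nspam, nham) for tokens matching `pattern`."""
    rx = re.compile(pattern)
    for line in dump_text.splitlines():
        parts = line.split()
        if len(parts) != 5:
            continue  # skip header / "magic" lines
        _prob, nspam, nham, _atime, token = parts
        if rx.search(token):
            yield token, int(nspam), int(nham)

sample = ("0.988 42 1 1600000000 newsletter\n"
          "0.500 10 10 1600000000 hello\n")
print(list(parse_dump(sample, "^hel")))  # [('hello', 10, 10)]
```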
So basically an email I received as a reply to a technical support ticket I had opened was tagged with the following concepts:
hello https newsletter re fwd satisfaction ticket good-day stranger time-ref news regards thankyou great
Just from a first glance, I'd expect hello, newsletter, time-ref, news, regards, and thankyou to match HAM as much as SPAM. Isn't this diluting my Bayes DB? Shouldn't the concepts be matched more strictly?
Maybe I'm completely wrong here, but do you have any figures/stats you can share so we can get a better idea of how this is working out for you?
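The dilution worry above can be made concrete with the naive spam-probability ratio: a token seen about equally often in spam and ham sits near 0.5 and contributes almost no evidence either way. The counts below are hypothetical, and SpamAssassin's real Bayes estimator is more elaborate than this plain ratio; this is only a sketch of the intuition.

```python
# Illustration of the "dilution" concern: tokens with near-equal spam
# and ham counts have spamminess ~0.5 and are effectively noise.
# Counts are hypothetical; SpamAssassin's actual estimator differs.

def spamminess(nspam: int, nham: int) -> float:
    """Naive per-token spam probability (not SpamAssassin's formula)."""
    return nspam / (nspam + nham)

counts = {
    "hello":      (50, 50),   # hypothetical: 50/50 -> noise
    "newsletter": (95, 5),    # hypothetical: strongly spammy
    "thankyou":   (48, 52),   # hypothetical: near-neutral
}

for token, (ns, nh) in counts.items():
    p = spamminess(ns, nh)
    tag = "noise" if abs(p - 0.5) < 0.1 else "useful"
    print(f"{token:12s} p={p:.2f} {tag}")
```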