We need some design specs

hmm34 commented 10 years ago

Designing! The most exciting part of the development process - where everything is possible!

We could throw all our implementation ideas in this issue. For diagrams, Gliffy and Lucid Charts are options for collaborating on images, though a free Lucid Charts account only supports limited complexity.

Other suggestions are welcome.

mayaswrath commented 10 years ago

I wouldn't mind Lucid Charts; we use it at work and I should probably learn it at some point. I never really document design, so I don't have any personal experience with any tools.

evanpurkhiser commented 10 years ago

Yeah, we should probably be proactive with this.

This was going to have a lot of design information in it, but I ended up writing more generally... Anyway, here are some ideas that I have. I have some questions interspersed in here, so feedback is welcome:

If we can, I would like to use Ruby >= 2.0. I think OS X ships with maybe 1.8 by default, so if you're on a Mac you'll have to use rbenv or RVM. Alternatively, you may be able to use brew or ports to install the latest version of Ruby.
We should try to standardize on some stylistic conventions. For indentation I would suggest two spaces (I'm pretty sure this is fairly standard for Ruby). Other Ruby-isms will just have to get figured out as we work. I've done some work with Ruby here, so I should at least be able to give some feedback on what is good stylistic practice and what's not.
We're probably going to want to represent email messages in an object-oriented fashion. So we will probably want a class that lets us load an email message from a string (or maybe from a file?) and then retrieve things about it, like the plaintext body, sent dates, and so on. I think this is already done for us.

Checking emails

As we discussed, I think it would be pretty cool to design this part of the application as a plugin-like pipeline for determining the 'probability of spam' (or maybe a spam rank; can we get a 0-1 confidence value?).
The heuristics we can use:
- Bayesian filtering, see "Learning from emails" below
- DKIM signing check
- Region checking, e.g. China = bad (We will probably want to have some settings for what region we're 'in')
- Only HTML content, no plain text?
- Other things?
It would be nice if we could calculate a spam confidence value. Then we just need a threshold (a magic value) above which a message is considered boolean spam (is it or is it not spam).
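
Here is a rough sketch of what such a pipeline could look like. It assumes each heuristic is any object that responds to `call(email)` and returns a score between 0 and 1. The names `SpamPipeline` and `SPAM_THRESHOLD`, the two example checks, and the Hash standing in for an email object are all made up for illustration, and the plain average is only a placeholder for whatever combination rule we settle on:

```ruby
# Hypothetical sketch: each heuristic is a callable returning a score in 0.0..1.0.
class SpamPipeline
  SPAM_THRESHOLD = 0.5 # assumed "magic value"; would need tuning

  def initialize
    @heuristics = []
  end

  # Register a heuristic (lambda, proc, or any object responding to #call).
  # Returns self so calls can be chained.
  def register(heuristic)
    @heuristics << heuristic
    self
  end

  # Average the (clamped) scores of all registered heuristics.
  def confidence(email)
    return 0.0 if @heuristics.empty?

    scores = @heuristics.map do |heuristic|
      score = heuristic.call(email).to_f
      [[score, 0.0].max, 1.0].min
    end
    scores.inject(:+) / scores.size
  end

  # Boolean verdict: is the confidence at or above the threshold?
  def spam?(email)
    confidence(email) >= SPAM_THRESHOLD
  end
end

# Example heuristics; the scores they return are arbitrary placeholders.
html_only = ->(email) { email[:text_body].to_s.empty? && !email[:html_body].to_s.empty? ? 1.0 : 0.0 }
unsigned  = ->(email) { email[:dkim_valid] ? 0.0 : 0.6 }

pipeline = SpamPipeline.new.register(html_only).register(unsigned)
message  = { text_body: "", html_body: "<p>Buy now</p>", dkim_valid: false }

puts pipeline.confidence(message)
puts pipeline.spam?(message)
```

Keeping each check behind the same `call` interface means a new heuristic can be added without touching the others.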

Learning from emails

Correct me if I'm wrong here, but I think the only part of our program that will be learning is going to be the Bayesian portion of our software.
Again, correct me if I'm wrong here, but is "learning" from a single email considered to be an independent operation? In other words, does the learning portion of the software have to know anything about what it's already learned?
Do we only learn from the spam emails, or do we store information about ham as well?
I think we had also discussed that we wanted to have the ability to mark learned emails as either 'global spam' or as 'individual spam'. We probably want to somehow give more weight to the information we store about emails with some kind of user identifier attached. Not sure.
How do we store the learned information in the database?

I'll try and answer some of these questions myself.

hmm34 commented 10 years ago

:8ball: Why is there an 8 ball here?

Also, for general design, here's the basic idea I've got - along with questions. I'm imagining these as 3 separate classes, but if there's a better way to go about it I'm all for it.

A learning interface to add elements to and query the DB (knowledge-base)
- Separate from inference engine so that it can be provided known sets
- If user classifies message as spam/ham, it would go here
- Would the user's classification be marked as higher priority?
Inference engine that uses above to query elements for filtering
- input: full email
- runs Bayesian filtering with email contents
- performs other heuristics here
- output? A simple yes/no spam? If Karat Sleuth determines it to be spam, would its contents also be added to the knowledge base as known spam with a lower confidence?
Controller / plugin entry to grab email (a main.cpp for ruby)
- Use above classes to classify email and add it to knowledge base
- Would it also run our test data set and display confusion matrix?

As a plug-in feature, how would Karat Sleuth know that the user has classified an email as ham/spam? I imagine the email gets resent, thus going back through our filter - but how do we know that it's to be used as training data?

For learning:

Agreed that the Bayesian portion is the only learning part
I don't think the learning part has to know about previously learned items, only where that info is stored so it can add the new data into the knowledge base.
Part of the Bayesian formula uses information about ham messages (frequency of word appearing in ham & spam), so we would need to store ham info too.
As for storing the learned info in a DB... the only thing that comes to mind is a list of words/phrases contained within all emails (which could include items from the header), along with an attribute for number of times it has appeared in spam and number of times it's appeared in ham. I'm not sure how to incorporate global/individual spam into this structure though. We could, if the user marked an email as spam, add the content of it to the database with an additional count of frequency to make it higher priority.
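
To make that structure concrete, here is a minimal sketch with an in-memory Hash standing in for the database. `WordCounts`, `learn`, and `spamicity` are hypothetical names. The estimate uses one common formulation, P(spam | word) = P(word | spam) / (P(word | spam) + P(word | ham)), which assumes equal priors for spam and ham; it is not necessarily the formula we will end up with. Note that `learn` only increments counters, so it never needs to inspect what was learned before:

```ruby
# Hypothetical in-memory stand-in for the knowledge base: per-word spam/ham counts.
class WordCounts
  def initialize
    @counts = Hash.new { |hash, word| hash[word] = { spam: 0, ham: 0 } }
    @messages = { spam: 0, ham: 0 }
  end

  # Record one message's words under :spam or :ham.
  # Each distinct word is counted once per message.
  def learn(words, label)
    @messages[label] += 1
    words.map(&:downcase).uniq.each { |word| @counts[word][label] += 1 }
  end

  # Estimate P(spam | word), assuming equal priors for spam and ham.
  # Returns 0.5 (neutral) for unseen words or when one class has no messages yet.
  def spamicity(word)
    entry = @counts.fetch(word.downcase, nil)
    return 0.5 if entry.nil? || @messages[:spam].zero? || @messages[:ham].zero?

    p_word_spam = entry[:spam].to_f / @messages[:spam]
    p_word_ham  = entry[:ham].to_f / @messages[:ham]
    return 0.5 if (p_word_spam + p_word_ham).zero?

    p_word_spam / (p_word_spam + p_word_ham)
  end
end

kb = WordCounts.new
kb.learn(%w[cheap pills now], :spam)
kb.learn(%w[meeting notes now], :ham)

puts kb.spamicity("pills") # seen only in spam
puts kb.spamicity("now")   # seen equally in both
```

A user-marked message could be given more weight by calling `learn` with a larger increment, though how that should interact with global versus individual spam is still an open question.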

evanpurkhiser commented 10 years ago

Here, have some :cake:

hmm34 commented 10 years ago

Are we specifying that a given item or set of mail is ham/spam on the command line while training, or would that be marked somewhere within the file?

evanpurkhiser commented 10 years ago

This is what I was thinking for our concrete testing sets:

At the root directory of the project we have a directory named training_sets. Then inside of that directory we have two directories spam and ham. I was working on setting up a rake task to download all of our training data and organize it this way in #4.

So we have a few options for command line parsing.

If no arguments are specified, look at the root of the project for the training_sets/{ham,spam} and do processing based on that
Give the program a path to a file or directory of *.msg (or whatever extension) file(s) and then tell it if they are ham or spam. So from the command line this would look like karatslueth /home/evan/ham-message-dir ham (where the second argument tells it that it's ham).
If no second argument is given but a directory path is given as the first argument then we could look to see if the path contains ham / spam folders with emails in them.

I actually really like that third idea, because then we can be more generic about the first case. As in, if no arguments are given then we just default the directory path to $PROJECT_ROOT/training_sets and let it find the ham / spam folders.
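
The three cases could be resolved with something like the following. It's only a sketch: `resolve_training_dirs` and its return shape (a Hash from label to directory) are made up for illustration, and it assumes the label is passed as the literal string ham or spam.

```ruby
# Hypothetical helper: map command-line arguments to { label => directory }.
def resolve_training_dirs(argv, project_root)
  path, label = argv

  # No arguments: fall back to the bundled training sets.
  path ||= File.join(project_root, "training_sets")

  # An explicit label says what the given path contains.
  if label
    unless %w[ham spam].include?(label)
      raise ArgumentError, "label must be 'ham' or 'spam', got #{label.inspect}"
    end
    return { label.to_sym => path }
  end

  # No label: look for ham/ and spam/ subdirectories under the path.
  dirs = {}
  %w[ham spam].each do |name|
    candidate = File.join(path, name)
    dirs[name.to_sym] = candidate if File.directory?(candidate)
  end
  raise ArgumentError, "no ham/ or spam/ directory under #{path}" if dirs.empty?

  dirs
end
```

The entry point would then call something like `resolve_training_dirs(ARGV, project_root)`, so the no-argument case is just the third case with a default path.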

These are just a few neat ideas for later on I guess, we probably don't want to worry too much about the interface right now. For now we can just hardcode it to look at the project directory I think.

evanpurkhiser commented 10 years ago

For database interaction I think Sequel looks really nice. We should check it out. Looks super fun to work with.

I have some ideas floating around in my head about the database structure, but nothing concrete just yet. When I get a chance to organize these thoughts I'll do a brain dump here =P

evanpurkhiser commented 10 years ago

Just had the realization we will probably need some kind of HTML parsing library for HTML emails.

<a href="some-url" someAttribute someOtherAttribute>actual link text</a>

Obviously 'someAttribute', 'someOtherAttribute', and 'href' aren't actually words we want to store into our database.

Maybe we can find a library that will strip all non-text nodes from the HTML version of the email and just leave us with plain text.
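
Until we pick a real parser, a crude regex-based stand-in shows the result we are after. This is not robust against malformed markup and is not a substitute for a proper HTML library; `strip_html` is a hypothetical helper name, and the only library call is `CGI.unescapeHTML` from the standard library:

```ruby
require "cgi"

# Crude illustration only: drop tags (and their attributes) and keep text nodes.
# A real HTML parser would be more reliable on malformed markup.
def strip_html(html)
  text = html.gsub(%r{<(script|style)\b.*?</\1>}mi, " ") # drop script/style blocks entirely
  text = text.gsub(/<[^>]*>/, " ")                        # drop remaining tags
  CGI.unescapeHTML(text).split.join(" ")                  # decode entities, squeeze whitespace
end

puts strip_html('<a href="some-url" someAttribute someOtherAttribute>actual link text</a>')
# prints "actual link text"
```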

hmm34 commented 10 years ago

True. On another note, if we did leave those items in the database without stripping them, they should even out, appearing about equally often in the spam and ham sets, which would make their confidence level neutral in either direction. But... you're right, the attributes/href/other HTML stuff would be redundant info that we wouldn't use for filtering.

evanpurkhiser commented 10 years ago

Ok. So what kind of words are we ignoring? Things like 'the', 'and', 'is' make sense to ignore. (What are these called? Can we find a big list of them?)

hmm34 commented 10 years ago

This is the best that I could find in terms of a basic text list. From the English Function word set, we'll probably want the auxiliary verbs, determiners, conjunctions, and prepositions. We could easily add on the articles 'a', 'an', 'the'. I know there are still words missing from here that we'd want to ignore, but I'm not sure what to call them.
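
Words like these are usually called stop words. Filtering them out before storing anything could look roughly like this; the list here is a tiny illustrative sample rather than the real set, and `STOP_WORDS` and `significant_words` are hypothetical names:

```ruby
require "set"

# Tiny illustrative sample; a real list would be much longer.
STOP_WORDS = Set.new(%w[a an the and or but is are was of in on to for with]).freeze

# Split text into lowercase word tokens and drop the ignored ones.
def significant_words(text)
  text.downcase.scan(/[a-z0-9']+/).reject { |word| STOP_WORDS.include?(word) }
end

p significant_words("The winner is YOU and the prize is free")
# prints ["winner", "you", "prize", "free"]
```

Using a Set keeps the lookup fast even once the list grows to a few hundred words.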

hmm34 commented 10 years ago

On another note, here's a gem for Bayesian filtering that we might be able to use...

evanpurkhiser commented 10 years ago

I think we've concluded: The answer here is "yes" :neutral_face:

evanpurkhiser / CS-Karat-Sleuth