evanpurkhiser / CS-Karat-Sleuth

A simplistic spam heuristics tool written in the Ruby programming language – Fall 2013 AI
MIT License

We need some design specs #3

Closed hmm34 closed 10 years ago

hmm34 commented 10 years ago

Designing! The most exciting part of the development process - where everything is possible!

We could throw all our implementation ideas into this issue. For diagrams, Gliffy and Lucid Charts are options for collaborating on images, though a free Lucid Charts account only supports limited complexity.

Other suggestions are welcome.

mayaswrath commented 10 years ago

I wouldn't mind Lucid Charts; we use it at work and I should probably learn it at some point. I never really document design, so I don't have any personal experience with any tools.

evanpurkhiser commented 10 years ago

Yeah, we should probably be proactive with this.

This was going to have a lot of design information in it, but I ended up writing more generally... Anyway, here are some ideas that I have. I have some questions interspersed in here, so feedback is welcome:

Checking emails

Learning from emails

I'll try and answer some of these questions myself.

hmm34 commented 10 years ago

:8ball: Why is there an 8 ball here?

Also, for general design, here's the basic idea I've got - along with questions. I'm imagining these as 3 separate classes, but if there's a better way to go about it I'm all for it.

As a plug-in feature, how would Karat Sleuth know that the user has classified an email as ham/spam? I imagine the email gets resent, thus going back through our filter - but how do we know that it's to be used as training data?

For learning:

evanpurkhiser commented 10 years ago

Here, have some :cake:

hmm34 commented 10 years ago

Are we specifying that a given item or set of mail is ham/spam on the command line while training, or would that be marked somewhere within the file?

evanpurkhiser commented 10 years ago

This is what I was thinking for our concrete testing sets:

At the root directory of the project we have a directory named training_sets. Inside that directory we have two directories, spam and ham. I was working on setting up a rake task in #4 to download all of our training data and organize it this way.
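As a rough sketch of how that layout could be consumed, assuming only the training_sets/spam and training_sets/ham structure described above: the helper name `labeled_training_files` is hypothetical and is not something the project defines.

```ruby
# Hypothetical helper: collect the files under <root>/spam and <root>/ham,
# keyed by label. Only the directory layout comes from the discussion above.
def labeled_training_files(root)
  %w[spam ham].each_with_object({}) do |label, files|
    pattern = File.join(root, label, '**', '*')
    # Keep regular files only, so nested directories are skipped.
    files[label.to_sym] = Dir.glob(pattern).select { |path| File.file?(path) }.sort
  end
end
```

Calling `labeled_training_files('training_sets')` would return a hash with `:spam` and `:ham` keys, each holding a sorted list of file paths.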

So we have a few options for command line parsing.

I actually really like that third idea, because then we can be more generic about the first case. As in, if no arguments are given then we just default the directory path to $PROJECT_ROOT/training_sets and let it find the ham / spam folders.

These are just a few neat ideas for later on, I guess. We probably don't want to worry too much about the interface right now. For now I think we can just hardcode it to look at the project directory.
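The "default to the project's training_sets directory when no arguments are given" idea could look something like this. It is a minimal sketch: the method name `training_root` and the `project_root:` keyword are made up for illustration, and the current working directory stands in for $PROJECT_ROOT.

```ruby
# Hypothetical sketch: use the first command-line argument as the training
# directory, falling back to <project_root>/training_sets when none is given.
def training_root(argv, project_root: Dir.pwd)
  path = argv.first || File.join(project_root, 'training_sets')
  File.expand_path(path)
end
```

A script would call it as `training_root(ARGV)`, so hardcoding the project directory now and adding arguments later use the same code path.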

evanpurkhiser commented 10 years ago

For database interaction I think Sequel looks really nice. We should check it out. Looks super fun to work with.

I have some ideas floating around in my head about the database structure, but nothing concrete just yet. When I get a chance to organize these thoughts I'll do a brain dump here =P

evanpurkhiser commented 10 years ago

Just had the realization that we will probably need some kind of HTML parsing library for HTML emails. For example:

<a href="some-url" someAttribute someOtherAttribute>actual link text</a>

Obviously 'someAttribute', 'someOtherAttribute', and 'href' aren't actually words we want to store into our database.

Maybe we can find a library that will strip all non-text nodes from the HTML version of the email and just leave us with plain text.
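Until a library is picked, here is a deliberately naive, regex-based sketch of the idea using only core Ruby. It is not a real HTML parser and is only meant to show the intended result; the method name `naive_plain_text` is hypothetical, and a proper parsing library would handle malformed markup far better.

```ruby
# Naive illustration only: drop script/style blocks, then remove every tag
# (including its attributes) and collapse the remaining whitespace.
# HTML entities such as &amp; are left undecoded.
def naive_plain_text(html)
  text = html.gsub(/<(script|style)\b.*?<\/\1>/mi, ' ')
  text = text.gsub(/<[^>]*>/, ' ')
  text.split.join(' ')
end
```

Run on the anchor tag from the example above, this leaves only `actual link text`, so 'href' and the attribute names never reach the database.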

hmm34 commented 10 years ago

True. On another note, if we left those items in the database without stripping them, they should even out, with a roughly equal number of uses in the spam and ham sets, making the confidence level neutral in either direction. But you're right: the attributes, href, and other HTML stuff would be redundant info that we wouldn't use for filtering.
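To make the "neutral in either direction" point concrete, here is a sketch of one common per-word score. The thread does not settle on a formula, so this is an assumption: it compares the word's frequency in each set and treats spam and ham as equally likely beforehand. The method name `spamicity` and its parameters are illustrative only.

```ruby
# Assumed scoring: compare how often a word appears per spam message with
# how often it appears per ham message. 0.5 means the word is neutral.
def spamicity(spam_count, ham_count, spam_total, ham_total)
  spam_rate = spam_count.to_f / spam_total
  ham_rate = ham_count.to_f / ham_total
  return 0.5 if (spam_rate + ham_rate).zero? # unseen word: no evidence
  spam_rate / (spam_rate + ham_rate)
end
```

A token like 'href' that shows up in 40 of 100 spam emails and 40 of 100 ham emails scores 0.5, while a word seen in 90 spam emails and 10 ham emails scores well above it.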

evanpurkhiser commented 10 years ago

Ok. So what kind of words are we ignoring? Things like 'the', 'and', 'is' make sense to ignore. (What are these called? Can we find a big list of them?)

hmm34 commented 10 years ago

This is the best that I could find in terms of a basic text list. From the English function word set, we'll probably want the auxiliary verbs, determiners, conjunctions, and prepositions. We could easily add the articles 'a', 'an', and 'the'. I know there are still words missing from here that we'd want to ignore, but I'm not sure what to call them.
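Words like these are commonly called stop words. A minimal sketch of filtering them out before storing anything might look like the following; the short `STOP_WORDS` list is a placeholder rather than the list linked above, and `significant_words` is a hypothetical name.

```ruby
require 'set'

# Placeholder list for illustration; the real list would come from the
# function-word categories mentioned above.
STOP_WORDS = Set.new(%w[a an the and or but is are was of in on to]).freeze

# Lowercase the text, split it into simple word tokens, and drop stop words.
def significant_words(text)
  text.downcase.scan(/[a-z']+/).reject { |word| STOP_WORDS.include?(word) }
end
```

With this list, `significant_words('The offer is free and the prize is yours')` keeps only 'offer', 'free', 'prize', and 'yours'.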

hmm34 commented 10 years ago

On another note, here's a gem for Bayesian filtering that we might be able to use...

evanpurkhiser commented 10 years ago

I think we've concluded: The answer here is "yes" :neutral_face: