evanpurkhiser / CS-Karat-Sleuth

A simplistic spam heuristics tool written in the Ruby programming language – Fall 2013 AI
MIT License

We need spam/ham data sets! #1

Closed evanpurkhiser closed 10 years ago

evanpurkhiser commented 10 years ago

We are going to want as many spam and ham (not spam) emails as we can get our hands on to train our program.

Post any links you can find of spam/ham data sets in this issue.

We're looking for any valid plain-text email source, with all headers and the body intact.

We can close this issue once we have enough high-quality sets. We will probably want to organize the messages into spam and ham folders, with each message in its own file. If they aren't already in a format like that, it should be trivial to split them apart and put them in folders.
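A rough sketch of that splitting step, in case a data set arrives as one big file. It assumes the source is mbox-style text where every message begins with a `From ` separator line (real sets may use a different layout), and the helper names `split_mbox` and `store_messages`, the `.eml` extension, and the numbered file names are all made up for illustration.

```ruby
require "fileutils"

# Split mbox-style text into raw messages.
# Assumption: each message starts with a "From " separator at the start of a
# line, and body lines that begin with "From " are already escaped (">From ").
def split_mbox(text)
  text.split(/^(?=From )/).reject { |message| message.strip.empty? }
end

# Write each message to its own file under <root>/<label>/,
# e.g. data/spam/00001.eml. Returns the list of paths written.
# The directory layout and file naming are hypothetical.
def store_messages(messages, root, label)
  dir = File.join(root, label)
  FileUtils.mkdir_p(dir)
  messages.each_with_index.map do |message, index|
    path = File.join(dir, format("%05d.eml", index + 1))
    File.write(path, message)
    path
  end
end
```

Something like `store_messages(split_mbox(File.read("spam.mbox")), "data", "spam")` would then fill `data/spam/`, where `spam.mbox` is a placeholder file name. Headers and body are written out untouched, so the messages stay valid plain-text email.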

evanpurkhiser commented 10 years ago

Here's some from the CSMING group. Thanks @hmm34.

hmm34 commented 10 years ago

SpamAssassin also has a nice set.

evanpurkhiser commented 10 years ago

So should we consider the CSMING group of data 'easy' or 'hard' spam?

Also, I sort of feel like we should just discard the data set in there that has unidentified emails. Somewhat useless to us.

hmm34 commented 10 years ago

We could still use the content for the Bayesian filtering, though I've no problem with cutting down the training data, even to the point of dropping the 'hard' ones and combining the 'easy' set with CSMING. Or we could just join the easy and hard sets together. Going through all the CSMING data and manually classifying it seems arduous.
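For reference, a minimal sketch of the kind of Bayesian filtering meant here. It is not the project's implementation. It assumes labelled training messages, a crude regex tokenizer, add-one smoothing, and scoring only on tokens present in the message; the class name `NaiveBayesSketch` and its methods are hypothetical.

```ruby
# Illustrative naive Bayes spam scorer: counts, per class, how many training
# messages contain each token, then sums smoothed log-likelihood ratios.
class NaiveBayesSketch
  def initialize
    @token_counts = { spam: Hash.new(0), ham: Hash.new(0) }
    @message_counts = { spam: 0, ham: 0 }
  end

  # Very rough tokenizer (an assumption): lowercase runs of letters,
  # digits, "$" and "'".
  def tokenize(text)
    text.downcase.scan(/[a-z0-9$']+/).uniq
  end

  # label is :spam or :ham
  def train(label, text)
    @message_counts[label] += 1
    tokenize(text).each { |token| @token_counts[label][token] += 1 }
  end

  # Log-odds that the text is spam: positive leans spam, negative leans ham.
  def spam_log_odds(text)
    spam_total = @message_counts[:spam]
    ham_total = @message_counts[:ham]
    # Smoothed prior from the class sizes.
    score = Math.log(spam_total + 1.0) - Math.log(ham_total + 1.0)
    tokenize(text).each do |token|
      # Add-one smoothing keeps unseen tokens from producing log(0).
      p_spam = (@token_counts[:spam][token] + 1.0) / (spam_total + 2.0)
      p_ham = (@token_counts[:ham][token] + 1.0) / (ham_total + 2.0)
      score += Math.log(p_spam) - Math.log(p_ham)
    end
    score
  end
end
```

Since `train` needs a label for every message, unclassified mail cannot feed this particular sketch without being sorted first.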

evanpurkhiser commented 10 years ago

We have ~5.3k spam and almost 8k ham messages to train from... Unless that's really not enough, I think we should be a-ok here.