Closed evanpurkhiser closed 10 years ago
Here are some from the CSMING group. Thanks @hmm34.
So should we consider the CSMING group of data 'easy' or 'hard' spam?
Also, I sort of feel like we should just discard the data set in there that has unidentified emails. Somewhat useless to us.
We could still use the content for the Bayesian filtering, though I have no problem with shrinking the training data, to the point where I wouldn't mind dropping the 'hard' ones and combining the 'easy' set with CSMING. Or we could just join the easy and hard sets together. Going through all the CSMING data and manually classifying it seems arduous.
We have ~5.3k spam and almost 8k ham messages to train from... Unless that's really not enough, I think we should be a-ok here.
We are going to want as many spam and ham (not spam) emails as we can get our hands on to train our program.
Post any links you can find of spam/ham data sets in this issue.
We're looking for any valid plain-text email source. That means all headers and the body intact.
We can close this issue once we have enough high quality sets. We will probably want to organize the messages into `spam` and `ham` folders, each message being a single file. If they aren't already in a format like that it should be trivial to parse them apart and stick them in folders.
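
For what it's worth, here is a rough sketch of the "parse them apart" step, assuming a data set arrives as a single mbox file (not every set will). The function name `split_mbox`, the zero-padded `.eml` file names, and the output layout are just placeholders I made up for illustration. It only uses Python's standard `mailbox` module.

```python
import mailbox
from pathlib import Path


def split_mbox(mbox_path, out_dir, label):
    """Write each message of an mbox file to out_dir/<label>/NNNNN.eml.

    `label` is expected to be "spam" or "ham" (hypothetical folder names
    matching the layout proposed in this issue). Returns the number of
    messages written.
    """
    if label not in ("spam", "ham"):
        raise ValueError("label must be 'spam' or 'ham'")

    target = Path(out_dir) / label
    target.mkdir(parents=True, exist_ok=True)

    # create=False: fail loudly instead of silently creating an empty mbox.
    box = mailbox.mbox(str(mbox_path), create=False)
    written = 0
    try:
        for written, message in enumerate(box, start=1):
            # as_bytes() serializes headers and body; the mbox "From "
            # separator line is not included by default.
            # Note: existing files with the same number are overwritten.
            (target / f"{written:05d}.eml").write_bytes(message.as_bytes())
    finally:
        box.close()
    return written
```

Usage would be something like `split_mbox("some-set.mbox", "data", "spam")`, run once per source file. Sets that ship as one file per message already would only need moving into the right folder.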