I. Introduction
File Name: Email Classifier.R
Due: 02/28/14 at 5:00 p.m.
Author: Tyler Soellinger
RE: Replicate the Spam versus Ham Classifier Example from Chapter 3 of Machine Learning for Hackers
Data Used: Email messages contained in data/ directory, source: http://spamassassin.apache.org/publiccorpus
II. Code
1. Load Libraries
2. Set Global Paths For Email Archives
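A minimal sketch of steps 1 and 2 follows; the package names match those used for the book's example, and the sub-directory names under data/ are assumptions about how the SpamAssassin archives were unpacked.

library(tm)       # text mining: Corpus, VectorSource, TermDocumentMatrix
library(ggplot2)  # plotting

spam.path     <- file.path("data", "spam")
spam2.path    <- file.path("data", "spam_2")
easyham.path  <- file.path("data", "easy_ham")
easyham2.path <- file.path("data", "easy_ham_2")
hardham.path  <- file.path("data", "hard_ham")
hardham2.path <- file.path("data", "hard_ham_2")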
3. Create Motivating Plot
4. Return A Single Element Vector Of Just The Email Body With Words As Features
The Message Always Begins After The First Full Line Break
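A sketch of such a helper, with the illustrative name get.msg: it reads a message file and keeps only the lines after the first empty line.

get.msg <- function(path) {
  # Read the raw message; the latin1 encoding is an assumption about the corpus
  con <- file(path, open = "rt", encoding = "latin1")
  text <- readLines(con)
  close(con)
  # The body is everything after the first blank line
  msg <- text[seq(which(text == "")[1] + 1, length(text), 1)]
  paste(msg, collapse = "\n")
}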
5. Create A TermDocumentMatrix (TDM) From The Corpus Of SPAM Email
This TDM Creates The Feature Set Used To Train Our Classifier
This Function Takes A File Path To An Email And A Search Term (A String), And Returns The Count Of That Term In The Email Body
When The Term Is Not Found, ifelse() Is Used To Return 0 Instead Of NA
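A sketch of the two helpers described in this step, with illustrative names get.tdm and count.word; the tm control options follow the book's example (the book additionally sets a minimum document frequency) and assume a tm version that accepts them.

tdm.control <- list(stopwords = TRUE,
                    removePunctuation = TRUE,
                    removeNumbers = TRUE)

get.tdm <- function(doc.vec) {
  # Build a corpus from a character vector and convert it to a TDM
  doc.corpus <- Corpus(VectorSource(doc.vec))
  TermDocumentMatrix(doc.corpus, control = tdm.control)
}

count.word <- function(path, term) {
  # Count how often a single term occurs in one email body
  word.freq <- rowSums(as.matrix(get.tdm(get.msg(path))))
  term.freq <- word.freq[which(names(word.freq) == term)]
  # ifelse() returns 0 when the term is absent, instead of NA
  ifelse(length(term.freq) > 0, term.freq, 0)
}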
6. This Does The Heavy Lifting For Classifying Email By Taking Two Required Parameters: A File Path To An Email To Classify, And A Data Frame Of The Trained Data. It Also Takes Two Optional Parameters: A Prior Probability That An Email Is SPAM, Set To 0.5, And A Constant Value For The Probability Of Words In The Email That Are Not In Our Training Data
This Function Returns The Naive Bayes Probability That The Given Email Is SPAM
Get The Email Text In A Workable Format
Find Intersection Of Words
Perform The Naive Bayes Calculation
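A sketch of the classifier described in this step; the name classify.email and the training data frame columns term and occurrence are assumptions about the training data built in step 7.

classify.email <- function(path, training.df, prior = 0.5, c = 1e-6) {
  # Get the email text in a workable format
  msg.freq <- rowSums(as.matrix(get.tdm(get.msg(path))))
  # Find the intersection of the email's words and the training terms
  msg.match <- intersect(names(msg.freq), training.df$term)
  # Naive Bayes: prior times the product of per-term occurrence rates,
  # with the constant c standing in for terms not seen in training
  if (length(msg.match) < 1) {
    return(prior * c ^ length(msg.freq))
  }
  match.probs <- training.df$occurrence[match(msg.match, training.df$term)]
  prior * prod(match.probs) * c ^ (length(msg.freq) - length(msg.match))
}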
7. Perform The Classifications
(1) Create A Document Corpus For Spam Messages
Place All The SPAM-y Email Into A Single Vector
Create A TDM For That Vector
Create A Data Frame With The Feature Set From The Training SPAM Data
Add The Term Density And Occurrence Rate
(2) Repeat For EasyHam Email
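A sketch of the SPAM training step; the filter on the "cmds" file follows the book's example, and the EASYHAM version is identical with easyham.path substituted.

spam.docs <- dir(spam.path)
spam.docs <- spam.docs[which(spam.docs != "cmds")]
# Place all the SPAM-y email into a single vector
all.spam <- sapply(spam.docs, function(p) get.msg(file.path(spam.path, p)))

spam.matrix <- as.matrix(get.tdm(all.spam))
spam.counts <- rowSums(spam.matrix)

# Feature set from the training SPAM data
spam.df <- data.frame(term = names(spam.counts),
                      frequency = as.numeric(spam.counts),
                      stringsAsFactors = FALSE)
# Density: share of all term occurrences; occurrence: share of SPAM documents containing the term
spam.df$density    <- spam.df$frequency / sum(spam.df$frequency)
spam.df$occurrence <- sapply(1:nrow(spam.matrix),
                             function(i) length(which(spam.matrix[i, ] > 0)) / ncol(spam.matrix))
# (2) The same steps over easyham.path produce easyham.docs and easyham.df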
8. Run Classifier Against Hard Ham
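A sketch of this step, assuming hardham.path from the setup and classify.email from step 6; each HARDHAM message is scored against the SPAM training data.

hardham.docs <- dir(hardham.path)
hardham.docs <- hardham.docs[which(hardham.docs != "cmds")]
hardham.spamtest <- sapply(hardham.docs,
                           function(p) classify.email(file.path(hardham.path, p),
                                                      training.df = spam.df))
summary(hardham.spamtest)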
9. Find Counts Of Just Terms 'html' And 'table' In All SPAM And EASYHAM Docs, And Create Figure
Plot 1
Plot 2
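A sketch of this step, reusing count.word from step 5 and assuming easyham.docs was built in step 7 in the same way as spam.docs; the counts feed the two plots contrasting SPAM and EASYHAM.

spam.html  <- sapply(spam.docs, function(p) count.word(file.path(spam.path, p), "html"))
spam.table <- sapply(spam.docs, function(p) count.word(file.path(spam.path, p), "table"))
ham.html   <- sapply(easyham.docs, function(p) count.word(file.path(easyham.path, p), "html"))
ham.table  <- sapply(easyham.docs, function(p) count.word(file.path(easyham.path, p), "table"))

html.table.df <- rbind(data.frame(html = spam.html, table = spam.table, type = "SPAM"),
                       data.frame(html = ham.html,  table = ham.table,  type = "EASYHAM"))
# One possible figure: counts of 'html' against 'table', split by message type
ggplot(html.table.df, aes(x = html, y = table)) + geom_point(aes(shape = type))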
10. Classify HARDHAM Data Using The Classifier Developed Above
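One way to express the decision rule for this step, with the illustrative name spam.classifier and assuming easyham.df from step 7.

spam.classifier <- function(path) {
  pr.spam <- classify.email(path, spam.df)
  pr.ham  <- classify.email(path, easyham.df)
  # Label as SPAM (1) when the SPAM probability exceeds the HAM probability
  c(pr.spam, pr.ham, ifelse(pr.spam > pr.ham, 1, 0))
}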
11. Get Lists Of All The Email Messages
Classify All Of Them
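A sketch of this step, assuming the hold-out directories (spam_2, easy_ham_2, hard_ham_2) from the setup; every message is scored with spam.classifier.

spam2.docs    <- dir(spam2.path)[dir(spam2.path) != "cmds"]
easyham2.docs <- dir(easyham2.path)[dir(easyham2.path) != "cmds"]
hardham2.docs <- dir(hardham2.path)[dir(hardham2.path) != "cmds"]

spam2.class    <- sapply(spam2.docs,
                         function(p) spam.classifier(file.path(spam2.path, p)))
easyham2.class <- sapply(easyham2.docs,
                         function(p) spam.classifier(file.path(easyham2.path, p)))
hardham2.class <- sapply(hardham2.docs,
                         function(p) spam.classifier(file.path(hardham2.path, p)))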
12. Create A Single, Final Data Frame With All Classifications Of Data
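A sketch of how the per-directory results might be combined; make.class.df and the column names Pr.SPAM, Pr.HAM, Class, and Type are illustrative.

# Each matrix above has one column per message and three rows:
# Pr(SPAM), Pr(HAM), and the 0/1 class label
make.class.df <- function(class.matrix, type) {
  df <- data.frame(t(class.matrix), stringsAsFactors = FALSE)
  names(df) <- c("Pr.SPAM", "Pr.HAM", "Class")
  df$Type <- type
  df
}
class.df <- rbind(make.class.df(spam2.class,    "SPAM"),
                  make.class.df(easyham2.class, "EASYHAM"),
                  make.class.df(hardham2.class, "HARDHAM"))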
13. Create Final Plot Of Results
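A sketch of one possible results plot, using the column names assumed in step 12: the two log-probabilities plotted against each other, with point shape marking the true message type.

ggplot(class.df, aes(x = log(Pr.HAM), y = log(Pr.SPAM))) +
  geom_point(aes(shape = Type)) +
  xlab("log[Pr(HAM)]") +
  ylab("log[Pr(SPAM)]")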
14. Save Results As A Table
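One way to save the results table; the output file name is an assumption.

write.table(class.df, file = "email_classification_results.csv",
            sep = ",", row.names = FALSE)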
15. Result Outputs
16. Save The Training Data For Potential Future Use
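A sketch of saving the two training data frames; the output file names are assumptions.

write.csv(spam.df,    file = "spam_training.csv",    row.names = FALSE)
write.csv(easyham.df, file = "easyham_training.csv", row.names = FALSE)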
END