Colin-Codes / IntentClassifier-ML-Project

Python, Keras, scikit-learn, Matplotlib: machine learning research project on classifying the intent behind tech support emails in order to enable automatic follow-up.

Filtering the data #20

Closed Colin-Codes closed 5 years ago

Colin-Codes commented 5 years ago

The anonymisation could be useful for the neural network anyway, as the placeholders should help it to generalise.

Colin-Codes commented 5 years ago

Metadata filtering

Filter out all emails not sent or CC'd to OSS Support or to the chatbot

Filter emails from OSS team members

(training only) No replies

Filter any default error messages
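As a rough sketch, the metadata rules above might look like this in Python. The `Email` fields, `SUPPORT_MAILBOXES`, `TEAM_MEMBERS`, `keep_email` and the `for_training` flag are illustrative names, not the project's actual schema. The mailbox names are assumed from the To: values in the query later in this thread. Default error messages are not handled here.

```python
from dataclasses import dataclass

# Assumed mailbox display names, taken from the To: values seen in this thread.
SUPPORT_MAILBOXES = {"oss support", "easiadmin"}
# Hypothetical placeholder; the real OSS team member names would go here.
TEAM_MEMBERS = {"example team member"}


@dataclass
class Email:
    sender: str
    to: str       # semicolon-separated display names
    cc: str       # semicolon-separated display names
    subject: str
    body: str


def recipients(field):
    """Split a semicolon-separated recipient field into lower-cased names."""
    return {name.strip().lower() for name in field.split(";") if name.strip()}


def keep_email(email, for_training=True):
    """Return True if the email passes the metadata filters."""
    addressed = recipients(email.to) | recipients(email.cc)
    if not addressed & SUPPORT_MAILBOXES:
        return False  # not sent or CC'd to a support mailbox
    if email.sender.strip().lower() in TEAM_MEMBERS:
        return False  # sent by an OSS team member
    if for_training and email.subject.strip().lower().startswith("re:"):
        return False  # replies are dropped for training only
    return True
```

Matching on individual recipient names rather than on the whole To: string would avoid listing every combination of recipients. The reply check here is prefix-based and case-insensitive, which is a deliberate simplification.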

Formatting

Filter out whitespace, join the subject to the start of the body

Filter out junky signature information

(training only) Emails forwarded for action: remove the header

Filter out everything after From:
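The formatting steps above could be sketched in Python as follows. `format_email` and `SIGNATURE_LINES` are hypothetical names; the two signature strings are examples that also appear in the query later in this thread. Stripping the header from forwarded emails is not covered.

```python
import re

# Example signature boilerplate; a real list would be built from the mailbox data.
SIGNATURE_LINES = ("Get Outlook for iOS", "Sent from my Samsung Galaxy smartphone.")


def format_email(subject, body):
    """Join subject and body, keep only the latest message, tidy whitespace."""
    text = f"{subject} {body}"
    # Everything after the first "From:" is assumed to be quoted or forwarded history.
    text = text.split("From:", 1)[0]
    for line in SIGNATURE_LINES:
        text = text.replace(line, "")
    # Collapse runs of whitespace (including newlines) into single spaces.
    return re.sub(r"\s+", " ", text).strip()
```

Collapsing whitespace to a single space, rather than deleting it, keeps words on adjacent lines from being fused together.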

Word filtering

FullName filtering

First or last name filtering (be wary of names such as EasiAdmin and OSS Support, which could cause useful words to be filtered out, and of certain surnames that are unlikely to appear alone anyway)

Punctuation filtering

Email address filtering

Number filtering

Company filtering?

Greeting filtering

Copy the query from PowerQuery to get the exact rules.
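A minimal sketch of the word filtering in Python, replacing matches with placeholders rather than deleting them, in line with the anonymisation idea at the top of this thread. The placeholder tokens (`_EMAIL_`, `_NAME_`, `_NUM_`), the regular expressions and the names in `FULL_NAMES` are all assumptions for illustration. First or last name filtering and company filtering are left out.

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
NUMBER_RE = re.compile(r"\d+")
GREETING_RE = re.compile(r"^(?:hi|hiya|hello)(?:\s+(?:team|all|guys))?\b", re.IGNORECASE)
PUNCT_RE = re.compile(r"[^\w\s]")

# Hypothetical full names; mailbox-like names are protected from filtering.
FULL_NAMES = ("Jane Doe",)
PROTECTED = ("EasiAdmin", "OSS Support")


def filter_words(text):
    """Replace emails, full names and numbers with placeholders, then strip noise."""
    # Emails first, because they contain digits and punctuation.
    text = EMAIL_RE.sub(" _EMAIL_ ", text)
    for name in FULL_NAMES:
        if name not in PROTECTED:
            text = re.sub(re.escape(name), " _NAME_ ", text, flags=re.IGNORECASE)
    text = NUMBER_RE.sub(" _NUM_ ", text)
    # Only a greeting at the very start of the text is removed.
    text = GREETING_RE.sub("", text.strip())
    # Underscores count as word characters, so the placeholders survive this step.
    text = PUNCT_RE.sub("", text)
    return " ".join(text.split())
```

For example, `filter_words("Hi Team, please email jane.doe@example.com about order 12345. Thanks, Jane Doe")` returns `"please email _EMAIL_ about order _NUM_ Thanks _NAME_"`.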

Colin-Codes commented 5 years ago

let

Source = Excel.CurrentWorkbook(){[Name="Emails"]}[Content],

#"Changed Type" = Table.TransformColumnTypes(Source,{{"Subject", type text}, {"Body", type text}, {"From: (Name)", type text}, {"To: (Name)", type text}, {"CC: (Name)", type text}}),

#"Filtered Rows" = Table.SelectRows(#"Changed Type", each ([#"To: (Name)"] = "1.Origin IT Support;OSS Support" or [#"To: (Name)"] = "Accounts Receivable (Origin UK);OSS Support" or [#"To: (Name)"] = "Alexandra Giles;James Woodhead;Joanna Fardon;Nabil Awan;OSS Support;Rob Sturgeon;Marketing" or [#"To: (Name)"] = "Alexandra Giles;Nabil Awan;OSS Support;Marketing" or [#"To: (Name)"] = "Alexandra Giles;OSS Support" or [#"To: (Name)"] = "Ania Richards;EasiAdmin" or [#"To: (Name)"] = "Ania Richards;OSS Support" or [#"To: (Name)"] = "Ben Luff;OSS Support" or [#"To: (Name)"] = "Channa Duminda;Chris Miller;OSS Support" or [#"To: (Name)"] = "Channa Duminda;OSS Support" or [#"To: (Name)"] = "Chloe Cleere;OSS Support" or [#"To: (Name)"] = "Chris Miller;1.Origin IT Support;OSS Support" or [#"To: (Name)"] = "Chris Miller;Nabil Awan;OSS Support" or [#"To: (Name)"] = "Chris Miller;OSS Support" or [#"To: (Name)"] = "Chris Page;Nabil Awan;OSS Support" or [#"To: (Name)"] = "Claire Cunnick;OSS Support" or [#"To: (Name)"] = "Claire Cunnick;OSS Support;Chris Page" or [#"To: (Name)"] = "Claire Cunnick;OSS Support;Rob Sturgeon;Channa Duminda" or [#"To: (Name)"] = "Claire Cunnick;Pavla Mikulikova;OSS Support" or [#"To: (Name)"] = "Colin Younge;OSS Support" or [#"To: (Name)"] = "EasiAdmin" or [#"To: (Name)"] = "Elite;James Harley;OSS Support" or [#"To: (Name)"] = "Elite;OSS Support" or [#"To: (Name)"] = "Elite;OSS Support;Ania Richards;Louise Collis;OSS Support" or [#"To: (Name)"] = "Elite;OSS Support;Windows Production" or [#"To: (Name)"] = "Elite;Ricky Tailor;OSS Support;Tapiwa Tutisani" or [#"To: (Name)"] = "Elite;Robert Bruce;OSS Support" or [#"To: (Name)"] = "Elite;Solutions;Windows Production;OSS Support" or [#"To: (Name)"] = "Elite;Windows Production;OSS Support;Solutions" or [#"To: (Name)"] = "Hannah Price;Nabil Awan;OSS Support" or [#"To: (Name)"] = "Hannah Price;OSS Support" or [#"To: (Name)"] = "James Harley;Elite;OSS Support" or [#"To: (Name)"] = "James Harley;OSS Support" or 
[#"To: (Name)"] = "James Harley;OSS Support;Tapiwa Tutisani" or [#"To: (Name)"] = "James Woodhead;Joanna Fardon;Nabil Awan;OSS Support;Rob Sturgeon;Marketing" or [#"To: (Name)"] = "James Woodhead;OSS Support;Rob Sturgeon;Marketing" or [#"To: (Name)"] = "James Woodhead;Purchasing Team;Alexandra Giles;Joanna Fardon;Nabil Awan;OSS Support;Rob Sturgeon;Marketing" or [#"To: (Name)"] = "Joanna Fardon;EasiAdmin" or [#"To: (Name)"] = "Joanna Fardon;James Woodhead;OSS Support;Rob Sturgeon;Marketing" or [#"To: (Name)"] = "Joanna Fardon;James Woodhead;OSS Support;Rob Sturgeon;Marketing;Purchasing Team" or [#"To: (Name)"] = "Joanna Fardon;Nabil Awan;OSS Support;Rob Sturgeon;Marketing" or [#"To: (Name)"] = "Joe Pearcy;OSS Support" or [#"To: (Name)"] = "Katie Panonko;OSS Support" or [#"To: (Name)"] = "Katie Panonko;OSS Support;Louise Collis" or [#"To: (Name)"] = "Lauren Britnell;OSS Support" or [#"To: (Name)"] = "Lisa Wilkins;OSS Support" or [#"To: (Name)"] = "Logistics;EasiAdmin" or [#"To: (Name)"] = "Louise Collis;Chris Page;Nabil Awan;OSS Support;Ania Richards;Mike Mounter" or [#"To: (Name)"] = "Luke Richardson;OSS Support" or [#"To: (Name)"] = "Marcus Burnap;EasiAdmin" or [#"To: (Name)"] = "Marcus Burnap;EasiAdmin;Carolyn Bowden" or [#"To: (Name)"] = "Marcus Burnap;EasiAdmin;Radka White" or [#"To: (Name)"] = "Marcus Burnap;OSS Support" or [#"To: (Name)"] = "Maria Garcia;Accounts Receivable (Origin UK);OSS Support" or [#"To: (Name)"] = "Maria Garcia;OSS Support;Onboarding" or [#"To: (Name)"] = "Mark Tomlins;Nabil Awan;OSS Support;James Harley;Dean Franklin" or [#"To: (Name)"] = "Mark Tomlins;OSS Support;James Harley;Dean Franklin" or [#"To: (Name)"] = "Michal Zahradil;Tapiwa Tutisani;Louise Collis;OSS Support" or [#"To: (Name)"] = "mikemounter+8ncmhhmwaehnuqvzli46@boards.trello.com;EasiAdmin" or [#"To: (Name)"] = "Nabil Awan;1.Origin IT Support;EasiAdmin" or [#"To: (Name)"] = "Nabil Awan;Accounts Receivable (Origin UK);OSS Support" or 
[#"To: (Name)"] = "Nabil Awan;Claire Cunnick;OSS Support" or [#"To: (Name)"] = "Nabil Awan;Elite;OSS Support" or [#"To: (Name)"] = "Nabil Awan;James Woodhead;OSS Support;Rob Sturgeon;Marketing" or [#"To: (Name)"] = "Nabil Awan;Mark Tomlins;OSS Support;Dean Franklin" or [#"To: (Name)"] = "Nabil Awan;Orders South;OSS Support" or [#"To: (Name)"] = "Nabil Awan;OSS Support" or [#"To: (Name)"] = "Nabil Awan;OSS Support;James Harley;Dean Franklin;Anthony Smith" or [#"To: (Name)"] = "Nabil Awan;OSS Support;R&D" or [#"To: (Name)"] = "Nabil Awan;Pavla Mikulikova;OSS Support" or [#"To: (Name)"] = "Nabil Awan;Ricky Tailor;OSS Support" or [#"To: (Name)"] = "Nabil Awan;Solutions;Elite;Windows Production;OSS Support" or [#"To: (Name)"] = "Nick Evans;OSS Support" or [#"To: (Name)"] = "Oana Buzenchi;OSS Support" or [#"To: (Name)"] = "Orders North;OSS Support" or [#"To: (Name)"] = "Orders South;OSS Support" or [#"To: (Name)"] = "Orders South;OSS Support;Claire Cunnick" or [#"To: (Name)"] = "OSS Support" or [#"To: (Name)"] = "OSS Support;Claire Cunnick" or [#"To: (Name)"] = "OSS Support;James Harley;Dean Franklin" or [#"To: (Name)"] = "OSS Support;Jon Ward" or [#"To: (Name)"] = "OSS Support;Nabil Awan" or [#"To: (Name)"] = "OSS Support;Onboarding" or [#"To: (Name)"] = "OSS Support;Ricky Tailor;Elite" or [#"To: (Name)"] = "OSS Support;Rob Sturgeon" or [#"To: (Name)"] = "OSS Support;Windows Production" or [#"To: (Name)"] = "Pavla Mikulikova;OSS Support" or [#"To: (Name)"] = "Purchasing Team;Alexandra Giles;Joanna Fardon;Nabil Awan;OSS Support;Rob Sturgeon;Marketing" or [#"To: (Name)"] = "Rebecca Williams;OSS Support" or [#"To: (Name)"] = "Ricky Tailor;Chris Page;Nabil Awan;OSS Support;Ania Richards;Mike Mounter" or [#"To: (Name)"] = "Ricky Tailor;EasiAdmin" or [#"To: (Name)"] = "Ricky Tailor;James Harley;Elite;OSS Support" or [#"To: (Name)"] = "Ricky Tailor;Nabil Awan;OSS Support;Marketing" or [#"To: (Name)"] = "Ricky Tailor;OSS Support" or [#"To: (Name)"] = "Ricky Tailor;OSS Support;Claire Cunnick" or [#"To: (Name)"] = 
"Robert Bruce;Elite;OSS Support" or [#"To: (Name)"] = "Robert Bruce;OSS Support" or [#"To: (Name)"] = "Ryan Litchfield;Ricky Tailor;OSS Support" or [#"To: (Name)"] = "Sarah Darbee;1.Origin IT Support;EasiAdmin" or [#"To: (Name)"] = "Sarah Darbee;OSS Support" or [#"To: (Name)"] = "Sierra Tommas;Nabil Awan;OSS Support;R&D" or [#"To: (Name)"] = "Sierra Tommas;OSS Support" or [#"To: (Name)"] = "Sierra Tommas;OSS Support;R&D" or [#"To: (Name)"] = "Solutions;Elite;OSS Support" or [#"To: (Name)"] = "Solutions;Elite;Windows Production;OSS Support" or [#"To: (Name)"] = "Tapiwa Tutisani;Louise Collis;OSS Support" or [#"To: (Name)"] = "Windows Production;OSS Support;Solutions")),

#"Filtered Rows1" = Table.SelectRows(#"Filtered Rows", each not Text.Contains([Subject], "RE: ")),

#"Removed Columns" = Table.RemoveColumns(#"Filtered Rows1",{"To: (Name)"}),

#"Added Custom" = Table.AddColumn(#"Removed Columns", "Emails", each [Subject] & " " & [Body]),

#"Removed Columns1" = Table.RemoveColumns(#"Added Custom",{"Subject", "Body"}),

#"Cleaned Text" = Table.TransformColumns(#"Removed Columns1",{{"Emails", Text.Clean, type text}}),

#"Remove all but latest message" = Table.TransformColumns(#"Cleaned Text", {{"Emails", each Text.BeforeDelimiter(_, "From: "), type text}}),

#"Replaced Value" = Table.ReplaceValue(#"Remove all but latest message","<http://origin-global.com/>","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value62" = Table.ReplaceValue(#"Replaced Value","?","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value60" = Table.ReplaceValue(#"Replaced Value62","!","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value59" = Table.ReplaceValue(#"Replaced Value60","#","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value57" = Table.ReplaceValue(#"Replaced Value59","*","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value56" = Table.ReplaceValue(#"Replaced Value57","""","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value1" = Table.ReplaceValue(#"Replaced Value56","0808 168 5816","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value2" = Table.ReplaceValue(#"Replaced Value1","orderssouth@origin-global.com","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value3" = Table.ReplaceValue(#"Replaced Value2","<mailto:orderssouth@origin-global.com>","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value4" = Table.ReplaceValue(#"Replaced Value3","www.origin-global.com","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value5" = Table.ReplaceValue(#"Replaced Value4","<http://origin-global.com/>","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value6" = Table.ReplaceValue(#"Replaced Value5","Origin Global,","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value7" = Table.ReplaceValue(#"Replaced Value6","Sands 10 Industrial Estate,","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value8" = Table.ReplaceValue(#"Replaced Value7","Hillbottom Road,","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value9" = Table.ReplaceValue(#"Replaced Value8","High Wycombe,","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value10" = Table.ReplaceValue(#"Replaced Value9","Buckinghamshire","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value11" = Table.ReplaceValue(#"Replaced Value10","HP12 4HS","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value12" = Table.ReplaceValue(#"Replaced Value11","<http://www.houzz.co.uk/pro/origin-global/origin-global>","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value13" = Table.ReplaceValue(#"Replaced Value12","<https://www.pinterest.com/originbifolds/>","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value14" = Table.ReplaceValue(#"Replaced Value13","<https://www.linkedin.com/company/origin-frames>","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value15" = Table.ReplaceValue(#"Replaced Value14","<https://instagram.com/origin_global/>","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value16" = Table.ReplaceValue(#"Replaced Value15","<http://www.originbifolds.com/>","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value17" = Table.ReplaceValue(#"Replaced Value16","<http://originuae.com/>","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value18" = Table.ReplaceValue(#"Replaced Value17","Get Outlook for iOS","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value19" = Table.ReplaceValue(#"Replaced Value18","<https://aka.ms/o0ukef>","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value20" = Table.ReplaceValue(#"Replaced Value19","<https://www.facebook.com/OriginFramesUK/>","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value21" = Table.ReplaceValue(#"Replaced Value20","<https://twitter.com/originbifolds>","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value22" = Table.ReplaceValue(#"Replaced Value21","<http://origin-global.com/bi-fold-doors>","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value23" = Table.ReplaceValue(#"Replaced Value22","<http://origin-global.com/aluminium-windows>","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value24" = Table.ReplaceValue(#"Replaced Value23","<http://origin-global.com/electric-roller-blinds>","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value30" = Table.ReplaceValue(#"Replaced Value24","<https://www.facebook.com/OriginFramesUK/>","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value25" = Table.ReplaceValue(#"Replaced Value30","<http://www.ribaproductselector.com/origin-global/29946/overview.aspx>","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value26" = Table.ReplaceValue(#"Replaced Value25","Get Outlook for Android","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value27" = Table.ReplaceValue(#"Replaced Value26","<https://aka.ms/ghei36>","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value28" = Table.ReplaceValue(#"Replaced Value27","<image001.jpg>","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value29" = Table.ReplaceValue(#"Replaced Value28","(941) 484 - 4861","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value31" = Table.ReplaceValue(#"Replaced Value29","<http://www.originbifolds.com/bi-fold-doors>","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value32" = Table.ReplaceValue(#"Replaced Value31","<http://www.originbifolds.com/aluminum-windows>","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value33" = Table.ReplaceValue(#"Replaced Value32","Stuart House,","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value34" = Table.ReplaceValue(#"Replaced Value33","Castle Estate,","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value35" = Table.ReplaceValue(#"Replaced Value34","Coronation Road,","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value36" = Table.ReplaceValue(#"Replaced Value35","HP12 3TA","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value37" = Table.ReplaceValue(#"Replaced Value36","<https://www.facebook.com/OriginBifoldsUSA>","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value38" = Table.ReplaceValue(#"Replaced Value37","<https://twitter.com/Originbifolds>","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value39" = Table.ReplaceValue(#"Replaced Value38","<http://www.houzz.co.uk/pro/originbifolds/origin-global>","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value40" = Table.ReplaceValue(#"Replaced Value39","Origin USA will be closed for Christmas Holiday starting December 21, 2018 through January 1, 2019. We will reopen on January 2, 2019.","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value41" = Table.ReplaceValue(#"Replaced Value40","Origin USA Inc., 700 Commerce Drive, Venice, Florida, 34292","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value42" = Table.ReplaceValue(#"Replaced Value41","(941) 484 - 4861","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value43" = Table.ReplaceValue(#"Replaced Value42","08448 802 378","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value44" = Table.ReplaceValue(#"Replaced Value43","01494 418 493","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value45" = Table.ReplaceValue(#"Replaced Value44","<http://www.originbifolds.com>","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value67" = Table.ReplaceValue(#"Replaced Value45","<https://www.originbifolds.com>","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value66" = Table.ReplaceValue(#"Replaced Value67","www.originbifolds.com","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value46" = Table.ReplaceValue(#"Replaced Value66","[Origin | DOORS AND WINDOWS]","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value47" = Table.ReplaceValue(#"Replaced Value46","Sent from my Samsung Galaxy smartphone.","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value48" = Table.ReplaceValue(#"Replaced Value47",",","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value49" = Table.ReplaceValue(#"Replaced Value48","'","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value50" = Table.ReplaceValue(#"Replaced Value49","-","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value52" = Table.ReplaceValue(#"Replaced Value50","Hi Team","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value53" = Table.ReplaceValue(#"Replaced Value52","Hi all","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value54" = Table.ReplaceValue(#"Replaced Value53","Hi guys","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value55" = Table.ReplaceValue(#"Replaced Value54","Hiya","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value58" = Table.ReplaceValue(#"Replaced Value55",".","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value51" = Table.ReplaceValue(#"Replaced Value58","Hi ","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value61" = Table.ReplaceValue(#"Replaced Value51","/"," ",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value63" = Table.ReplaceValue(#"Replaced Value61","Many thanks","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value64" = Table.ReplaceValue(#"Replaced Value63","Thanks","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value65" = Table.ReplaceValue(#"Replaced Value64","08448 802 378","",Replacer.ReplaceText,{"Emails"}),

#"Removed Columns2" = Table.RemoveColumns(#"Replaced Value65",{"CC: (Name)"}),

#"Reordered Columns1" = Table.ReorderColumns(#"Removed Columns2",{"Emails", "From: (Name)"}),

#"Replaced Value75" = Table.ReplaceValue(#"Reordered Columns1","Thank you","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value76" = Table.ReplaceValue(#"Replaced Value75","Please help me","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value77" = Table.ReplaceValue(#"Replaced Value76","Best regards","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value78" = Table.ReplaceValue(#"Replaced Value77","Kind regards","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value79" = Table.ReplaceValue(#"Replaced Value78","""","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value80" = Table.ReplaceValue(#"Replaced Value79","`","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value81" = Table.ReplaceValue(#"Replaced Value80","URGENT","",Replacer.ReplaceText,{"Emails"}),

#"Replaced Value82" = Table.ReplaceValue(#"Replaced Value81","'","",Replacer.ReplaceText,{"Emails"})

in

#"Replaced Value82"
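If this cleaning moves into the Python pipeline, the long chain of `Table.ReplaceValue` steps might be expressed as a data-driven loop. This is only a sketch: `BOILERPLATE` holds a handful of the literal strings from the query above (the full list would be longer), and `strip_boilerplate` is a hypothetical name.

```python
# A few of the literal strings removed by the query above.
BOILERPLATE = [
    "<http://www.originbifolds.com>",
    "www.originbifolds.com",
    "Get Outlook for iOS",
    "Sent from my Samsung Galaxy smartphone.",
    "Many thanks",
    "Kind regards",
]
# The single characters the query deletes outright.
PUNCTUATION = '?!#*",\'-.`'


def strip_boilerplate(text):
    """Remove known boilerplate phrases, then punctuation, as the query does."""
    # Longest first, so a phrase is never broken up by removing a shorter one it contains.
    for phrase in sorted(BOILERPLATE, key=len, reverse=True):
        text = text.replace(phrase, "")
    # Phrases go before punctuation so that those containing "." or "-" still match.
    text = text.translate(str.maketrans("", "", PUNCTUATION))
    # The query replaces "/" with a space rather than deleting it.
    return text.replace("/", " ")
```

Keeping the phrases in a list means a new signature line or URL is a one-line data change rather than a new step in the chain.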

Filter out the sender's full name (From) from the email

Split words

Filter email addresses

Filter numbers

Filter names

Colin-Codes commented 5 years ago

Alternatively it could be much easier just to flag up unrecognised words to be added or ignored, on import for future training.
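One way this could work, sketched in Python: on import, count every token that is neither in the known vocabulary nor on an ignore list, and review the most frequent ones by hand. `flag_unrecognised`, `vocabulary` and `ignored` are hypothetical names, and whitespace tokenisation is an assumption.

```python
from collections import Counter


def flag_unrecognised(emails, vocabulary, ignored=frozenset()):
    """Count tokens that are neither in the vocabulary nor explicitly ignored."""
    unknown = Counter()
    for text in emails:
        for token in text.lower().split():
            if token not in vocabulary and token not in ignored:
                unknown[token] += 1
    # Most frequent first, so the reviewer sees the highest-impact words at the top.
    return unknown.most_common()
```

Each flagged word would then be added either to the vocabulary or to the ignore list before the next training run.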