Colin-Codes / IntentClassifier-ML-Project

Python, Keras, SciKit-Learn, Matplotlib: Machine learning research project on classifying the intent behind tech support emails in order to enable automatic follow-up.

Generate test set (November) #37

Closed: Colin-Codes closed 4 years ago

Colin-Codes commented 4 years ago

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.1337&rep=rep1&type=pdf

How to split the data into test and training sets

Class distributions at each stage (counts per class, dtype: int64):

| Class | Original | Test | Training | Balanced | Augmented |
| --- | ---: | ---: | ---: | ---: | ---: |
| Account | 372 | 74 | 298 | 298 | 1490 |
| Error | 115 | 23 | 92 | 298 | 1490 |
| Information | 80 | 16 | 64 | 298 | 1490 |
| Documents | 44 | 9 | 35 | 298 | 1490 |
| Pricing | 42 | 8 | 34 | 298 | 1490 |
| Forward | 38 | 8 | 30 | 298 | 1490 |
| Gables | 37 | 7 | 30 | 298 | 1490 |
| Action | 32 | 6 | 26 | 298 | 1490 |
| Reminder | 26 | 5 | 21 | 298 | 1490 |
| Admin | 25 | 5 | 20 | 298 | 1490 |
| Template | 23 | 5 | 18 | 298 | 1490 |
| Delivery | 15 | 3 | 12 | 298 | 1490 |
| Report | 15 | 3 | 12 | 298 | 1490 |
| Logo | 14 | 3 | 11 | 298 | 1490 |
| Weight | 13 | 3 | 10 | 298 | 1490 |
| Colour | 11 | 2 | 9 | 298 | 1490 |
| Feedback | 11 | 2 | 9 | 298 | 1490 |
| Availability | 10 | 2 | 8 | 298 | 1490 |
| EqualGlass | 7 | 2 | 5 | 298 | 1490 |
| Leaver | 6 | 1 | 5 | 298 | 1490 |
| Status | 6 | 1 | 5 | 298 | 1490 |
| Access | 6 | 1 | 5 | 298 | 1490 |
| Authorisation | 5 | 1 | 4 | 298 | 1490 |
| Callback | 4 | 1 | 3 | 298 | 1490 |
| Project | 3 | 1 | 2 | 298 | 1490 |

Augmentation log: generated augmented sentences with eda for ../../data/trainingSet_balanced.csv to ../../data/trainingSet_augmented.csv with num_aug=4
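The distributions above suggest a roughly 80/20 per-class split, followed by oversampling every training class up to the largest one (298). The sketch below illustrates that order of operations on toy data using only the standard library. It is not the project's actual script: the function names, the rounding rule, and the toy rows are all assumptions made for illustration, and scikit-learn's `train_test_split(..., stratify=labels)` would be the more usual way to do the split.

```python
import random
from collections import Counter, defaultdict


def group_by_label(rows):
    """Group (text, label) rows into a dict of label -> list of rows."""
    by_class = defaultdict(list)
    for text, label in rows:
        by_class[label].append((text, label))
    return by_class


def stratified_split(rows, test_fraction=0.2, seed=0):
    """Split rows so each class keeps roughly the same share in both sets."""
    rng = random.Random(seed)
    train, test = [], []
    for items in group_by_label(rows).values():
        rng.shuffle(items)
        # Aim for at least one test example per class,
        # but always leave at least one example for training.
        n_test = max(1, round(len(items) * test_fraction))
        n_test = min(n_test, len(items) - 1)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


def oversample_to_majority(rows, seed=0):
    """Resample each class with replacement up to the largest class size."""
    rng = random.Random(seed)
    by_class = group_by_label(rows)
    target = max(len(items) for items in by_class.values())
    balanced = []
    for items in by_class.values():
        balanced.extend(items)
        balanced.extend(rng.choices(items, k=target - len(items)))
    return balanced


# Toy data (hypothetical), standing in for the labelled emails.
toy_counts = {"Account": 10, "Error": 5, "Project": 3}
rows = [
    (f"{label} email {i}", label)
    for label, n in toy_counts.items()
    for i in range(n)
]

# Split first, then balance only the training portion, so that
# duplicated rows never end up in the test set.
train, test = stratified_split(rows, test_fraction=0.2)
balanced = oversample_to_majority(train)

print(Counter(label for _, label in test))
print(Counter(label for _, label in train))
print(Counter(label for _, label in balanced))
```

Balancing after the split keeps the test set free of copies of training rows, which matters when the balanced set repeats the smallest classes many times over. The augmented counts are consistent with each balanced row being kept alongside four generated variants: 298 × (4 + 1) = 1490.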

Colin-Codes commented 4 years ago

I'm worried about over-fitting as these results are a bit convenient...

[Image: results plots, top to bottom: RNN, LSTM, GRU]