Rostlab / JS16_ProjectB_Group6

Game of Thrones characters are always in danger of being eliminated. The challenge in this assignment is to estimate how much risk the characters who are still alive face of being eliminated. The goal of this project is to rank characters by their Percentage Likelihood of Death (PLOD). You will assign a PLOD using machine learning approaches.
GNU General Public License v3.0

choose optimal features and parameters for predicting PLOD #54

Closed: subburamr closed this issue 8 years ago

subburamr commented 8 years ago

Kernels:

1. polykernel
2. rbfkernel

Attributes: removed placeOfBirth, placeOfDeath, placeOfLastVisit and age
Parameters: default

Attachments: 3.attributeranking.txt, 3.polykernel.txt, 3.rbfkernel.txt.gz
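For context, a rough equivalent of this kernel comparison outside Weka (a minimal sketch assuming scikit-learn instead of Weka's SMO, a hypothetical got_characters.csv export, and numerically encoded attributes):

    # Minimal sketch of the poly-vs-RBF kernel comparison, assuming
    # scikit-learn. File and column names are assumptions; attributes
    # must already be numerically encoded.
    import pandas as pd
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    df = pd.read_csv("got_characters.csv")  # hypothetical export
    removed = ["placeOfBirth", "placeOfDeath", "placeOfLastVisit", "age"]
    X = df.drop(columns=["name", "isAlive"] + removed)
    y = df["isAlive"]

    for kernel in ("poly", "rbf"):  # polykernel vs. rbfkernel
        scores = cross_val_score(SVC(kernel=kernel), X, y, cv=10, scoring="f1")
        print(f"{kernel}: mean F-measure {scores.mean():.3f}")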

subburamr commented 8 years ago

Adding some results for a subset of the instances.

Attributes used (attributes with more than 50% missing values removed), 16 in total: name, title, male, culture, house, book1, book2, book3, book4, book5, isNoble, numDeadRelations, boolDeadRelations, isPopular, popularity, isAlive

Dataset 1: only instances with isPopular = 1 (normalized popularity score > 0.34). Number of instances: 115

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
                 0.691     0.267      0.704     0.691     0.697      0.788    1
                 0.733     0.309      0.721     0.733     0.727      0.792    0
Weighted Avg.    0.713     0.289      0.713     0.713     0.713      0.79

=== Confusion Matrix ===

  a  b   <-- classified as
 38 17 |  a = 1
 16 44 |  b = 0

Attachments: 1.OnlyPopular_poly.txt, 1.OnlyPopular_rbf.txt

Dataset 2: either popular or has a title. Number of instances: 971

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
                 0.823     0.454      0.818     0.823     0.82       0.756    1
                 0.546     0.177      0.556     0.546     0.551      0.756    0
Weighted Avg.    0.744     0.374      0.742     0.744     0.743      0.756

=== Confusion Matrix ===

   a   b   <-- classified as
 569 122 |  a = 1
 127 153 |  b = 0

Attachments: 2.OnlyPopOrTitle_poly.txt, 2.OnlyPopOrTitle_rbf.txt

Dataset 3: instances that are (popular or have a title) and (have a culture or a house). Number of instances: 850

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
                 0.754     0.435      0.808     0.754     0.78       0.725    1
                 0.565     0.246      0.486     0.565     0.522      0.725    0
Weighted Avg.    0.699     0.38       0.714     0.699     0.705      0.725

=== Confusion Matrix ===

   a   b   <-- classified as
 454 148 |  a = 1
 108 140 |  b = 0

Attachments: 3.OnlyPopOrTitleandHorC_poly.txt, 3.OnlyPoporTitleandHorC_rbf.txt
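For reproducibility, the three instance filters could look like this (a sketch assuming a pandas DataFrame with the columns listed above, where missing culture/house/title values are NaN):

    # Sketch of the three dataset filters on the full character table;
    # `df` is the hypothetical DataFrame from the earlier sketch.
    ds1 = df[df["isPopular"] == 1]                            # dataset 1
    ds2 = df[(df["isPopular"] == 1) | df["title"].notna()]    # dataset 2
    ds3 = ds2[ds2["culture"].notna() | ds2["house"].notna()]  # dataset 3
    print(len(ds1), len(ds2), len(ds3))  # expected: 115, 971, 850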

gyachdav commented 8 years ago

Check this out for more features that were tested:

http://www.techtimes.com/articles/16812/20140930/can-statistics-tell-us-who-dies-next-on-game-of-thrones.htm

goldbergtatyana commented 8 years ago

As I can see, the result on dataset 1 is the leading one, with an F-measure of 0.727. However, its data set is rather small (115 characters)... For each dataset, how do you compare against random when predicting new dead characters? Have you tried other ML classification algorithms?

subburamr commented 8 years ago

Below are the results with RandomForest, which was good for the smaller dataset but didn't give a good result for the larger dataset.

RandomForest, Dataset 2: either popular or has the attribute title

=== Summary ===

Correctly Classified Instances      717     73.8414 %
Incorrectly Classified Instances    254     26.1586 %
Kappa statistic                       0.1566
Mean absolute error                   0.3783
Root mean squared error               0.4226
Relative absolute error              92.1235 %
Root relative squared error          93.281  %
Total Number of Instances           971

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
                 0.983     0.864      0.737     0.983     0.842      0.738    1
                 0.136     0.017      0.76      0.136     0.23       0.738    0
Weighted Avg.    0.738     0.62       0.744     0.738     0.666      0.738

=== Confusion Matrix ===

   a   b   <-- classified as
 679  12 |  a = 1
 242  38 |  b = 0

Naive Bayes: the results were not better than SVM.

Naive Bayes, Dataset 2: either popular or has the attribute title

=== Summary ===

Correctly Classified Instances      721     74.2533 %
Incorrectly Classified Instances    250     25.7467 %
Kappa statistic                       0.2965
Mean absolute error                   0.2825
Root mean squared error               0.4657
Relative absolute error              68.7948 %
Root relative squared error         102.7997 %
Total Number of Instances           971

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
                 0.893     0.629      0.778     0.893     0.832      0.703    1
                 0.371     0.107      0.584     0.371     0.454      0.703    0
Weighted Avg.    0.743     0.478      0.722     0.743     0.723      0.703

=== Confusion Matrix ===

   a   b   <-- classified as
 617  74 |  a = 1
 176 104 |  b = 0
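A compact way to reproduce these two baselines (a sketch with scikit-learn stand-ins for Weka's RandomForest and NaiveBayes, run on the dataset-2 subset ds2 from the sketch above; numeric encoding is assumed):

    # Sketch: RandomForest and Naive Bayes baselines on dataset 2,
    # evaluated with 10-fold cross-validated predictions.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import cross_val_predict
    from sklearn.naive_bayes import GaussianNB

    X2 = ds2.drop(columns=["name", "isAlive"])
    y2 = ds2["isAlive"]

    for clf in (RandomForestClassifier(n_estimators=100), GaussianNB()):
        pred = cross_val_predict(clf, X2, y2, cv=10)
        print(type(clf).__name__)
        print(classification_report(y2, pred))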

One observation when training SVM on the different datasets was that the dead characters which get misclassified are mostly the same instances. For example, in the previous results, dataset 2 (127 misclassified dead) and dataset 3 (108 misclassified dead) had 102 instances in common. Even when trying different combinations of attributes, most of these instances remain misclassified.
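This overlap can be checked directly (a sketch; pred2 and pred3 are assumed to be cross-validated predictions for ds2 and ds3, with class 0 meaning dead):

    # Sketch: overlap of misclassified dead characters across datasets.
    # A dead character (isAlive == 0) is misclassified when predicted 1.
    mis2 = set(ds2.loc[(ds2["isAlive"] == 0) & (pred2 == 1), "name"])
    mis3 = set(ds3.loc[(ds3["isAlive"] == 0) & (pred3 == 1), "name"])
    print(len(mis2), len(mis3), len(mis2 & mis3))  # e.g. 127, 108, 102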

However, removing them from the dataset gives a good result.

Dataset 2: either popular or has the attribute title, misclassified instances removed.

=== Summary ===

Correctly Classified Instances      753     90.942  %
Incorrectly Classified Instances     75      9.058  %
Kappa statistic                       0.649
Mean absolute error                   0.094
Root mean squared error               0.274
Relative absolute error              32.8436 %
Root relative squared error          72.495  %
Total Number of Instances           828

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
                 0.971     0.385      0.924     0.971     0.947      0.941    1
                 0.615     0.029      0.815     0.615     0.701      0.941    0
Weighted Avg.    0.909     0.323      0.905     0.909     0.904      0.941

=== Confusion Matrix ===

   a   b   <-- classified as
 665  20 |  a = 1
  55  88 |  b = 0

goldbergtatyana commented 8 years ago

Yep, given that dead characters make up 17% of your data set, the prediction with Naive Bayes for dataset 2 is much better than random (62%, 88 correct out of 143). What are the misclassified instances? Do they share a common (missing?) set of features?

subburamr commented 8 years ago

Among the misclassified dead characters (137), some patterns were missing values for culture (53 instances) and the title "Ser" (64 instances). However, removing the culture attribute, or removing other attributes one by one, did not reduce these misclassifications much. Below is the list of dead characters who get misclassified in the different datasets: dead_misclassified.xlsx
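The two patterns can be counted in a couple of lines (a sketch; mis is assumed to be the DataFrame of the 137 misclassified dead characters):

    # Sketch: counting the observed patterns among the misclassified dead
    # characters (column names are assumptions).
    print("missing culture:", mis["culture"].isna().sum())  # e.g. 53
    print("title 'Ser':", (mis["title"] == "Ser").sum())    # e.g. 64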

The result in the previous comment (dataset 2 with misclassified instances removed) was with the SMO polynomial kernel, and I noticed that SMO itself provided better results than other classification algorithms. Here are some results with other algorithms on the full dataset: results_other_algorithm_fulldataset.txt

So far the best result was obtained when using SMO with a polykernel together with a ThresholdSelector that automatically optimizes the F-measure for the class "dead", and replacing missing values with the mean or median. For the SMO kernel, changing the C value from 1 to 8 improved the results on the filtered dataset but reduced the accuracy on the full dataset.
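An approximate scikit-learn equivalent of this Weka setup (ReplaceMissingValues + SMO polykernel + ThresholdSelector) is sketched below; the threshold search over cross-validated probabilities and the class-0-equals-dead encoding are assumptions:

    # Sketch: mean imputation + polynomial-kernel SVM, then pick the
    # probability threshold that maximizes the F-measure for class "dead".
    import numpy as np
    from sklearn.impute import SimpleImputer
    from sklearn.metrics import f1_score
    from sklearn.model_selection import cross_val_predict
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    pipe = make_pipeline(SimpleImputer(strategy="mean"),
                         SVC(kernel="poly", C=1, probability=True))
    proba = cross_val_predict(pipe, X, y, cv=10, method="predict_proba")
    p_dead = proba[:, 0]  # assumes class 0 = dead

    thresholds = np.arange(0.05, 0.95, 0.05)
    best = max(thresholds, key=lambda t: f1_score(y == 0, p_dead >= t))
    print(f"best threshold for class 'dead': {best:.2f}")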

Full dataset

=== Summary ===

Correctly Classified Instances     1433     73.6382 %
Incorrectly Classified Instances    513     26.3618 %
Kappa statistic                       0.3544
Mean absolute error                   0.32
Root mean squared error               0.4307
Relative absolute error              84.3261 %
Root relative squared error          98.902  %
Total Number of Instances          1946

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
                 0.784     0.402      0.851     0.784     0.816      0.76     1
                 0.598     0.216      0.485     0.598     0.536      0.76     0
Weighted Avg.    0.736     0.355      0.758     0.736     0.745      0.76

=== Confusion Matrix ===

    a    b   <-- classified as
 1137  314 |  a = 1
  199  296 |  b = 0

Filtered dataset: only instances with "Title" or popular.

=== Summary ===

Correctly Classified Instances      717     73.8414 %
Incorrectly Classified Instances    254     26.1586 %
Kappa statistic                       0.396
Mean absolute error                   0.2794
Root mean squared error               0.4338
Relative absolute error              68.0424 %
Root relative squared error          95.7574 %
Total Number of Instances           971

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
                 0.779     0.361      0.842     0.779     0.809      0.79     1
                 0.639     0.221      0.539     0.639     0.585      0.79     0
Weighted Avg.    0.738     0.321      0.755     0.738     0.744      0.79

=== Confusion Matrix ===

   a   b   <-- classified as
 538 153 |  a = 1
 101 179 |  b = 0

Attachments: Poly_ThresholdSelector_fulldataset.txt, Poly_ThresholdSelector_PopOrTitle_dataset.txt

Below are the attributes that were used.

Attribute ranking (ReliefF score, attribute index, attribute name):

    0.09480986639259702       13  book4
    0.07291363981192191        8  house
    0.05979625565956465        4  culture
    0.052158273381294196      14  book5
    0.043165467625899026      20  isNoble
    0.043165467625899026       3  male
    0.03491544426796254        2  title
    0.028145764263670943      21  age
    0.027749229188078366      12  book3
    0.024768756423432896      19  isMarried
    0.01783144912641327       18  isAliveSpouse
    0.01505652620760542       11  book2
    0.013514902363823281      23  boolDeadRelations
    0.01228160328879757       10  book1
    0.010739655549845857      25  popularity
    0.004367934224049325      16  isAliveFather
    0.0038540596094552874     24  isPopular
    0.0036485097636176724     15  isAliveMother
    0.0034943473792394615     17  isAliveHeir
    0.0030524152106885974     22  numDeadRelations
    0.0010731507869434253      9  spouse
    0.0001523361764684993536   6  father
    0.0001518796244910712576   5  mother
    0.00010044143503417536     7  heir

PLODs: Plod_full_and_filtered_data.xlsx
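The ranking above comes from a ReliefF attribute evaluation in Weka; outside Weka, a similar ranking could be sketched with the skrebate package (the package choice and n_neighbors setting are assumptions):

    # Sketch: ReliefF-style attribute ranking via skrebate (assumed
    # package; Weka produced the ranking above).
    from skrebate import ReliefF

    relief = ReliefF(n_neighbors=10)
    relief.fit(X.to_numpy(dtype=float), y.to_numpy())
    ranking = sorted(zip(relief.feature_importances_, X.columns), reverse=True)
    for score, name in ranking:
        print(f"{score:.4f}  {name}")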

goldbergtatyana commented 8 years ago

Of all the attempts listed here to find an optimal model, the SMO polykernel model on the full data set is the best one. I think we can stop here and use the results of this model for got.show.

What I did not understand, however, is why the results of this model differ between the post from 4 days ago:

=== Confusion Matrix ===

    a    b   <-- classified as
 1129  322 |  a = 1
  217  278 |  b = 0

and the one from one day ago:

    a    b   <-- classified as
 1137  314 |  a = 1
  199  296 |  b = 0

sacdallago commented 8 years ago

This discussion exceeds my knowledge, but the title of this issue is:

choose optimal features and parameters for predicting PLOD

Which is definitely something that should be closed after a feature freeze.

goldbergtatyana commented 8 years ago

Yes, the delivery is tomorrow - no improvements should be made after tomorrow. However, today you @subburamr could try one more thing:

Take the predictions of your final model on the full data set. Each character has a prediction of being DEAD and the corresponding PLOD. As of now (i.e. by default), a PLOD of 50 discriminates the dead from the alive ones. Will the performance of your model improve if you lower the PLOD threshold?

subburamr commented 8 years ago

Below are the differences between the latest result and the first result.

@sacdallago Since we were getting feedback on the results and trying to choose which features to use among our collected features, I had not closed this issue. However, we have not added any new features, or even pushed a new commit, since the feature freeze date :smiley:

subburamr commented 8 years ago

@goldbergtatyana By reducing the threshold value, the model improves at classifying dead characters.

Here is the summary with a threshold value of 35%; testing with thresholds lower than this seems to hurt overall performance.

Summary with threshold value 35%

Correctly classified instances     71.0688591984 %
Incorrectly classified instances   28.9311408016 %

=== Confusion Matrix ===

 Prediction
 1053  398 | Alive
  165  330 | Dead

######### Classification Report #########

              precision    recall  f1-score   support

        Dead       0.45      0.67      0.54       495
       Alive       0.86      0.73      0.79      1451

 avg / total       0.76      0.71      0.73      1946

The file below contains reports for the different threshold values: plod_varying_thresholds.txt
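The sweep itself is a small loop over the cross-validated death probabilities (a sketch reusing p_dead and y from the earlier pipeline sketch):

    # Sketch: classification reports for a range of PLOD thresholds.
    from sklearn.metrics import classification_report

    for t in (0.50, 0.45, 0.40, 0.35, 0.30):
        print(f"--- threshold {t:.0%} ---")
        print(classification_report(y == 0, p_dead >= t,
                                    target_names=["Alive", "Dead"]))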

sacdallago commented 8 years ago

@subburamr :D well, if you found a better predictor, you can still push that, but then close this issue (by today) and rest at least 'til the 10th of April! You can continue this discussion in the future by opening a new issue "choose better predictions than v1.0.0", and you can keep working on this repo as much as you wish :dancer: https://github.com/Rostlab/JS16_ProjectE/issues/19

goldbergtatyana commented 8 years ago

@subburamr the improvement in classifying dead characters comes at the cost of correctly classified alive ones. Therefore, the default threshold of 50 should be the one to use. Please forward your results (a function that provides a PLOD for each character) to group A today, and please write a short summary here of how you developed your prediction model.

subburamr commented 8 years ago

A summary of the prediction model has been documented here.

goldbergtatyana commented 8 years ago

@subburamr In the final prediction model, the "Ranking of Attributes using Relief F score" lists 24 features. Is this your total number of features and not 26 as written in the description?

Also, what do the abbreviations stand for:

Finally, what exactly do "related to dead" and "number dead relations" mean?

Thank you!

subburamr commented 8 years ago

Yes, the number of attributes used is 24; I have now updated the description. In the Weka result, name and the isAlive label are added as attributes, making it 26, so we had reported that we were using 26 attributes. However, name is removed via the Remove filter before running the SVM classifier and is used only for reporting the class probability. Final prediction output for reference: final_output.txt
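In scikit-learn terms, the same bookkeeping might look like this (a sketch: drop name before training but keep it for the PLOD report; pipe and df are from the earlier sketches):

    # Sketch: exclude `name` from the features but keep it for reporting
    # each character's PLOD (probability of death as a percentage).
    import pandas as pd
    from sklearn.model_selection import cross_val_predict

    X_full = df.drop(columns=["name", "isAlive"])
    proba = cross_val_predict(pipe, X_full, df["isAlive"], cv=10,
                              method="predict_proba")
    report = pd.DataFrame({"name": df["name"], "PLOD": 100 * proba[:, 0]})
    print(report.sort_values("PLOD", ascending=False).head(10))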

GoT, CoK, SoS, FfC and DwD are abbreviations for the five books (A Game of Thrones, A Clash of Kings, etc.); each represents whether a character has appeared in the specific book (value 1 if they appeared, else 0).
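For reference, a plausible mapping from these abbreviations to the book1-book5 columns, assuming the columns follow publication order (the exact assignment is an assumption):

    # Assumed mapping from the report's abbreviations to dataset columns;
    # each column holds 1 if the character appears in that book, else 0.
    BOOK_COLUMNS = {
        "GoT": "book1",  # A Game of Thrones
        "CoK": "book2",  # A Clash of Kings
        "SoS": "book3",  # A Storm of Swords
        "FfC": "book4",  # A Feast for Crows
        "DwD": "book5",  # A Dance with Dragons
    }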