s-feng opened this issue 8 years ago
@goldbergtatyana
28 alive characters who are predicted to be dead?? Who are they? :)))
The result looks very good, indeed. How many characters did you remove, and which features are you using? How does your performance look compared to a random baseline?
Did you also try other ML methods? Speed, btw, is very much unimportant, since we run the ML algorithm only once. What matters are the results!
It would be awesome if you provided us with constant updates of your results here!
Just tried Naive Bayes. Got a result like this:
    Time taken to build model: 0.01 seconds

    === Stratified cross-validation ===
    === Summary ===

    Correctly Classified Instances         819               86.0294 %
    Incorrectly Classified Instances       133               13.9706 %
    Kappa statistic                          0.7045
    Mean absolute error                      0.177
    Root mean squared error                  0.3044
    Relative absolute error                 42.335  %
    Root relative squared error             66.5947 %
    Total Number of Instances              952

    === Detailed Accuracy By Class ===

                   TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
                   0.804    0.007    0.996      0.804   0.89       0.993     alive
                   0.993    0.196    0.682      0.993   0.809      0.993     dead
    Weighted Avg.  0.86     0.063    0.903      0.86    0.866      0.993

    === Confusion Matrix ===

       a   b   <-- classified as
     538 131 |   a = alive
       2 281 |   b = dead
@dan736923 @konstantinos-angelo @nicoladesocio
What filter do you use? The one that is in the develop branch?
@s-feng btw if you try ClassificationViaRegression (in the meta folder), it gives really impressive results. @goldbergtatyana probably it overfits the data?

    === Stratified cross-validation ===
    === Summary ===

    Correctly Classified Instances         959               99.7919 %
    Incorrectly Classified Instances         2                0.2081 %
    Kappa statistic                          0.9951
    Mean absolute error                      0.026
    Root mean squared error                  0.0643
    Relative absolute error                  6.1641 %
    Root relative squared error             14.001  %
    Total Number of Instances              961

    === Detailed Accuracy By Class ===

                   TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
                   1        0.007    0.997      1       0.999      0.995     alive
                   0.993    0        1          0.993   0.997      0.995     dead
    Weighted Avg.  0.998    0.005    0.998      0.998   0.998      0.995

    === Confusion Matrix ===

       a   b   <-- classified as
     671   0 |   a = alive
       2 288 |   b = dead
Name | Missing |
---|---|
dateOfBirth | 61% |
dateOfDeath | 72% |
culture | 64% |
house | 11% |
title | 43% |
father | 97% |
mother | 98% |
heir | 98% |
house_founded | 90% |
num_houses_overlord | 100% |
age | 63% |
ageGroup | 63% |
gender | 0% |
score | 0% |
links | 0% |
connections | 0% |
hasHeir | 0% |
hasHeirAlive | 1% |
hasTitle | 10% |
hasHouse | 0% |
hasSpouse | 0% |
isSpouseAlive | 6% |
isNoble | 0% |
multipleBooks | 0% |
status | 0% |
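For reference, here is a minimal sketch of how such a missing-rate table could be generated. This is hypothetical, not the actual pipeline script: the input file name and the attribute list shown are assumptions, and the real percentages above come from our actual dataset.

```js
// missingRates.js -- hypothetical sketch, not the actual pipeline script.
// Assumes the characters are available as an array of plain objects.
const characters = require('./characters.json');

// illustrative subset of the attribute names from the table above
const attributes = ['dateOfBirth', 'dateOfDeath', 'culture', 'house', 'title'];

console.log('Name | Missing |');
console.log('---|---|');
for (const attr of attributes) {
  // treat undefined, null and empty string as missing
  const missing = characters.filter(
    (c) => c[attr] === undefined || c[attr] === null || c[attr] === ''
  ).length;
  const rate = Math.round((missing / characters.length) * 100);
  console.log(`${attr} | ${rate}% |`);
}
```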
Originally we had a dataset of 2000 characters. After filtering we now have 952 instances. But if we run Naive Bayes and RBFNetwork on the original 2000 instances, the prediction accuracy is still good:

RBFN:

    === Confusion Matrix ===

        a    b   <-- classified as
     1398   75 |    a = alive
       33  431 |    b = dead

Naive Bayes:

    === Confusion Matrix ===

        a    b   <-- classified as
     1320  153 |    a = alive
        2  462 |    b = dead
    name: 'galazza galare'    PLOD: 49.8%  predicted as: dead
    name: 'tormund'           PLOD: 49.3%  predicted as: dead
    name: 'howland reed'      PLOD: 38%    predicted as: dead
    name: 'tyrion lannister'  PLOD: 34.2%  predicted as: dead
    name: 'ardrian celtigar'  PLOD: 19.6%  predicted as: dead
    name: 'olenna redwyne'    PLOD: 18.9%  predicted as: dead
    name: 'robert arryn'      PLOD: 15.1%  predicted as: dead
    name: 'jaime lannister'   PLOD: 13.7%  predicted as: dead
Classification via regression has no alive characters predicted as dead. It looks like it overfits, with too high an order.
@s-feng What are the most important attributes for the character subset, btw?
@s-feng tried with more characters and more validation, and it still performs very well. There are also a few alive characters predicted as dead; however, it's true that the alive predicted as dead are fewer than the dead predicted as alive. This is done with 1761 characters and 20-fold cross-validation.

    === Summary ===

    Correctly Classified Instances        1745               99.0914 %
    Incorrectly Classified Instances        16                0.9086 %
    Kappa statistic                          0.976
    K&B Relative Info Score             165601.9959 %
    K&B Information Score                 1357.7942 bits     0.771  bits/instance
    Class complexity | order 0            1443.9304 bits     0.8199 bits/instance
    Class complexity | scheme              123.3226 bits     0.07   bits/instance
    Complexity improvement     (Sf)       1320.6078 bits     0.7499 bits/instance
    Mean absolute error                      0.0257
    Root mean squared error                  0.095
    Relative absolute error                  6.7578 %
    Root relative squared error             21.7782 %
    Total Number of Instances             1761

    === Detailed Accuracy By Class ===

                   TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
                   0.997    0.027    0.991      0.997   0.994      0.994     alive
                   0.973    0.003    0.991      0.973   0.982      0.994     dead
    Weighted Avg.  0.991    0.021    0.991      0.991   0.991      0.994

    === Confusion Matrix ===

        a    b   <-- classified as
     1307    4 |    a = alive
       12  438 |    b = dead
PS: how do you compute the PLOD?
Here are the results for Naive Bayes. Data set consists of 954 characters generated with the latest version of createARFF.js. See zip file for details.
    === Summary ===

    Correctly Classified Instances         789               82.7044 %
    Incorrectly Classified Instances       165               17.2956 %
    Kappa statistic                          0.6398
    Mean absolute error                      0.207
    Root mean squared error                  0.3306
    Relative absolute error                 49.3779 %
    Root relative squared error             72.2279 %
    Total Number of Instances              954

    === Detailed Accuracy By Class ===

                   TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
                   0.768    0.035    0.981      0.768   0.862      0.98      alive
                   0.965    0.232    0.64       0.965   0.769      0.98      dead
    Weighted Avg.  0.827    0.094    0.879      0.827   0.834      0.98

    === Confusion Matrix ===

       a   b   <-- classified as
     514 155 |   a = alive
      10 275 |   b = dead
RANK NAME PLOD
Attribute contribution: I removed dateOfBirth, father, mother, heir, and num_houses_overlord, since they had a negative ranking. Then I re-evaluated the attributes and got these results:

    dateOfDeath    0.11797
    house          0.1002
    gender         0.05252
    culture        0.03868
    age            0.02446
    ageGroup       0.02282
    hasTitle       0.02002
    hasHeir        0.01981
    hasHouse       0.01855
    isSpouseAlive  0.01462
    hasSpouse      0.01289
    title          0.01258
    score          0.01242
    links          0.01229
    connections    0.01229
    multipleBooks  0.00901
    hasHeirAlive   0.00587
    isNoble        0.00503
    house_founded  0.00341
Isn't it strange that we have alive characters classified as dead but with a PLOD < 0.5? Shouldn't it be above that threshold?
Also, can you retrain without dateOfDeath?
    === Summary ===

    Correctly Classified Instances         632               66.2474 %
    Incorrectly Classified Instances       322               33.7526 %
    Kappa statistic                          0.1846
    Mean absolute error                      0.3561
    Root mean squared error                  0.4623
    Relative absolute error                 84.9621 %
    Root relative squared error            101.0062 %
    Total Number of Instances              954

    === Detailed Accuracy By Class ===

                   TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
                   0.768    0.586    0.755      0.768   0.761      0.689     alive
                   0.414    0.232    0.432      0.414   0.423      0.689     dead
    Weighted Avg.  0.662    0.48     0.658      0.662   0.66       0.689

    === Confusion Matrix ===

       a   b   <-- classified as
     514 155 |   a = alive
     167 118 |   b = dead
That certainly is interesting. I am not sure that we should use dateOfDeath in our results, since it seems like we overfit our data: there is an almost 1-to-1 connection between dateOfDeath and the PLOD.
We can retry regression, I guess, and see again what happens.
Generally my feeling is that we have a good result when we classify as few dead people as alive as possible. So we confirm that the dead are dead, and then whatever discrepancy we have with alive characters classified as dead is what we are looking for.
Correction: we should NOT use date of death for our calculation.
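To make the leakage concrete: if the status label is essentially derived from dateOfDeath, a trivial one-field rule already reproduces it, which is why the classifier looked so good. A minimal sanity-check sketch (the character objects and field names are assumptions):

```js
// leakageCheck.js -- hypothetical sketch: how closely does dateOfDeath
// alone reproduce the status label?
const characters = require('./characters.json');

let matches = 0;
for (const c of characters) {
  // trivial rule: a character with a recorded date of death is dead
  const guess = c.dateOfDeath ? 'dead' : 'alive';
  if (guess === c.status) matches++;
}

console.log(`dateOfDeath alone matches status for ${matches} of ${characters.length} characters`);
```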
Evaluated the attributes using InfoGainAttributeEval and got this ranking:

    Ranked attributes:
     0.245386285658262144   4 house
     0.102228084493921184   5 title
     0.04709825838396464    1 dateOfBirth
     0.027357430748710332   3 culture
     0.020562054117495676  13 gender
     0.019577176299923616  15 links
     0.019577176299923616  16 connections
     0.016114926878331604  14 score
     0.014613599308347936  19 hasTitle
     0.011421676273234316  17 hasHeir
     0.010023205029333404  23 isNoble
     0.007056916728209296  11 age
     0.005446717931224088  18 hasHeirAlive
     0.004316973941269065  22 isSpouseAlive
     0.003776338882749642  12 ageGroup
     0.001812264753613624   9 house_founded
     0.000399081592559747  21 hasSpouse
     0.000391338498300642  24 multipleBooks
     0.000191120232670539  20 hasHouse
     0.000000000000001665   7 mother
     0.000000000000001665   8 heir
     0                      2 dateOfDeath
     0                     10 num_houses_overlord
    -0.000000000000000333   6 father

I started removing attributes one by one from the bottom of the ranking, skipping dateOfDeath. The accuracy does not change after removing father, num_houses_overlord, heir, mother, and hasHouse. These five have about a 99% missing rate.
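For anyone curious what InfoGainAttributeEval is doing under the hood, here is a rough re-implementation for a single discrete attribute. This is a sketch only: Weka additionally discretizes numeric attributes first, which this skips, and the row shape is an assumption.

```js
// infoGain.js -- hypothetical sketch of information gain for one attribute.

// Shannon entropy of a list of class labels, in bits
function entropy(labels) {
  const counts = {};
  for (const l of labels) counts[l] = (counts[l] || 0) + 1;
  return Object.values(counts).reduce((h, n) => {
    const p = n / labels.length;
    return h - p * Math.log2(p);
  }, 0);
}

function infoGain(rows, attr) {
  const base = entropy(rows.map((r) => r.status));
  // partition the rows by attribute value, bucketing missing values as '?'
  const groups = {};
  for (const r of rows) {
    const v = r[attr] == null ? '?' : r[attr];
    (groups[v] = groups[v] || []).push(r.status);
  }
  // expected entropy of the class after splitting on the attribute
  const after = Object.values(groups).reduce(
    (sum, g) => sum + (g.length / rows.length) * entropy(g),
    0
  );
  return base - after; // higher = attribute tells us more about status
}

// toy data, made up for illustration
const rows = [
  { house: 'stark', status: 'dead' },
  { house: 'stark', status: 'dead' },
  { house: 'tyrell', status: 'alive' },
];
console.log(infoGain(rows, 'house')); // ~0.918 bits on this toy data
```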
Nice! I like the fact that house affiliation has such a strong impact on the prediction.
What's the feature named "score"? What does it stand for?
Links, connections and score all come from the pagerank algorithm. Links is the normalized number of links leading to the page. Connections is similar to links, but it doesn't take multiple links from other pages into account (you either have a connection or you don't; the actual number of links is irrelevant). Score is basically a normalization of the score attribute of the pagerank data.
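A quick sketch of the kind of normalization meant here, assuming the raw pagerank output is an array of `{ name, score }` records. Min-max scaling to [0, 1] is my assumption of the scheme, not necessarily the exact one the script uses:

```js
// normalizeScore.js -- hypothetical sketch: min-max scale pagerank scores to [0, 1]
function normalizeScores(records) {
  const scores = records.map((r) => r.score);
  const min = Math.min(...scores);
  const max = Math.max(...scores);
  return records.map((r) => ({
    ...r,
    score: max === min ? 0 : (r.score - min) / (max - min),
  }));
}

const raw = [
  { name: 'jon snow', score: 3.2 }, // placeholder values
  { name: 'arya', score: 1.1 },
  { name: 'hodor', score: 0.4 },
];
console.log(normalizeScores(raw)); // hodor -> 0, jon snow -> 1
```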
Very nice discussion! Comments so far:
Check this out for more features that were tested:
The article is nice, but as far as I understood, the only feature tested was chapters per book, and this is something not currently in our DB. Plus, it deals only with major characters that have dedicated PoV chapters, while we deal with pretty much the whole GoT universe. I do not know how applicable that is in our case. We can research a bit more which features others tried to apply.
True. Good observation. Thanks for pointing it out.
Hi Group 7,
I overlooked the test4.zip file with the Naive Bayes results from yesterday, sorry about that. Naive Bayes (the simplest ML algorithm ever :)) seems to do a really good job and should probably be the algorithm of your choice. Regarding the PLODs: for each character, Weka provides a probability score for each of the two classes. The probability from the second column is the PLOD score. For example, the PLOD score for 'raymun fossoway' from the file in the zip folder is 27% (please round percentages to integer values), and this is the value that needs to be forwarded to group A.
Btw, with the feature set that you selected, the predictions of dead characters seem not to change. For the data set of 2K+ characters I can see that you predict 153 characters as dead, while on the smaller set this number is 131. Do these 131 form a subset of the 153? If yes, then please go ahead and use the predictions on the full data set for forwarding to A. Oh, and how do the features of the few dead characters that are predicted to be alive look? Can it be that these are the characters who died but respawned? If so, wow, very cool!
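Putting the PLOD rule above into code, here is a minimal sketch of the extraction. The exact column layout of Weka's prediction output differs between versions, so the line format assumed in the comment is illustrative; `parseWekaOutput.js` remains the authoritative implementation.

```js
// plod.js -- hypothetical sketch of extracting the PLOD from one Weka
// prediction line. Assumed line format (varies by Weka version):
//   "    12   1:alive   1:alive        0.73,0.27"
// where the last field is the class probability distribution (alive,dead).
function plodFromLine(line) {
  const fields = line.trim().split(/\s+/);
  const dist = fields[fields.length - 1].split(','); // e.g. ["0.73", "0.27"]
  const pDead = parseFloat(dist[1]); // second column = probability of 'dead'
  return Math.round(pDead * 100);    // round to integer %, as requested
}

console.log(plodFromLine('    12   1:alive   1:alive        0.73,0.27')); // -> 27
```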
So, I corrected a bug in parseWekaOutput.js. PLODs are now calculated correctly. :D I repeated the Weka runs without the minor attributes and dateOfDeath, using a) the set of characters filtered by popularity (954) and b) the unfiltered data set (1939). (No characters with 'house of ...', etc. included.) So far NaiveBayes seems to be the best option (see zip file): test5.zip
=== Confusion Matrix === (popularity filter)

       a   b   <-- classified as
     514 155 |   a = alive
     167 118 |   b = dead

Top PLODs (popularity filter)

=== Confusion Matrix === (no filter)

        a    b   <-- classified as
     1299  174 |    a = alive
      333  133 |    b = dead

Top PLODs (no filter)
Finally some characters I know. :) Many of the characters at the top of the [no filter] list are predicted to be alive in the [popularity filter] run. As for the characters who are actually alive and predicted to be dead: the ones in [popularity filter] are not a subset of the ones in [no filter] (69 missing).
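The subset question is a one-liner worth pinning down precisely; a sketch, assuming the alive-predicted-as-dead names from each run are available as arrays (the names below are placeholders):

```js
// subsetCheck.js -- hypothetical sketch of the subset comparison above
const filteredRunDead = ['galazza galare', 'tormund', 'howland reed']; // placeholder names
const fullRunDead = ['galazza galare', 'tormund'];                     // placeholder names

const fullSet = new Set(fullRunDead);
const missing = filteredRunDead.filter((name) => !fullSet.has(name));

// an empty `missing` array would mean the filtered list is a subset
console.log(`${missing.length} characters not in the [no filter] list:`, missing);
```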
@dan736923 thanks for correcting the bug, though it seems it was not the last one... ;-)
Comparing the old results with the new ones, you can see that the results for alive characters did not change through the transition from old to new; however, the results for dead characters did -> there are many more dead characters misclassified as alive. Why did this happen?
NB NEW - full set

        a    b   <-- classified as
     1299  174 |    a = alive
      333  133 |    b = dead

NB OLD - full set

        a    b   <-- classified as
     1320  153 |    a = alive
        2  462 |    b = dead

NB NEW - filtered set

       a   b   <-- classified as
     514 155 |   a = alive
     167 118 |   b = dead

NB OLD - filtered set

       a   b   <-- classified as
     514 155 |   a = alive
      10 275 |   b = dead
@goldbergtatyana I think a lot of this has to do with the old models being (incorrectly) trained with the dateOfDeath attribute, which is an almost 1-to-1 death indicator (overfitting?).
These last predictions are more or less our true predictions, since we started training without dateOfDeath. I am mostly interested in what the dead characters predicted as alive share in common.
@konstantinos-angelo Agree, there was a one-to-one relationship between dateOfDeath and the prediction. So, the last model you provided, the one developed on the full dataset, is the one to use!
Since we have one more day until the delivery deadline, I would love for you to try one more thing:
Each character in your prediction set has a PLOD assigned. By default, a PLOD of 50 discriminates between dead and alive characters. Would the performance of your model change if you lowered the PLOD cutoff (i.e. had more characters predicted as dead and fewer as alive)?
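A sketch of that experiment, assuming we have the cross-validation predictions as an array of `{ plod, status }` records (the sample records below are made up):

```js
// thresholdSweep.js -- hypothetical sketch: recompute the confusion matrix
// while lowering the dead/alive cutoff below the default PLOD of 50
const records = [
  { plod: 73, status: 'dead' },
  { plod: 45, status: 'dead' },
  { plod: 27, status: 'alive' }, // placeholder data
];

function confusionAt(records, cutoff) {
  const m = { deadAsDead: 0, aliveAsDead: 0, aliveAsAlive: 0, deadAsAlive: 0 };
  for (const r of records) {
    const predictedDead = r.plod >= cutoff;
    if (predictedDead) {
      if (r.status === 'dead') m.deadAsDead++; else m.aliveAsDead++;
    } else {
      if (r.status === 'dead') m.deadAsAlive++; else m.aliveAsAlive++;
    }
  }
  return m;
}

// sweep the cutoff downward and watch how the matrix shifts
for (const cutoff of [50, 40, 30, 20]) {
  console.log(cutoff, confusionAt(records, cutoff));
}
```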
Hey group 7, today is the project deadline! Please provide group A with your function that computes a PLOD for each character, and close the issue. Also, do not forget to write a short summary here of the development of your prediction model and the results.
Minor Update:
The age and ageGroup features are not independent. The links, score and connections features are redundant and mutually dependent, since their distributions and meanings are very similar.
Look at the ranked attributes: connections and links have the same value:

    0.140813   3 house
    0.057721   4 title
    0.041583  10 connections
    0.041583   9 links
    0.020813   8 score
    0.019474   2 culture
    0.017936   1 dateOfBirth
    0.013095   7 gender
    0.006013  14 hasHouse
    0.005081  18 multipleBooks
    0.004775  11 hasHeir
    0.003113  13 hasTitle
    0.002646  15 hasSpouse
    0.002198  17 isNoble
    0.002079  12 hasHeirAlive
    0.001826   6 age
    0.000854  16 isSpouseAlive
    0.000743   5 house_founded
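To illustrate why identical features are a problem for Naive Bayes specifically: the independence assumption makes the classifier multiply the same likelihood in twice, double-counting the evidence. A toy posterior calculation, with made-up numbers:

```js
// nbDoubleCount.js -- toy demonstration of evidence double-counting
const prior = { alive: 0.7, dead: 0.3 };
// made-up likelihood of observing "few links" given each class
const pFewLinks = { alive: 0.4, dead: 0.8 };

// P(dead | evidence) when the same feature is included `copies` times
function posteriorDead(copies) {
  const a = prior.alive * Math.pow(pFewLinks.alive, copies);
  const d = prior.dead * Math.pow(pFewLinks.dead, copies);
  return d / (a + d);
}

console.log(posteriorDead(1).toFixed(3)); // 0.462 with the feature counted once
console.log(posteriorDead(2).toFixed(3)); // 0.632 when a duplicate sneaks in
```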
After removing ageGroup, connections and score (links and connections carry the same information, so we kept only links), we now have 16 features for learning:
culture
house
title
house_founded
age
gender
links
hasHeir
hasHeirAlive
hasTitle
hasHouse
hasSpouse
isSpouseAlive
isNoble
multipleBooks
Result before removing these three features:

    === Confusion Matrix ===

        a    b   <-- classified as
     1315  158 |    a = alive
      327  142 |    b = dead

Result after removing these three features:

    === Confusion Matrix ===

        a    b   <-- classified as
     1331  142 |    a = alive
      331  138 |    b = dead
Summary of prediction model:

Classifier: Naive Bayes
Feature set: culture, house, title, house_founded, age, gender, links, hasHeir, hasHeirAlive, hasTitle, hasHouse, hasSpouse, isSpouseAlive, isNoble, multipleBooks
Result: test5.zip
Hi @s-feng, I thought we agreed to use the model with the results

    NB NEW - full set
        a    b   <-- classified as
     1299  174 |    a = alive
      333  133 |    b = dead

as the final model, and that you would provide the function delivering the PLODs to Group A by today (Friday) :)
Please do not forget to also briefly summarize the development of your method here.
@s-feng @nicoladesocio @dan736923 @konstantinos-angelo Hi there, did the delivery to Group A happen?
Summary of prediction model development:
Initially we used SMO with a polynomial kernel as our machine learning method. Although the overall prediction accuracy of the preliminary result was 76%, the accuracy for the alive class was 95% while the accuracy for the dead class was only 15%. The accuracy for the dead class, which is the one that matters to us, was far from sufficient.
By inspecting the dataset we found that some features have very high missing-value rates. For an incomplete dataset, a Bayesian method should give better results than an SVM, so we then tried the group of Bayes classifiers in Weka. We found that Naive Bayes gives the best result, especially for the dead class.
Our dataset has 26 features, and they are not equally important for classification. We selected features according to the ranking from Weka's InfoGainAttributeEval attribute evaluator. Another criterion for feature selection is that Naive Bayes assumes all features are independent. Ultimately we chose 16 features for our prediction model.
@s-feng please also provide the size of the data set, the features you used and the performance measures of your final model. thanks!
1939 characters
culture
house
title
house_founded
age
ageGroup
gender
score
links
connections
hasHeir
hasHeirAlive
hasTitle
hasHouse
hasSpouse
isSpouseAlive
isNoble
multipleBooks
status
    === Stratified cross-validation ===
    === Summary ===

    Correctly Classified Instances        1432               73.8525 %
    Incorrectly Classified Instances       507               26.1475 %
    Kappa statistic                          0.1894
    Mean absolute error                      0.3181
    Root mean squared error                  0.4449
    Relative absolute error                 87.0854 %
    Root relative squared error            104.1132 %
    Total Number of Instances             1939

    === Detailed Accuracy By Class ===

                   TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
                   0.882    0.715    0.796      0.882   0.837      0.662     alive
                   0.285    0.118    0.433      0.285   0.344      0.662     dead
    Weighted Avg.  0.739    0.571    0.709      0.739   0.718      0.662

    === Confusion Matrix ===

        a    b   <-- classified as
     1299  174 |    a = alive
      333  133 |    b = dead
@s-feng the list of features contains 19 elements, while the description of your final model says that this number is 16. Which of the two numbers is correct?
Also, please explain what the following features from your list mean:
Thank you!
A few more features are not clear to me. These are:
Also, please provide the list of features in a sorted order, from most to least contributing.
house_founded = legacy feature: how old the house a character belongs to is.
ageGroup = the ages seemed very arbitrary in the way Weka treated them, as integer or continuous values; there were also characters with an age of 200000+. ageGroup bins them into three groups: 10 (kids), 60 (fighting age), and old people. So we tried to limit the ages to clear groups that are meaningful, rather than keeping age as a continuous attribute.
score/links/connections = from the pagerank algorithm. Score is the normalized PG score. Links is the number of links that lead to a character's page. Connections records whether two characters are connected at all, based on whether there is a link between them. The difference between links and connections is that there were cases where a character had up to 5 (I think?) identical links to another character; we count and normalize those in links, but in connections we count them as one. E.g. Jon Snow has 5 links to Arya => links = 5, conn = 1.
status = the classification result, alive/dead.
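A sketch of the links/connections split using the Jon Snow example, assuming the raw pagerank data gives, per character, the list of link targets on their page with repeats included:

```js
// linkFeatures.js -- hypothetical sketch of the links vs. connections distinction
const rawLinks = ['arya', 'arya', 'arya', 'arya', 'arya']; // 5 links, all to the same page

const links = rawLinks.length;               // 5 -- repeats counted (normalized later)
const connections = new Set(rawLinks).size;  // 1 -- a connection either exists or not

console.log({ links, connections }); // { links: 5, connections: 1 }
```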
@konstantinos-angelo I liked the answer, but honestly did not really get what the score is. Can you please explain it very simply again? Thank you.
Sure, but I can't explain much either :P
Ok so... Guy's (or was it Dmitrii's?) original PR script was providing 3 ranking attributes: links, relevance and score. Our links and connections are derived from pr-links, as explained. pr-score is the pagerank grade derived from the algorithm as a correlation between relevance and the number of links.
So our score is the normalization of the score provided by PR.
This is a quick and dirty explanation, as I am not at my PC right now. If you still think more explanation is needed, I will write a detailed one later. My apologies for this.
@goldbergtatyana sorry I haven't replied in time; I have had exams recently. The list of features with 19 elements is our final version. The list with 16 features was based on the assumption that Naive Bayes is sensitive to redundant attributes, but the result with these 16 features shows no definitive difference from the version with 19 elements.
Hi Tatyana,
We observed a better classification result using the RBFNetwork classifier. The result is shown below:
    === Stratified cross-validation ===
    === Summary ===

    Correctly Classified Instances         904               94.958  %
    Incorrectly Classified Instances        48                5.042  %
    Kappa statistic                          0.8803
    Mean absolute error                      0.0868
    Root mean squared error                  0.2143
    Relative absolute error                 20.7623 %
    Root relative squared error             46.8889 %
    Total Number of Instances              952

    === Confusion Matrix ===

       a   b   <-- classified as
     641  28 |   a = alive
      20 263 |   b = dead
Shall we change the learning algorithm to the RBFNetwork? It seems to have good enough accuracy, and it's faster too.