Rostlab / JS16_ProjectB_Group7

Game of Thrones characters are always in danger of being eliminated. The challenge in this assignment is to see how much risk the characters that are still alive have of being eliminated. The goal of this project is to rank characters by their Percentage Likelihood of Death (PLOD). You will assign a PLOD using machine learning approaches.
GNU General Public License v3.0

Change SMO-poly to RBFNetwork? #20

Open s-feng opened 8 years ago

s-feng commented 8 years ago

Hi Tatyana,

We observed a better classification result using the RBFNetwork classifier. The result is shown below:

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances         904               94.958  %
Incorrectly Classified Instances        48                5.042  %
Kappa statistic                          0.8803
Mean absolute error                      0.0868
Root mean squared error                  0.2143
Relative absolute error                 20.7623 %
Root relative squared error             46.8889 %
Total Number of Instances              952

=== Confusion Matrix ===

    a    b   <-- classified as
  641   28 |   a = alive
   20  263 |   b = dead

Shall we change the learning algorithm to RBFNetwork? It seems to have good enough accuracy, and it's faster too.

s-feng commented 8 years ago

@goldbergtatyana

goldbergtatyana commented 8 years ago

28 alive characters who are predicted to be dead?? Who are they? :)))

The result looks very good, indeed. How many characters did you remove, and what features are you using? How does your performance look if you compare yourself to random?

Did you also try other ML methods? The speed, btw, is very much unimportant, since we run the ML algorithm only once. What matters are the results!

It would be awesome if you provided us with constant updates of your results here!

s-feng commented 8 years ago

Just tried Naive Bayes. Got a result like this:

Time taken to build model: 0.01 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances         819               86.0294 %
Incorrectly Classified Instances       133               13.9706 %
Kappa statistic                          0.7045
Mean absolute error                      0.177
Root mean squared error                  0.3044
Relative absolute error                 42.335  %
Root relative squared error             66.5947 %
Total Number of Instances              952

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
                 0.804     0.007      0.996     0.804     0.89       0.993    alive
                 0.993     0.196      0.682     0.993     0.809      0.993    dead
Weighted Avg.    0.86      0.063      0.903     0.86      0.866      0.993

=== Confusion Matrix ===

    a    b   <-- classified as
  538  131 |   a = alive
    2  281 |   b = dead

@dan736923 @konstantinos-angelo @nicoladesocio

nicoladesocio commented 8 years ago

Which filter did you use? The one in the develop branch?

nicoladesocio commented 8 years ago

@s-feng Btw, if you try classificationViaRegression (in the meta folder), it gives really impressive results. @goldbergtatyana Probably it overfits the data?

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances         959               99.7919 %
Incorrectly Classified Instances         2                0.2081 %
Kappa statistic                          0.9951
Mean absolute error                      0.026
Root mean squared error                  0.0643
Relative absolute error                  6.1641 %
Root relative squared error             14.001  %
Total Number of Instances              961

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
                 1         0.007      0.997     1         0.999      0.995    alive
                 0.993     0          1         0.993     0.997      0.995    dead
Weighted Avg.    0.998     0.005      0.998     0.998     0.998      0.995

=== Confusion Matrix ===

    a    b   <-- classified as
  671    0 |   a = alive
    2  288 |   b = dead

s-feng commented 8 years ago
Feature              Missing rate
dateOfBirth 61%
dateOfDeath 72%
culture 64%
house 11%
title 43%
father 97%
mother 98%
heir 98%
house_founded 90%
num_houses_overlord 100%
age 63%
ageGroup 63%
gender 0%
score 0%
links 0%
connections 0%
hasHeir 0%
hasHeirAlive 1%
hasTitle 10%
hasHouse 0%
hasSpouse 0%
isSpouseAlive 6%
isNoble 0%
multipleBooks 0%
status 0%

NaiveBayes:

=== Confusion Matrix ===

     a    b   <-- classified as
  1320  153 |   a = alive
     2  462 |   b = dead

name: 'galazza galare'    PLOD: 49.8%  predicted as: dead
name: 'tormund'           PLOD: 49.3%  predicted as: dead
name: 'howland reed'      PLOD: 38%    predicted as: dead
name: 'tyrion lannister'  PLOD: 34.2%  predicted as: dead
name: 'ardrian celtigar'  PLOD: 19.6%  predicted as: dead
name: 'olenna redwyne'    PLOD: 18.9%  predicted as: dead
name: 'robert arryn'      PLOD: 15.1%  predicted as: dead
name: 'jaime lannister'   PLOD: 13.7%  predicted as: dead

s-feng commented 8 years ago

Classification via regression predicts no alive characters as dead. It looks overfitted, possibly due to too high a polynomial order.

k-angelo commented 8 years ago

@s-feng What are the most important attributes for the character subset, btw?

nicoladesocio commented 8 years ago

@s-feng I tried with more characters and more validation, and it still performs very well. There is also one alive character predicted as dead. However, it's true that fewer alive characters are predicted as dead than dead predicted as alive. This was done with 1761 characters and 20-fold cross-validation.

=== Summary ===

Correctly Classified Instances        1745               99.0914 %
Incorrectly Classified Instances        16                0.9086 %
Kappa statistic                          0.976
K&B Relative Info Score             165601.9959 %
K&B Information Score                 1357.7942 bits      0.771  bits/instance
Class complexity | order 0            1443.9304 bits      0.8199 bits/instance
Class complexity | scheme              123.3226 bits      0.07   bits/instance
Complexity improvement     (Sf)       1320.6078 bits      0.7499 bits/instance
Mean absolute error                      0.0257
Root mean squared error                  0.095
Relative absolute error                  6.7578 %
Root relative squared error             21.7782 %
Total Number of Instances             1761

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
                 0.997     0.027      0.991     0.997     0.994      0.994    alive
                 0.973     0.003      0.991     0.973     0.982      0.994    dead
Weighted Avg.    0.991     0.021      0.991     0.991     0.991      0.994

=== Confusion Matrix ===

     a    b   <-- classified as
  1307    4 |   a = alive
    12  438 |   b = dead

PS: How do you compute the PLOD?

dan736923 commented 8 years ago

Here are the results for Naive Bayes. Data set consists of 954 characters generated with the latest version of createARFF.js. See zip file for details.

=== Summary ===

Correctly Classified Instances         789               82.7044 %
Incorrectly Classified Instances       165               17.2956 %
Kappa statistic                          0.6398
Mean absolute error                      0.207
Root mean squared error                  0.3306
Relative absolute error                 49.3779 %
Root relative squared error             72.2279 %
Total Number of Instances              954

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
                 0.768     0.035      0.981     0.768     0.862      0.98     alive
                 0.965     0.232      0.64      0.965     0.769      0.98     dead
Weighted Avg.    0.827     0.094      0.879     0.827     0.834      0.98

=== Confusion Matrix ===

    a    b   <-- classified as
  514  155 |   a = alive
   10  275 |   b = dead

RANK NAME PLOD

  1. areo hotah 0.497
  2. lord commander hoare 0.497
  3. mord 0.496
  4. leo blackbar 0.496
  5. mariya darry 0.494
  6. marlon manderly 0.493
  7. brella 0.491
  8. ramsay snow 0.486
  9. antario jast 0.484
  10. aeron greyjoy 0.48 ...

Attributes contribution: I removed dateOfBirth, father, mother, heir, num_houses_overlord since they had a negative ranking. Then I reevaluated the attributes and got these results:

dateOfDeath: 0.11797
house: 0.1002
gender: 0.05252
culture: 0.03868
age: 0.02446
ageGroup: 0.02282
hasTitle: 0.02002
hasHeir: 0.01981
hasHouse: 0.01855
isSpouseAlive: 0.01462
hasSpouse: 0.01289
title: 0.01258
score: 0.01242
links: 0.01229
connections: 0.01229
multipleBooks: 0.00901
hasHeirAlive: 0.00587
isNoble: 0.00503
house_founded: 0.00341

test4.zip

k-angelo commented 8 years ago

Isn't it strange that we have alive characters classified as dead, but their PLOD is < 0.5? Shouldn't it be above that threshold?

Also, can you retrain without dateOfDeath?

dan736923 commented 8 years ago

=== Summary ===

Correctly Classified Instances         632               66.2474 %
Incorrectly Classified Instances       322               33.7526 %
Kappa statistic                          0.1846
Mean absolute error                      0.3561
Root mean squared error                  0.4623
Relative absolute error                 84.9621 %
Root relative squared error            101.0062 %
Total Number of Instances              954

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
                 0.768     0.586      0.755     0.768     0.761      0.689    alive
                 0.414     0.232      0.432     0.414     0.423      0.689    dead
Weighted Avg.    0.662     0.48       0.658     0.662     0.66       0.689

=== Confusion Matrix ===

    a    b   <-- classified as
  514  155 |   a = alive
  167  118 |   b = dead

k-angelo commented 8 years ago

That certainly is interesting. I am not sure we should use dateOfDeath in our results, since it seems like we are overfitting our data. There is almost a 1-to-1 connection between dod and PLOD.

We can retry regression, I guess, and see what happens.

Generally my feeling is that we have a good result when we classify as few dead people as alive as possible. So we confirm that the dead are dead, and then whatever discrepancy we have with alive characters classified as dead is what we are looking for.

k-angelo commented 8 years ago

Correction : we should NOT use date of death for our calculation

s-feng commented 8 years ago

Evaluated attributes by using InfoGainAttributeEval, got this ranking:

Ranked attributes:
 0.245386285658262144    4 house
 0.102228084493921184    5 title
 0.04709825838396464     1 dateOfBirth
 0.027357430748710332    3 culture
 0.020562054117495676   13 gender
 0.019577176299923616   15 links
 0.019577176299923616   16 connections
 0.016114926878331604   14 score
 0.014613599308347936   19 hasTitle
 0.011421676273234316   17 hasHeir
 0.010023205029333404   23 isNoble
 0.007056916728209296   11 age
 0.005446717931224088   18 hasHeirAlive
 0.004316973941269065   22 isSpouseAlive
 0.003776338882749642   12 ageGroup
 0.001812264753613624    9 house_founded
 0.000399081592559747   21 hasSpouse
 0.000391338498300642   24 multipleBooks
 0.000191120232670539   20 hasHouse
 0.000000000000001665    7 mother
 0.000000000000001665    8 heir
 0                       2 dateOfDeath
 0                      10 num_houses_overlord
-0.000000000000000333    6 father

I started removing attributes one after another from the bottom of the ranking, skipping dateOfDeath. The accuracy doesn't change after removing father, num_houses_overlord, heir, mother, and hasHouse. These five have about 99% missing rates.

gyachdav commented 8 years ago

Nice! I like the fact that house affiliation has such a strong impact on the prediction.

what's the feature named "score"? what does it stand for?


k-angelo commented 8 years ago

Links, connections, and score all come from the PageRank algorithm. Links is the normalized number of links leading to the character's page. Connections is similar to links, but it doesn't take multiple links from other pages into account (you either have a connection or you don't; the actual number of links is irrelevant). Score is basically a normalization of the score attribute of the PageRank data.
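A hypothetical sketch of the links vs. connections distinction described above. The input shape (a flat list of link targets from a character's page, with repeats) and the max-normalization are assumptions for illustration, not the project's actual PageRank script:

```javascript
// Hypothetical feature extraction: "links" counts every link, while
// "connections" counts each linked character only once.
function linkFeatures(linkTargets) {
  return {
    links: linkTargets.length,              // every link counts
    connections: new Set(linkTargets).size, // each target counts once
  };
}

// Normalize a raw feature column across all characters to [0, 1]
// (assumed normalization scheme, not necessarily the one used here).
function normalize(values) {
  const max = Math.max(...values, 0);
  return values.map(v => (max > 0 ? v / max : 0));
}
```

So five links from one page to the same character would give `links = 5` but `connections = 1`.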

goldbergtatyana commented 8 years ago

Very nice discussion! Comments so far:

  1. I would remove date of death (dod) from the feature set, as alive characters would not have a dod and so your ML approach learns nothing from using it
  2. you should not use a regression model. I'm wondering why regression works for you at all. The difference between regression and classification is that regression predicts a continuous value (e.g. what the price of a car will be, given that 10 similar ones were already sold), while classification predicts a class (e.g. alive vs. dead)
  3. classification speed is unimportant!
  4. if the results on the full dataset (2000 characters - btw, it contains no 'house of ...' as a character, right?) are good, then please use the full data set for this project
goldbergtatyana commented 8 years ago
  1. how come is the plod <50?
gyachdav commented 8 years ago

check this out for more feature that were tested:

http://www.techtimes.com/articles/16812/20140930/can-statistics-tell-us-who-dies-next-on-game-of-thrones.htm

k-angelo commented 8 years ago

The article is nice, but as far as I understood, the only feature tested was chapters per book, which is not currently in our DB. Plus, it deals only with major characters that have dedicated PoV chapters, while we deal with pretty much the whole GoT universe. I don't know how applicable that is in our case. We can research more features that others have tried.

gyachdav commented 8 years ago

True. Good observation. Thanks for pointing it out.


goldbergtatyana commented 8 years ago

Hi Group 7,

I overlooked the test4.zip file for Naive Bayes results from yesterday, sorry about that. Naive Bayes (the simplest ML algorithm ever :)) seems to do a really good job and should probably be the algorithm of your choice. To the PLODs: for each character, Weka provides a probability score for each of two classes. The probability from the second column is the PLOD score. For example, the PLOD score for 'raymun fossoway' from the file in the zip folder is 27% (please round % to integer values) and this is the value that needs to be forwarded to group A.
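The PLOD extraction described above can be sketched in a few lines. The prediction-line format here (Weka's `-p` output with a trailing `p_alive,p_dead` distribution column, `*` marking the predicted class) is an assumption; the project's parseWekaOutput.js may parse something slightly different:

```javascript
// Turn a Weka per-instance prediction line into an integer PLOD.
// Assumed line shape: "  12   1:alive   1:alive   *0.73,0.27"
function plodFromPredictionLine(line) {
  // Grab the trailing "x,y" distribution column and strip the "*".
  const m = line.match(/(\*?[\d.]+),(\*?[\d.]+)\s*$/);
  if (!m) return null;
  const pDead = parseFloat(m[2].replace('*', ''));
  // The second column (probability of the "dead" class) is the PLOD,
  // rounded to an integer percentage before forwarding to group A.
  return Math.round(pDead * 100);
}
```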

goldbergtatyana commented 8 years ago

Btw, with the feature set that you selected, the predictions of dead characters seem not to change. For the data set of 2+K characters I can see that you predict 153 characters as dead, while on the smaller set this number is 131. Do these 131 form a subset of the 153? If yes, then please go ahead and use the predictions on the full data set for forwarding them to A. Oh, and how do the features of the few dead characters that are predicted to be alive look? Can it be that these are the characters who died but respawned again? If so, wow, very cool!

dan736923 commented 8 years ago

So, I corrected a bug in parseWekaOutput.js. PLODs are now calculated correctly. :D I repeated the Weka runs without the minor attributes and dateOfDeath using a) the set of characters filtered by popularity (954), b) the unfiltered data set (1939). (No characters with 'house of ...', etc included) So far NaiveBayes seems to be the best option (see zip file). test5.zip

=== Confusion Matrix === (popularity filter)

    a    b   <-- classified as
  514  155 |   a = alive
  167  118 |   b = dead

Top PLODs (popularity filter)

  1. rhae targaryen 0.999
  2. aurane waters 0.99
  3. valaena velaryon 0.987
  4. illyrio mopatis 0.985
  5. rhaelle targaryen 0.977
  6. ardrian celtigar 0.95
  7. dunstan drumm 0.931
  8. tytos blackwood 0.93
  9. jonos bracken 0.92
  10. rolly duckfield 0.91

=== Confusion Matrix === (no filter)

     a    b   <-- classified as
  1299  174 |   a = alive
   333  133 |   b = dead

Top PLODs (no filter)

  1. cersei lannister 1
  2. roose bolton 1
  3. sansa stark 1
  4. tyrion lannister 1
  5. barristan selmy 1
  6. davos seaworth 1
  7. jaime lannister 1
  8. petyr baelish 1
  9. daenerys targaryen 1
  10. aurane waters 0.999
  11. theon greyjoy 0.999
  12. rhae targaryen 0.999
  13. jon snow 0.998
  14. walder frey 0.998
  15. tommen baratheon 0.997
  16. aegon targaryen son of baelon 0.995
  17. aegon i targaryen 0.995
  18. illyrio mopatis 0.994
  19. edmure tully 0.993
  20. varys 0.992
  21. jorah mormont 0.99
  22. mace tyrell 0.987
  23. joffrey baratheon 0.986
  24. stannis baratheon 0.985
  25. victarion greyjoy 0.98
  26. margaery tyrell 0.977
  27. bran stark 0.974

Finally some characters I know. :) Many of the characters at the top of the [no filter] list are predicted to be alive in the [popularity filter] run. As for the characters who are actually alive but predicted to be dead: the ones in [popularity filter] are not a subset of the ones in [no filter] (69 missing).

goldbergtatyana commented 8 years ago

@dan736923 thanks for correcting the bug, though it seems it was not the last one... ;-)

Comparing the old results with the new ones, you can see that the results for alive characters did not change through the transition from old to new; however, the results for dead characters did -> there are many more dead characters misclassified as alive. Why did that happen?

NB NEW - full set

     a    b   <-- classified as
  1299  174 |   a = alive
   333  133 |   b = dead

NB OLD - full set

     a    b   <-- classified as
  1320  153 |   a = alive
     2  462 |   b = dead

NB NEW - filtered set

    a    b   <-- classified as
  514  155 |   a = alive
  167  118 |   b = dead

NB OLD - filtered set

    a    b   <-- classified as
  514  155 |   a = alive
   10  275 |   b = dead

k-angelo commented 8 years ago

@goldbergtatyana I think a lot of this has to do with the fact that the old datasets were incorrectly trained with the dateOfDeath attribute, which is an almost 1-to-1 death indicator (overfitting?).

These last predictions are more or less our true predictions, since we now train without dod. I am mostly interested in what the dead characters predicted as alive have in common.

goldbergtatyana commented 8 years ago

@konstantinos-angelo Agree, there was a one to one relationship between dod and the prediction. So, the last model you provided and that was developed on the full dataset is the one to use!

Since we have one more day until the delivery deadline, I would love having you try one more thing:

Each character in your prediction set has a PLOD assigned. By default, a PLOD of 50 discriminates between dead and alive characters. Would the performance of your model change if you lowered the threshold (i.e. had more characters predicted as dead and fewer as alive)?
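The suggested experiment can be sketched as a threshold sweep: recompute the confusion matrix at a custom PLOD cutoff instead of the default 50. The record shape (`{plod, actual}`) is a hypothetical structure for illustration, not the group's actual data format:

```javascript
// Count the four confusion-matrix cells at a given PLOD cutoff.
function confusionAtThreshold(records, threshold) {
  const cm = { deadAsDead: 0, aliveAsDead: 0, aliveAsAlive: 0, deadAsAlive: 0 };
  for (const { plod, actual } of records) {
    const predictedDead = plod >= threshold;
    if (predictedDead) {
      if (actual === 'dead') cm.deadAsDead++;
      else cm.aliveAsDead++;
    } else {
      if (actual === 'alive') cm.aliveAsAlive++;
      else cm.deadAsAlive++; // dead characters predicted as alive
    }
  }
  return cm;
}

// Sweeping the cutoff downward moves characters from "alive" to "dead":
// for (let t = 50; t >= 10; t -= 10) console.log(t, confusionAtThreshold(records, t));
```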

goldbergtatyana commented 8 years ago

Hey group 7, today is the project deadline! Please provide your function that provides a PLOD for each character to group A and close the issue. Also, do not forget to write here a short summary of the development of your prediction model and the results.

s-feng commented 8 years ago

Minor Update:

The age and ageGroup features are not independent. The links, score, and connections features are redundant and dependent, since they have very similar distributions and meanings.

Looking at the ranked attributes, connections and links have the same value:

Ranked attributes:
 0.140813    3 house
 0.057721    4 title
 0.041583   10 connections
 0.041583    9 links
 0.020813    8 score
 0.019474    2 culture
 0.017936    1 dateOfBirth
 0.013095    7 gender
 0.006013   14 hasHouse
 0.005081   18 multipleBooks
 0.004775   11 hasHeir
 0.003113   13 hasTitle
 0.002646   15 hasSpouse
 0.002198   17 isNoble
 0.002079   12 hasHeirAlive
 0.001826    6 age
 0.000854   16 isSpouseAlive
 0.000743    5 house_founded

After removing ageGroup, links, and score, we now have 16 features for learning:

          culture
          house
          title
          house_founded
          age
          gender
          links
          hasHeir
          hasHeirAlive
          hasTitle
          hasHouse
          hasSpouse
          isSpouseAlive
          isNoble
          multipleBooks

Result before removing these three features:

=== Confusion Matrix ===

     a    b   <-- classified as
  1315  158 |   a = alive
   327  142 |   b = dead

Result after removing these three features:

=== Confusion Matrix ===

     a    b   <-- classified as
  1331  142 |   a = alive
   331  138 |   b = dead

s-feng commented 8 years ago

Summary of prediction model:

Classifier: Naive Bayes

Feature set: culture, house, title, house_founded, age, gender, links, hasHeir, hasHeirAlive, hasTitle, hasHouse, hasSpouse, isSpouseAlive, isNoble, multipleBooks

Result: test5.zip

goldbergtatyana commented 8 years ago

Hi @s-feng, I thought we agreed to use the model with the results

NB NEW - full set

     a    b   <-- classified as
  1299  174 |   a = alive
   333  133 |   b = dead

as the final model and you to provide the function delivering the PLODs to Group A by today (Friday) :)

Please do not forget to also briefly summarize the development of your method here.

goldbergtatyana commented 8 years ago

@s-feng @nicoladesocio @dan736923 @konstantinos-angelo Hi there, did the delivery to Group A happen?

s-feng commented 8 years ago

Summary of prediction model development:

Initially we used SMO with a polynomial kernel as our machine learning method. Although the overall prediction accuracy of the preliminary result was 76%, the accuracy for the alive class was 95% while the accuracy for the dead class was only 15%. The accuracy for the dead class, which is important to us, was far from sufficient.

By inspecting the dataset, we found that some features have very high missing-value rates. For an incomplete dataset, Bayesian methods should give better results than SVM methods, so we tried Weka's group of Bayes classifiers. We found that Naive Bayes gives the best acceptable result, especially for the dead class.

Our dataset has 26 features. They are not equally important for classification. We selected features according to the ranking from Weka's InfoGainAttributeEval attribute evaluator. Moreover, another criterion for feature selection is that Naive Bayes assumes all features are independent. Ultimately we chose 16 features for our prediction model.
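The InfoGainAttributeEval-style ranking described above can be sketched with the textbook information-gain formula: class entropy minus the expected entropy after splitting on the feature. This is not Weka's exact implementation (Weka also discretizes numeric attributes first), and the row shape (`{feature: value, status: 'alive'|'dead'}`) is a hypothetical one:

```javascript
// Shannon entropy (in bits) of a list of class labels.
function entropy(labels) {
  const counts = {};
  for (const l of labels) counts[l] = (counts[l] || 0) + 1;
  let h = 0;
  for (const c of Object.values(counts)) {
    const p = c / labels.length;
    h -= p * Math.log2(p);
  }
  return h;
}

// Information gain of a nominal feature with respect to the class attribute.
function infoGain(rows, feature, classAttr) {
  const base = entropy(rows.map(r => r[classAttr]));
  // Group class labels by the feature's value.
  const byValue = {};
  for (const r of rows) {
    (byValue[r[feature]] = byValue[r[feature]] || []).push(r[classAttr]);
  }
  // Expected entropy after splitting on the feature.
  let remainder = 0;
  for (const group of Object.values(byValue)) {
    remainder += (group.length / rows.length) * entropy(group);
  }
  return base - remainder;
}
```

A feature that perfectly separates alive from dead gets the full class entropy as its gain; a feature with the same class distribution in every value gets 0, which is why the bottom of the ranking can be removed without hurting accuracy.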

goldbergtatyana commented 8 years ago

@s-feng please also provide the size of the data set, the features you used and the performance measures of your final model. thanks!

s-feng commented 8 years ago

Data set size:

1939 characters

Features used in test5:

          culture
          house
          title
          house_founded
          age
          ageGroup
          gender
          score
          links
          connections
          hasHeir
          hasHeirAlive
          hasTitle
          hasHouse
          hasSpouse
          isSpouseAlive
          isNoble
          multipleBooks
          status

Performance of test5:

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances        1432               73.8525 %
Incorrectly Classified Instances       507               26.1475 %
Kappa statistic                          0.1894
Mean absolute error                      0.3181
Root mean squared error                  0.4449
Relative absolute error                 87.0854 %
Root relative squared error            104.1132 %
Total Number of Instances             1939

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
                 0.882     0.715      0.796     0.882     0.837      0.662    alive
                 0.285     0.118      0.433     0.285     0.344      0.662    dead
Weighted Avg.    0.739     0.571      0.709     0.739     0.718      0.662

=== Confusion Matrix ===

a    b   <-- classified as

 1299  174 |   a = alive
  333  133 |   b = dead

goldbergtatyana commented 8 years ago

@s-feng the list of features contains 19 elements, while the description of your final model says that this number is 16. Which of the two numbers is correct?

Also, please explain what the following features from your list mean:

Thank you!

goldbergtatyana commented 8 years ago

A few more features are not clear to me. These are:

Also, please provide the list of features in a sorted order, from most to least contributing.

k-angelo commented 8 years ago

house_founded = legacy feature; how old the house a character belongs to is

ageGroup = age seemed very arbitrary in the way Weka classified it, as an integer of continuous value. Also, there were characters with age 200000+. ageGroup classifies characters into three groups: 10 (kids), 60 (fighting age), and old people. So we tried to limit the ages into clear, meaningful groups rather than a continuous attribute
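The ageGroup binning described above might look like the sketch below. The comment only names the group labels 10 (kids) and 60 (fighting age) plus "old people"; the exact cutoffs (16 and 60) and the label for old people (100) are assumptions for illustration:

```javascript
// Bin a character's raw age into a small set of nominal age groups
// (hypothetical cutoffs and "old" label).
function ageGroup(age) {
  if (age == null || Number.isNaN(age)) return '?'; // missing value marker
  if (age < 16) return 10;   // kids
  if (age < 60) return 60;   // fighting age
  return 100;                // old people (also catches 200000+ outliers)
}
```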

score/links/connections = PageRank algo. Score is the normalized PageRank score. Links is the number of links that lead to a character's page. Connections is whether two characters are connected, based on whether there is a link between them. The difference between links and connections is that there were cases where a character had up to 5 (I think?) identical links to another character, which we count and normalize in links, but in connections we count as a single one. E.g. Jon Snow has 5 links to Arya => links = 5, conn = 1.

status = the classification result, alive/dead

goldbergtatyana commented 8 years ago

@konstantinos-angelo I liked the answer, but honestly did not really get what the score is. Can you please explain very simply again? Thank you.

k-angelo commented 8 years ago

Sure, but I can't explain much either :P

Ok so... Guy's (or was it Dmitrii's?) original PR script provided three ranking attributes: links, relevance, and score. Our links and connections are derived from pr-links, as explained. pr-score is the PageRank grade derived from the algorithm as a correlation between relevance and the number of links.

So our score is the normalization of the score provided by PR.

This is a quick and dirty explanation as I am not on my PC right now. If you still think more explanation is needed I will write a detailed one later. My apologies for this.

s-feng commented 8 years ago

@goldbergtatyana Sorry I haven't replied in time; I have had exams recently. The list of features with 19 elements is our final version. The list with 16 features was based on the assumption that Naive Bayes is sensitive to redundant attributes, but the result with those 16 features shows no definitive difference from the 19-element version.