Rostlab / JS16_ProjectB_Group7

Game of Thrones characters are always in danger of being eliminated. The challenge in this assignment is to estimate how likely the characters who are still alive are to be eliminated. The goal of this project is to rank characters by their Percentage Likelihood of Death (PLOD). You will assign a PLOD using machine learning approaches.
GNU General Public License v3.0

Predictions not returned for all characters #31

Closed sacdallago closed 8 years ago

sacdallago commented 8 years ago

https://github.com/Rostlab/JS16_ProjectA/issues/129#issuecomment-205205878

sacdallago commented 8 years ago

@AlexMoroz try to see if the package really delivers predictions for all characters.. testing!

sacdallago commented 8 years ago

https://github.com/Rostlab/JS16_ProjectA/blob/master/app/controllers/filler/characters.js#L311

AlexMoroz commented 8 years ago

Working on testing

AlexMoroz commented 8 years ago
 gotarffplod tests
    predictions testing
Total number of characters:  1939
Plod exists for : 174 characters
  1) should return predictions for all characters
  0 passing (26ms)
  1 failing
  1) gotarffplod tests predictions testing should return predictions for all characters:
     Error: expected 174 to equal 1939

Done.
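The kind of coverage check behind this output can be sketched roughly as follows. This is only an illustration: the sample data and the helper name `missingPlods` are assumptions, not the actual test code of the gotarffplod package.

```javascript
// Hypothetical data: all character names and a name -> PLOD map.
const characters = ["Samwell Tarly", "Eddard Stark", "Pyat Pree"];
const predictions = { "Samwell Tarly": 0.719 };

// Returns the names that have no numeric PLOD in the predictions map.
function missingPlods(names, plods) {
  return names.filter((name) => typeof plods[name] !== "number");
}

const missing = missingPlods(characters, predictions);
console.log("Total number of characters:", characters.length);
console.log("Plod exists for:", characters.length - missing.length, "characters");
```

A test in this style would fail whenever `missing` is not empty, which is what happened here with 174 of 1939.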

AlexMoroz commented 8 years ago

I thought that predictions.json should be generated every time after init, but now you are only reading from it.

dan736923 commented 8 years ago

No. The JSON files are static. You call init() to read these files and fill the variables needed for the other functions. This is not new, we always had this approach. Also: The plod is only generated for characters that are alive and predicted to be dead.

var p = require("npm.js");
p.init();
p.getPlod("Samwell Tarly"); // 0.719
p.getPlod("Eddard Stark"); // undefined, because he's already dead

sacdallago commented 8 years ago

Ok but of the ~2k only 174 are alive? I don't get it.. What happens to them?

I would put PLOD => 100 if they are dead. And I'm still wondering what tells you which of the characters without a PLOD are dead and which ones you decided not to calculate a PLOD for :D

nicoladesocio commented 8 years ago

174 are the alive characters predicted as dead. The other ones are the alive characters predicted to be alive (so low plod).

So for the rest we say that they will remain alive, but we have the plod computed. We simply do not return it, but we return undefined, even if the characters are alive and could die. Our model simply classifies the other characters (1939 - 174) as alive characters that will not die.

We discussed this privately, and we came up with this solution.

AlexMoroz commented 8 years ago

"So for the rest we say that they will remain alive, but we have the plod computed."

The plod exists only for 174 characters in prediction.json.

goldbergtatyana commented 8 years ago

All, we need the plod score for everyone. Sorry for not making it clear enough in my previous explanations. To get the plod for characters who are predicted Alive is very simple: their plod = 100 - prob of being alive. :)

dan736923 commented 8 years ago

Ok. I had this in an earlier version of my weka parser, but was told to remove it. :D I'll change this in a moment.

goldbergtatyana commented 8 years ago

To summarize: It is ok to give plod=100 to someone who is already dead (eg Eddard Stark). Those who are predicted as dead (174 characters) get the plod from your prediction model. All others get a plod = 100 - prob of being alive (as a test: the value of this plod score should be <0.5).

dan736923 commented 8 years ago

That's a problem. If we assign 100% to dead characters, then the top-list will be full of random dead characters. Also: we have alive characters with a 100% PLOD (E.g. 'cersei lannister': 1)

goldbergtatyana commented 8 years ago

Ok, then the plod score can be either 'not applicable' or negative for already dead characters.

AlexMoroz commented 8 years ago

-1 is better than 'not applicable', if other scores are numbers

sacdallago commented 8 years ago

To me, totally ignorant, there should just be a classification for: dead, plodded, not plodded because predicted to stay alive forever. If you have two classifications that end up in one value, like right now "undefined" I cannot tell the difference.

I hope this is clear :) I don't mean to be grumpy, I'm just plain stupid and need the "PLOD explained for dummies" formatted data, that's all it is :D
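One way to keep those three cases apart is an explicit status next to the score. The sketch below is only an illustration: the record shape (`isDead`, `predictedDead`, `plod`) and the function name `plodStatus` are assumptions and not part of the package.

```javascript
// Hypothetical record shape: { name, isDead, predictedDead, plod }.
// Returns one of three labels instead of overloading "undefined".
function plodStatus(character) {
  if (character.isDead) return "dead";
  return character.predictedDead ? "predicted to die" : "predicted to stay alive";
}

// Sample records; the plod values are the ones quoted in this thread,
// the null for an already dead character is an assumption.
const sample = [
  { name: "Eddard Stark", isDead: true, predictedDead: false, plod: null },
  { name: "Samwell Tarly", isDead: false, predictedDead: true, plod: 0.719 },
  { name: "Pyat Pree", isDead: false, predictedDead: false, plod: 0.498 },
];

for (const c of sample) {
  console.log(c.name, "->", plodStatus(c), c.plod);
}
```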

dan736923 commented 8 years ago

@goldbergtatyana I made a strange observation.

Probabilities for actually alive characters predicted to be alive:
name | probability (sorted)
[ 'pyat pree', 0.498 ],
[ 'tybolt hetherspoon', 0.497 ],
[ 'timett', 0.494 ],
...
[ 'matrice', 0.002 ],
[ 'dalla dragonstone ', 0.002 ],
[ 'lamprey', 0.001 ]

Probabilities for actually alive characters predicted to be dead:
name | probability (sorted)
[ 'tyrion lannister', 1 ],
[ 'daenerys targaryen', 1 ],
[ 'roose bolton', 1 ],
[ 'barristan selmy', 1 ],
[ 'jaime lannister', 1 ],
[ 'sansa stark', 1 ],
[ 'cersei lannister', 1 ],
[ 'petyr baelish', 1 ],
[ 'davos seaworth', 1 ],
[ 'aurane waters', 0.999 ]
...
[ 'wylis manderly', 0.501 ],
[ 'jon wylde', 0.501 ],
[ 'terrence kenning', 0.5 ]

Taken from test5.zip - no-filter - Naive Bayes. So with plod = 1.0 - probability (for alive characters predicted to be alive), the plods will be greater than 50%. Does the NaiveBayes-classifier handle probabilities differently? This looks like plod = probability no matter what predicted status.

goldbergtatyana commented 8 years ago

Hah, now I'm glad I put the warning of <0.5, because we were so close (so so really close) making another mistake.

Okay, so, the probabilities you are reporting in the first part of your note are probabilities for characters to be dead! For these probs you don't need to do any subtraction anymore. Looking at the prediction file (NaiveBayes.txt), I see that all you need to do to get the PLOD scores is to fetch the results from the second probabilities column, that's all! :)

satin will get a plod of 0.249
jason mallister of 0.428
wayn (guard) 0.12
...
edmure tully 0.993

I would provide PLODs for all actually dead and alive characters - it could be these values will be useful in the future??
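The rule and the < 0.5 sanity check can be sketched as follows. This is a hedged illustration: the row shape (`predicted`, `probDead`) is an assumption, parsing of NaiveBayes.txt is not shown because its exact layout is not given here, and the 0.5 boundary for predicted-dead characters is inferred from the sorted lists above.

```javascript
// Hypothetical rows, already parsed from the classifier output.
// probDead stands for the value in the second probabilities column.
const rows = [
  { name: "pyat pree", predicted: "alive", probDead: 0.498 },
  { name: "aurane waters", predicted: "dead", probDead: 0.999 },
  { name: "terrence kenning", predicted: "dead", probDead: 0.5 },
];

// The PLOD is the probability of death as-is; no "1 - p" step.
function plodOf(row) {
  return row.probDead;
}

// Sanity check: predicted-alive characters should score below 0.5,
// predicted-dead characters at or above it (assumed boundary).
function isConsistent(row) {
  const plod = plodOf(row);
  return row.predicted === "dead" ? plod >= 0.5 : plod < 0.5;
}

const inconsistent = rows.filter((row) => !isConsistent(row));
console.log("Rows failing the sanity check:", inconsistent.length);
```

Applying the subtraction by mistake would push a predicted-alive character such as pyat pree above 0.5, and the check would flag it.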

AlexMoroz commented 8 years ago

@nicoladesocio, @dan736923 what's the status? did you fix it?

dan736923 commented 8 years ago

Fixed. PLODs are calculated for all characters, dead or alive.

AlexMoroz commented 8 years ago

@dan736923 you maybe need to update prediction.json

AlexMoroz commented 8 years ago

ah, ok, haven't seen pull request

AlexMoroz commented 8 years ago

What about these comments https://github.com/Rostlab/JS16_ProjectA/issues/129#issuecomment-205231382 and https://github.com/Rostlab/JS16_ProjectA/issues/129#issuecomment-205282000

The number of characters predicted by Group 6 (gotplod) is 1946 - which is correct, while the number of characters predicted by Group 7 (gotarffplod) is 1939 - this number is wrong.

@dan736923 but you used all of these features for the prediction of plods for 1939 characters. so, yeah, we do need plod predictions for all of them!

You still have 1939 characters, is this ok?

@dan736923 @goldbergtatyana @sacdallago

goldbergtatyana commented 8 years ago

@AlexMoroz yeah, it's ok that the numbers don't match 100%. To report the PLOD scores, Group F will have to check:

So, you see PLOD results of both groups are not dependent on each other in any way.

dan736923 commented 8 years ago

So our original ARFF input file contained 1945 characters. I removed a few characters with date of birth/death over 200000. This is the data set the JSON files in the repository are currently based on. If I repeat the ML run with all 1945 characters the prediction model gets worse. If I refetch the characters from the database of Project A now, I get a set of 2028 characters.
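A filter of this kind could look roughly like the sketch below. The field names `dateOfBirth` and `dateOfDeath` and the sample records are assumptions for illustration; only the 200000 cutoff comes from the description above.

```javascript
// Cutoff mentioned above: dates over 200000 are treated as bad data.
const MAX_PLAUSIBLE_DATE = 200000;

// Keeps a character if each date is either missing or within the cutoff.
// Field names are hypothetical.
function hasPlausibleDates(character) {
  return [character.dateOfBirth, character.dateOfDeath].every(
    (date) => date == null || date <= MAX_PLAUSIBLE_DATE
  );
}

// Made-up sample records.
const sampleCharacters = [
  { name: "A", dateOfBirth: 283, dateOfDeath: 299 },
  { name: "B", dateOfBirth: 250000 },
  { name: "C" },
];

const cleaned = sampleCharacters.filter(hasPlausibleDates);
console.log(cleaned.map((c) => c.name));
```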

AlexMoroz commented 8 years ago

Ok, we can close this)

goldbergtatyana commented 8 years ago

@dan736923 we have a set of predictions for 1946 characters from the final model that I reviewed and okayed. Let's stick with this model and its results. If we have no results for some (most probably minor, less important) characters, then this is how it is.

@AlexMoroz yep, we can close this.