sacdallago closed this 8 years ago
@AlexMoroz try to see if the package really delivers predictions for all characters.. testing!
Working on testing
gotarffplod tests
predictions testing
Total number of characters: 1939
Plod exists for : 174 characters
1) should return predictions for all characters
0 passing (26ms)
1 failing
1) gotarffplod tests predictions testing should return predictions for all characters:
Error: expected 174 to equal 1939
Done.
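The check behind this failure can be sketched roughly as follows. This is only an illustration, not the project's actual test: `missingPlods`, the stub `stubGetPlod`, and the two-name list are assumptions made up for the example. Listing the names that lack a plod, rather than only comparing counts, makes it easier to see which characters are affected.

```javascript
// Hypothetical helper: returns the names for which getPlod yields no number.
function missingPlods(names, getPlod) {
  return names.filter(function (name) {
    return typeof getPlod(name) !== "number";
  });
}

// Stub standing in for the package's getPlod
// (assumption: undefined means "no plod stored for this character").
var stubPlods = { "Samwell Tarly": 0.719 };
function stubGetPlod(name) {
  return stubPlods[name];
}

var missing = missingPlods(["Samwell Tarly", "Eddard Stark"], stubGetPlod);
console.log(missing); // names without a plod
```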
I thought that predictions.json should be generated every time after init, but now you are only reading from it.
No. The JSON files are static. You call init() to read these files and fill the variables needed for the other functions. This is not new, we always had this approach. Also: The plod is only generated for characters that are alive and predicted to be dead.
var p = require("npm.js");
p.init();
p.getPlod("Samwell Tarly"); // 0.719
p.getPlod("Eddard Stark"); // undefined, because he's already dead
Ok but of the ~2k only 174 are alive? I don't get it.. What happens to them?
I would put PLOD => 100 if they are dead. And I'm still wondering what tells you which of the characters without a PLOD are dead and which ones you decided not to calculate the PLOD for :D
174 are the alive characters predicted as dead. The other ones are the alive characters predicted to be alive (so low plod).
So for the rest we say that they will remain alive, but we have the plod computed. We simply do not return it; we return undefined instead, even if the characters are alive and could die. Our model simply classifies the other characters (1939 - 174) as alive characters that will not die.
We discussed this privately, and we came up with this solution.
So for the rest we say that they will remain alive, but we have the plod computed.
The plod exists only for 174 characters in prediction.json.
All, we need the plod score for everyone. Sorry for not making it clear enough in my previous explanations. Getting the plod for characters who are predicted alive is very simple: their plod = 100 - prob of being alive. :)
Ok. I had this in an earlier version of my weka parser, but was told to remove it. :D I'll change this in a moment.
To summarize: It is ok to give plod=100 to someone who is already dead (eg Eddard Stark). Those who are predicted as dead (174 characters) get the plod from your prediction model. All others get a plod = 100 - prob of being alive (as a test: the value of this plod score should be <0.5).
That's a problem. If we assign 100% to dead characters, then the top-list will be full of random dead characters. Also: we have alive characters with a 100% PLOD (E.g. 'cersei lannister': 1)
Ok, then the plod score can be either 'not applicable' or negative for already dead characters.
-1 is better than 'not applicable' if the other scores are numbers
To me, totally ignorant, there should just be a classification for: dead, plodded, not plodded because predicted to stay alive forever. If you have two classifications that end up in one value, like right now "undefined" I cannot tell the difference.
I hope this is clear :) I don't mean to be grumpy, I'm just plain stupid and need the "PLOD explained for dummies" formatted data, that's all it is :D
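One way to make the three cases distinguishable is to return an explicit status next to the score. The following is a minimal sketch under assumptions: the record shape (`name`, `isDead`, `plod`), the function name `describePlod`, and the 0.5 cutoff between "predicted dead" and "predicted alive" are all hypothetical and not part of the package; the -1 sentinel is the one proposed above.

```javascript
// Hypothetical record shape: { name, isDead, plod }. None of these field
// names are confirmed by the package; they only illustrate the idea.
function describePlod(character) {
  if (character.isDead) {
    // Sentinel suggested in this thread for characters that are already dead.
    return { name: character.name, status: "dead", plod: -1 };
  }
  // Assumed cutoff: a plod of 0.5 or more counts as "predicted dead".
  var status = character.plod >= 0.5 ? "predicted dead" : "predicted alive";
  return { name: character.name, status: status, plod: character.plod };
}

console.log(describePlod({ name: "Eddard Stark", isDead: true }));
console.log(describePlod({ name: "Samwell Tarly", isDead: false, plod: 0.719 }));
```

With a separate `status` field, a consumer no longer has to guess what a missing or negative number means.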
@goldbergtatyana I made a strange observation.
Probabilities for actually alive characters predicted to be alive:
name | probability (sorted)
[ 'pyat pree', 0.498 ],
[ 'tybolt hetherspoon', 0.497 ],
[ 'timett', 0.494 ],
...
[ 'matrice', 0.002 ],
[ 'dalla dragonstone ', 0.002 ],
[ 'lamprey', 0.001 ]

Probabilities for actually alive characters predicted to be dead:
name | probability (sorted)
[ 'tyrion lannister', 1 ],
[ 'daenerys targaryen', 1 ],
[ 'roose bolton', 1 ],
[ 'barristan selmy', 1 ],
[ 'jaime lannister', 1 ],
[ 'sansa stark', 1 ],
[ 'cersei lannister', 1 ],
[ 'petyr baelish', 1 ],
[ 'davos seaworth', 1 ],
[ 'aurane waters', 0.999 ]
...
[ 'wylis manderly', 0.501 ],
[ 'jon wylde', 0.501 ],
[ 'terrence kenning', 0.5 ]
Taken from test5.zip - no-filter - Naive Bayes. So with plod = 1.0 - probability (for alive characters predicted to be alive), the plods will be greater than 50%. Does the NaiveBayes-classifier handle probabilities differently? This looks like plod = probability no matter what predicted status.
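The "<0.5" sanity check can be run directly on the rows quoted above. This is a small sketch: the four rows are copied from the "predicted to be alive" list, while the helper name `violatesAliveCheck` and the two candidate conversions are made up for illustration.

```javascript
// [name, probability] rows for characters predicted to be alive (from the list above).
var predictedAlive = [
  ["pyat pree", 0.498],
  ["tybolt hetherspoon", 0.497],
  ["timett", 0.494],
  ["lamprey", 0.001]
];

// Returns the names whose resulting plod is NOT below 0.5,
// i.e. the rows that break the expectation for "predicted alive".
function violatesAliveCheck(rows, toPlod) {
  return rows
    .filter(function (row) { return toPlod(row[1]) >= 0.5; })
    .map(function (row) { return row[0]; });
}

function subtract(p) { return 1 - p; } // plod = 1.0 - probability
function asIs(p) { return p; }         // plod = probability

console.log(violatesAliveCheck(predictedAlive, subtract)); // every row violates the check
console.log(violatesAliveCheck(predictedAlive, asIs));     // no row violates the check
```

Every row fails the check with the subtraction and none fails without it, which matches the observation that the reported value already behaves like the plod.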
Hah, now I'm glad I put in the warning of <0.5, because we were so close (so so really close) to making another mistake.
Okay, so, the probabilities you are reporting in the first part of your note are probabilities for characters to be dead! For these probs you don't need to do any subtraction anymore. Looking at the prediction file (NaiveBayes.txt), I see that all you need to do to get the PLOD scores is to fetch the results from the second probabilities column, that's all! :)
satin will get a plod of 0.249
jason mallister of 0.428
wayn (guard) 0.12
...
edmure tully 0.993
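In code, "take the second probabilities column" could look like the sketch below. It assumes the prediction file has already been parsed into one row per character with a two-element `probabilities` array; that shape and the name `plodsFromRows` are assumptions, and the first-column values are left as `null` because they are not quoted in this thread.

```javascript
// Assumed shape after parsing the prediction file:
// probabilities = [first column, second column].
// The first column is null here because its values are not quoted in the thread.
var rows = [
  { name: "satin", probabilities: [null, 0.249] },
  { name: "jason mallister", probabilities: [null, 0.428] },
  { name: "wayn (guard)", probabilities: [null, 0.12] },
  { name: "edmure tully", probabilities: [null, 0.993] }
];

// Builds a name -> plod lookup from the second column, with no subtraction.
function plodsFromRows(parsedRows) {
  var plods = {};
  parsedRows.forEach(function (row) {
    plods[row.name] = row.probabilities[1];
  });
  return plods;
}

console.log(plodsFromRows(rows));
```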
I would provide PLODs for all actually dead and alive characters - it could be these values will be useful in the future??
@nicoladesocio, @dan736923 what's the status? did you fix it?
Fixed. PLODs are calculated for all characters, dead or alive.
@dan736923 you may need to update prediction.json
ah, ok, haven't seen pull request
What about these comments: https://github.com/Rostlab/JS16_ProjectA/issues/129#issuecomment-205231382 and https://github.com/Rostlab/JS16_ProjectA/issues/129#issuecomment-205282000?
The number of characters predicted by Group 6 (gotplod) is 1946 - which is correct, while the number of characters predicted by Group 7 (gotarffplod) is 1939 - this number is wrong.
@dan736923 but you used all of these features for the prediction of plods for 1939 characters. so, yeah, we do need plod predictions for all of them!
You still have 1939 characters, is this ok?
@dan736923 @goldbergtatyana @sacdallago
@AlexMoroz yeah, it's ok that the numbers don't match 100%. To report the PLOD scores, Group F will have to check:
So, you see PLOD results of both groups are not dependent on each other in any way.
So our original ARFF input file contained 1945 characters. I removed a few characters with date of birth/death over 200000. This is the data set the JSON files in the repository are currently based on. If I repeat the ML run with all 1945 characters the prediction model gets worse. If I refetch the characters from the database of Project A now, I get a set of 2028 characters.
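The date filter described above might be sketched like this. The field names `dateOfBirth` and `dateOfDeath`, the helper `hasPlausibleDates`, and the three sample records are made up for illustration; only the 200000 cutoff comes from the comment.

```javascript
var MAX_DATE = 200000; // cutoff mentioned in the comment above

// Hypothetical field names; the real ARFF attributes may be called differently.
// A character is kept when every date that is present is at most MAX_DATE.
function hasPlausibleDates(character) {
  var dates = [character.dateOfBirth, character.dateOfDeath];
  return dates.every(function (d) {
    return d === undefined || d === null || d <= MAX_DATE;
  });
}

// Made-up sample records, not real characters.
var sample = [
  { name: "A", dateOfBirth: 283 },
  { name: "B", dateOfBirth: 250000 },
  { name: "C", dateOfBirth: 260, dateOfDeath: 299 }
];

var kept = sample.filter(hasPlausibleDates);
console.log(kept.map(function (c) { return c.name; }));
```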
Ok, we can close this)
@dan736923 we have a set of predictions for 1946 characters from the final model that I reviewed and okayed. Let's stick with this model and its results. If we have no results for some (most probably minor, less important) characters, then this is how it is.
@AlexMoroz yep, we can close this.
https://github.com/Rostlab/JS16_ProjectA/issues/129#issuecomment-205205878