Rostlab / JS16_ProjectB_Group6

Game of Thrones characters are always in danger of being eliminated. The challenge in this assignment is to estimate how much risk of elimination the characters who are still alive face. The goal of this project is to rank characters by their Percentage Likelihood of Death (PLOD). You will assign a PLOD using machine learning approaches.
GNU General Public License v3.0

Feature selection #16

Closed Hack3l closed 8 years ago

Hack3l commented 8 years ago
asesselmann commented 8 years ago

Please check out the new branch arff_conversion. Pulling data works, also the created arff file looks fine.

For now there are a lot of '?' in the file, because the database is not populated yet. We can also add more features; I only did the easy ones :)

(It works, but I actually never worked with callbacks before, hope you can understand what I did :) )

subburamr commented 8 years ago

Possible list of additional features for evaluation:

goldbergtatyana commented 8 years ago

@subburamr very good suggestions!

another feature to use could be: relationship to a dead character

sacdallago commented 8 years ago

Remember the feature freeze! Try to be proactive about the data: don't just ask Project A if they can provide you with it. Try to see if you can implement something to get that data yourselves, e.g. from other sources/databases/wikis/IMDb, whatever.

asesselmann commented 8 years ago

List of features we already have: name, title, gender, culture, dateOfBirth, dateOfDeath, mother, father, heir, house, spouse, isAlive

Features we have implemented but that have not been written to the database yet: placeOfBirth, placeOfDeath, allegiance, characterPopularity, parents, books, placeOfLastVisit

We should think of:
[ ] make spouse a boolean
[ ] consider characters older than 100 (?) years as dead

goldbergtatyana commented 8 years ago

Very good suggestions!

Apropos "mother", "father", "spouse": it makes sense to ask whether they are already dead (and to have the answer as a boolean). "PlaceOfDeath" probably does not apply to living characters, for whom we want to make a prediction of death :) Also note that you can compile some of the features yourself without waiting for the database to be populated with them.

asesselmann commented 8 years ago

It is true that placeOfDeath is not useful for living characters. But we consider every character dead who has either a date or a place of death specified, so this will help us label characters as dead or alive. So I think it would be quite nice to get this feature :)
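
The labeling rule described here could be sketched roughly as follows. This is only an illustration under assumptions: the function names and the dict-based character record are made up for the example, and only the field names `dateOfDeath` and `placeOfDeath` come from this thread.

```python
def _recorded(value):
    # Treat None and empty strings as "not recorded"; 0 would still be a valid year.
    return value is not None and value != ""


def is_dead(character):
    """Label a character dead if a date or a place of death is recorded.

    `character` is assumed to be a plain dict keyed by feature name.
    """
    return _recorded(character.get("dateOfDeath")) or _recorded(
        character.get("placeOfDeath")
    )
```

With this rule, a character without either field is simply treated as alive, which matches how the class label is discussed in this issue.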

Hack3l commented 8 years ago

And we wanted to use the placeOfDeath together with the last visited place of living characters as a feature (the idea being that other people are more likely to die in places where a lot of people have already died).

asesselmann commented 8 years ago

We should also think about kicking characters that have far too few features out of the training set. Would that influence our prediction?

goldbergtatyana commented 8 years ago

@asesselmann I agree that features PlaceOfDeath and DateOfDeath are really good features on their own. However, for the prediction of PLODs they will be useless as they will never have an entry for characters who are still alive.

goldbergtatyana commented 8 years ago

@Hack3l "the last visited place" makes a lot of sense, though "placeOfDeath" will always be blank for those who are alive :)

Hack3l commented 8 years ago

@goldbergtatyana Yes, but we can use placeOfDeath as the last visited place for dead characters

goldbergtatyana commented 8 years ago

@asesselmann

"We also should think about kicking characters which have far too few features out of the training set?"

Please have a look at the dead characters who are always misclassified by your predictor. Is there a pattern that can be recognized by eye? Can it be that a character with no features, or only very few (and rather unimportant) ones, is always predicted alive or always predicted dead? If this is the case, kick them out and say no prediction is possible.
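
One way to try this could look like the sketch below. The feature list is taken from the features named earlier in this thread; the helper names and the threshold `min_known=3` are arbitrary assumptions for illustration, not values anyone agreed on.

```python
# Features named in this thread; adjust to whatever the ARFF file really contains.
FEATURES = ["title", "gender", "culture", "dateOfBirth",
            "mother", "father", "heir", "house", "spouse"]


def known_feature_count(character, features=FEATURES):
    """Count features that are present and not marked as unknown."""
    return sum(1 for f in features if character.get(f) not in (None, "", "?"))


def split_by_coverage(characters, min_known=3):
    """Return (usable, no_prediction) lists based on how many features are known."""
    usable, no_prediction = [], []
    for character in characters:
        if known_feature_count(character) >= min_known:
            usable.append(character)
        else:
            no_prediction.append(character)
    return usable, no_prediction
```

Comparing the misclassified characters against the `no_prediction` list would show whether sparse records really are the ones the predictor gets wrong.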

goldbergtatyana commented 8 years ago

@Hack3l I don't quite understand. If you have a feature called "last_visited_place" and the person is dead, then the ML will already understand that there is a connection between place and death.

Btw, can you compile information on where most characters die? If there is a pattern, then let's have features "visited_place_1", "visited_place_2" and so on, with yes/no or 0/1 values saying whether a character has been to these places. Is this idea clear?
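
A rough sketch of that idea, under assumptions: `visitedPlaces` is a hypothetical field holding the places a character has been to (it is not one of the features listed in this thread), and the helper names are invented for the example.

```python
from collections import Counter


def deadliest_places(characters, top_n=3):
    """Return the top_n places where the most characters died."""
    counts = Counter(
        c["placeOfDeath"] for c in characters if c.get("placeOfDeath")
    )
    return [place for place, _ in counts.most_common(top_n)]


def visited_place_features(character, places):
    """Build one 0/1 feature per place: has the character been there?"""
    visited = set(character.get("visitedPlaces", []))  # hypothetical field
    return {f"visited_{place}": int(place in visited) for place in places}
```

The place list would be computed once from the dead characters and then applied to every character, so living and dead characters get the same columns.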

Hack3l commented 8 years ago

@goldbergtatyana I think we meant different things. I was talking about taking the value of placeOfDeath for the feature last_visited_place, not about the feature placeOfDeath ^^ Yes, we can try that.
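
In code, the fallback meant here might look like this minimal sketch. It assumes a dict-based character record; `placeOfDeath` and `placeOfLastVisit` are the field names from this thread, while the function name is just for illustration.

```python
def last_visited_place(character):
    """Use placeOfDeath for dead characters, otherwise the last known visit.

    Returns None when neither field is set, so the value can later be
    written out as unknown.
    """
    return character.get("placeOfDeath") or character.get("placeOfLastVisit") or None
```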

asesselmann commented 8 years ago

I'm currently working on booleans isAlive for mother, father,... @Hack3l What should we use as a default value? If a character has no defined mother, shall we consider the mother as dead or alive? If we use NUMERIC values, can we still have question marks for unknown? (What I mean: does weka handle question marks as undefined, and are 0 and 1 then still considered booleans?)

Hack3l commented 8 years ago

@asesselmann maybe a third value for undefined?

asesselmann commented 8 years ago

like "?" ?

asesselmann commented 8 years ago

cite: "If we use NUMERIC values, can we still have question marks for unknown? (What I mean: does weka handle question marks as undefined, and are 0 and 1 then still considered booleans?)"

goldbergtatyana commented 8 years ago

@asesselmann @Hack3l If a value is unknown, you must explicitly represent it with a question mark (?).
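
For the relatives' booleans this could be sketched as below: 1 for dead, 0 for alive, and `None` for unknown, to be written as `?` later. The function name and the `by_name` lookup (a dict from character name to character record) are assumptions made for the example.

```python
def relative_is_dead(character, relation, by_name):
    """Return 1 if the relative is dead, 0 if alive, None if unknown.

    `relation` is a field such as "mother", "father" or "spouse".
    `by_name` is an assumed lookup from character name to record.
    """
    name = character.get(relation)
    relative = by_name.get(name) if name else None
    if relative is None:
        return None  # no relative recorded, or not in the database
    alive = relative.get("isAlive")
    if alive is None:
        return None  # relative known, but status unknown
    return 0 if alive else 1
```

This keeps "no mother recorded" separate from "mother is alive", instead of picking a default for either.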

asesselmann commented 8 years ago

@goldbergtatyana

Question 1: So if the datatype is boolean, we have three different values: ?, 0, 1?

Question 2: If the datatype is String, do we need to use ?, or is "?" also OK?

Question 3: We do not know whether any character really is alive; we just consider him alive if he has not died yet. For this value it does not seem to make sense to me to use ?, only to set isAlive = true.

Hack3l commented 8 years ago

@asesselmann Yes, it should work with 0, 1 and ?. No, "?" is not OK; it didn't work for me.

goldbergtatyana commented 8 years ago

@asesselmann Q1: yes; Q2: I thought it was also a yes, but @Hack3l says it's a no, so please check the Weka tutorial (I'm on the phone right now, so it would be way slower if I did it); Q3: yep, the class should only be dead or alive. Btw, characters who are older than 100 should be labeled deaaaad. :)

Hack3l commented 8 years ago

For the first Weka tests I needed to remove the quotes around "?" because Weka thought it was a string.
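
A small sketch of how a data row could be written so that unknown values end up as a bare `?` while real strings stay quoted. The helper names are made up for this example, and the backslash escaping of quotes is an assumption about what the ARFF reader accepts, so it is worth checking against the Weka documentation.

```python
def arff_value(value):
    """Format one value for an ARFF data row."""
    if value is None:
        return "?"  # bare, unquoted: the ARFF marker for a missing value
    if isinstance(value, bool):
        return "1" if value else "0"  # booleans as 0/1
    if isinstance(value, (int, float)):
        return str(value)
    # Quote strings; escaping style is an assumption, see the note above.
    text = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return f"'{text}'"


def arff_row(character, attributes):
    """Join the values of the given attributes into one comma-separated row."""
    return ",".join(arff_value(character.get(a)) for a in attributes)
```

For example, `arff_row({"name": "Some Character", "isAlive": True}, ["name", "spouseIsDead", "isAlive"])` gives `'Some Character',?,1`.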

asesselmann commented 8 years ago

Ok, but is it ok the way it is at the moment?

Hack3l commented 8 years ago

@asesselmann Yes, that's exactly how it should be

asesselmann commented 8 years ago

Great :)

Project A mentioned that season six takes place in 305 AC. Should I use that date to calculate the ages of characters?

https://github.com/Rostlab/JS16_ProjectA/issues/77
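
If 305 AC is used, the age calculation and the "older than 100" rule from earlier in this thread could be sketched like this. It assumes dateOfBirth is stored as an integer year in AC; the constant and function names are illustrative only.

```python
CURRENT_YEAR_AC = 305  # year mentioned for season six in this thread
MAX_AGE = 100          # "older than 100" rule from this thread


def age_in_current_year(date_of_birth, current_year=CURRENT_YEAR_AC):
    """Return the age in years, or None if the birth year is unknown."""
    if date_of_birth is None:
        return None
    return current_year - date_of_birth


def presumed_dead_by_age(date_of_birth, max_age=MAX_AGE):
    """True only when the age is known and strictly greater than max_age."""
    age = age_in_current_year(date_of_birth)
    return age is not None and age > max_age
```

Characters without a known birth year are left alone here rather than being guessed either way.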