PokemonGoers / PredictPokemon-2

In this project we will apply machine learning to establish TLN (Time, Location and Name) prediction in Pokemon Go - that is, where Pokemon will appear, at what date and time, and which Pokemon it will be.
Apache License 2.0

Possible features #2

Closed sacdallago closed 8 years ago

sacdallago commented 8 years ago
goldbergtatyana commented 8 years ago

Hi @bensLine @semioniy @Aurel-Roci ! How are you getting along with the machine learning tutorial? Is everything clear, and do you have a good feeling for how to operate Weka?

semioniy commented 8 years ago

Hi, @goldbergtatyana. Not really good. What you covered in the tutorial is mostly clear, but it's hard to go beyond that toward a real understanding of what it does and how it does it. Still trying to figure it out, though.

semioniy commented 8 years ago

@bensLine, @Aurel-Roci, what do you think, should we consider the level of the trainer as well? This data probably won't be hard to gather, and it might influence how common or rare the Pokemon one finds are.

Aurel-Roci commented 8 years ago

@semioniy from what I have found from friends that play the game, at least for them it depends on the area, not on the level they have. The only thing that differs depending on the level is the CP the pokemon have.

Aurel-Roci commented 8 years ago

Hi, @goldbergtatyana, the tutorial was good, and since I was there I got a better understanding of it. What I am having problems with now is using the dummy data provided here https://github.com/gyachdav/pokemongo: how to write it into an .arff file to test it in Weka. Any way I can get some help with that?

bensLine commented 8 years ago

Hi @goldbergtatyana, sorry for the delay I was on holidays. Thanks for the tutorial it was great for refreshing, I used Weka already.

However, the way I understood the project is that the TLN predictions will be query based and there are two use cases:

@Aurel-Roci I committed a script to parse the data into an .arff file (https://github.com/PokemonGoers/PredictPokemon-2/tree/feature/test_data). It's very basic and might not be good JS practice - haven't used JS before :p I did not extract all data from the JSON file, e.g. id of the entry, as it is unique, or user id, as I don't think it influences the appearance of Pokemon. However, feel free to adapt the script.
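
For anyone else stuck on the .arff step: the format is just a header of `@ATTRIBUTE` declarations followed by `@DATA` rows, so a few lines of JS suffice. A minimal sketch - the field names (`latitude`, `longitude`, `pokemonId`) are assumptions about the dummy-data schema, not the actual one:

```javascript
// Minimal sketch of an ARFF writer. The input field names (latitude,
// longitude, pokemonId) are assumptions about the dummy-data schema.
function toArff(relation, sightings) {
  const header = [
    `@RELATION ${relation}`,
    '',
    '@ATTRIBUTE latitude NUMERIC',
    '@ATTRIBUTE longitude NUMERIC',
    '@ATTRIBUTE pokemonId NUMERIC',
    '',
    '@DATA'
  ];
  const rows = sightings.map(s => `${s.latitude},${s.longitude},${s.pokemonId}`);
  return header.concat(rows).join('\n') + '\n';
}

const arff = toArff('pokemon', [
  { latitude: 48.13, longitude: 11.58, pokemonId: 16 },
  { latitude: 48.15, longitude: 11.6, pokemonId: 19 }
]);
console.log(arff); // or fs.writeFileSync('train.arff', arff)
```

Note that for classification in Weka the class attribute would have to be declared NOMINAL (e.g. `@ATTRIBUTE pokemonId {16,19,...}`) rather than NUMERIC.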

I also added one more feature: the distance to a reference point, for testing purposes. I wanted to extract time features too, but most of the timestamps in the dummy data are within a 5-minute window, so we cannot really use them to find appropriate time intervals. However, we could slice the day into intervals, e.g. 3h slices, and then use those as features too. Maybe some Pokemon are more likely to appear in the evening.
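
The 3h-slice idea could look like this - a sketch that maps a timestamp to one of eight slices of the day (using UTC; local time would additionally need the timezone):

```javascript
// Sketch: map a timestamp to one of eight 3-hour slices of the day
// (0 = 00:00-03:00, ..., 7 = 21:00-24:00), using UTC.
function daySlice(isoTimestamp, sliceHours = 3) {
  return Math.floor(new Date(isoTimestamp).getUTCHours() / sliceHours);
}

console.log(daySlice('2016-09-05T20:15:00Z')); // 6 (the 18:00-21:00 slice)
```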

@semioniy concerning the trainer level, I would go with Aurel's answer

@sacdallago the hypothesis sounds good, we have to research if we can find a data source to extract features for that. I did not yet look into it.

sacdallago commented 8 years ago

@bensLine you committed directly on develop. Not good. Please use the GitFlow/feature model!!! Read some of the first announcement emails!.. From now on, we deduct points from anyone who commits without opening a PR

goldbergtatyana commented 8 years ago

Hi Machine Learners! Sorry, I was absent yesterday, but now I am back :wave:

@bensLine it is great how you summarized the ML problem. Really, good job!

At the moment I can also think of only these two queries. However, for the second one we can also set a time frame of say 5 minutes or half an hour. Then only the location will be unknown.

@Aurel-Roci et al.: did you get the arff file from the dummy data? Did you already try to apply any of the ML algorithms on it?

bensLine commented 8 years ago

Thanks @goldbergtatyana, it is nice to have the time frame for the second query.

Here are some ideas about features:

  1. Network coverage
  2. Population
  3. Location popularity
  4. Weather (precipitation, temperature, sun hours, ...)
  5. Sunrise/sunset (time before/after)
  6. Terrain (water, mountains, ...)

However, those features usually rely on an API to get the data, and most APIs allow only a limited number of requests per day. I assume those limits could be reached and our IP and/or API key could get blocked. Should we ignore this, or are there other ideas (several IPs/keys, ...)?

What do you think about the features, are they good, should we integrate them?

  1. We did not yet find a good solution. http://www.sensorly.com/ and http://developer.opensignal.com/networkrank/ provide information, but best would be to create our own (offline) map out of their data, which is probably a lot of work.
  2. Should be easier to obtain, e.g. from https://query.wikidata.org/. However, we would need several requests to resolve lat/lon to a location and then to the population of the referred city, if there is one.
  3. Given the previous point, it might be better to have a popularity measure for locations. However, how to calculate popularity from lat/lon needs to be further investigated. Do you think it would make sense?
  4. For weather data we could use http://openweathermap.org, but the API only provides current weather information for free, so we cannot enrich historical data (e.g. the data set from team A, or queries for past dates). If the data from A provides these features already, it should be fine (queries with past dates would be neglected). However, here we'll have the problem with the API request limit.
  5. Might already be available through the weather data. Otherwise, another source needs to be found.
  6. The terrain at a lat/lon can be used to specify whether water is close, and other terrain types. To obtain the data, the Overpass API for OpenStreetMap can be used: http://overpass-turbo.eu/

So, what do you think about the different features/data sources? Which one should we implement and how should we deal with the query limits?
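
On the query limits: one common way to soften them is to cache responses keyed by rounded coordinates, so nearby points share a single API call. A sketch - the fetcher here is a made-up stand-in, not a real API client (a real one would be async and return a Promise):

```javascript
// Sketch: cache API responses by rounded coordinates so nearby queries
// reuse one call. fetchFn is a hypothetical stand-in for any rate-limited
// API client (a real one would be async and return a Promise).
const cache = new Map();

function cachedLookup(lat, lon, fetchFn, precision = 1) {
  // one decimal degree of rounding -> roughly 10 km cells
  const key = `${lat.toFixed(precision)},${lon.toFixed(precision)}`;
  if (!cache.has(key)) cache.set(key, fetchFn(lat, lon));
  return cache.get(key);
}

let calls = 0;
const fakeWeather = () => { calls += 1; return { temp: 21 }; };
cachedLookup(48.13, 11.61, fakeWeather);
cachedLookup(48.14, 11.62, fakeWeather); // same ~10 km cell: served from cache
console.log(calls); // 1
```

The coarser the rounding, the fewer real requests, at the cost of feature resolution.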

Furthermore, should we actually use lon and lat directly as features? I read that people rasterize the earth's surface and use cell IDs instead, e.g. by using Google's S2 library, which can create cells down to 1 cm². However, this would end up in a huge value range even with large cells. On the other hand, we would no longer have the continuous lat/lon values and could express the location as a single feature. S2 also preserves spatial locality. But again, I'm not sure if it's worth implementing. Is it a good idea? There are JS ports of the library; one is actually supposed to be used by Pokemon Go itself: https://github.com/Daplie/s2-geometry.js

And related to that: we consider the whole globe, right? Or can we focus on a smaller area, e.g. only Europe?
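
If S2 turns out to be overkill, a flat lat/lon grid gives the same "single location feature" idea in a few lines. A sketch (unlike S2, these cells shrink in east-west extent toward the poles and the ids don't preserve spatial locality):

```javascript
// Sketch of a simple alternative to S2: rasterize lat/lon into a flat
// grid and use the cell index as a single location feature.
function gridCellId(lat, lon, cellDeg = 0.01) { // 0.01 deg ~ 1.1 km of latitude
  const row = Math.floor((lat + 90) / cellDeg);
  const col = Math.floor((lon + 180) / cellDeg);
  const cols = Math.ceil(360 / cellDeg);   // cells per row of the grid
  return row * cols + col;                 // flatten to one integer id
}

// Nearby points fall into the same cell:
console.log(gridCellId(48.137, 11.575) === gridCellId(48.139, 11.578)); // true
```

It has no external dependency, which may be enough for a first benchmark before pulling in an S2 port.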

goldbergtatyana commented 8 years ago

Great points @bensLine!!!

Let us get back to you shortly about the features you suggested.

As to Google's S2 library: it looks like exactly what we need! Using a grid of 1 cm² is probably too much for us. What we rather have in mind is the following:

Query 1: a user opens the app and we predict a Pokemon every 200 m (or 500 m or 1 km - we'll need to see which one performs best with our ML tool) within a square of 10 x 10 km. So a user can zoom out and move the map within that 10 x 10 km square without our ML needing to redo the calculation. If a user moves out of this square, then we'll need to run our ML method again for a new 10 x 10 km square.

Query 2: we rasterize the earth's surface at a very coarse-grained level (e.g. in squares of 100 km in length, maybe even more) and then we predict pokemons globe-wide.

For both queries we run predictions for the next 30 minutes or hour (again, depending on where the ML tool performs best), and the pipeline should be the same.
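
Query 1 as described could be sketched like this: build the grid of points to predict for around the user's position (square size and step as parameters; this uses the flat-earth approximation that 1° of latitude is roughly 111 km and longitude degrees shrink by cos(latitude)):

```javascript
// Sketch of Query 1: centers of a sizeM x sizeM square around the user,
// sampled every stepM meters. Flat-earth approximation: 1 deg latitude
// ~ 111 km; longitude degrees scaled by cos(latitude).
function predictionGrid(lat, lon, sizeM = 10000, stepM = 500) {
  const latStep = stepM / 111000;
  const lonStep = stepM / (111000 * Math.cos(lat * Math.PI / 180));
  const n = Math.floor(sizeM / stepM); // points per side
  const points = [];
  for (let i = 0; i < n; i++) {
    for (let j = 0; j < n; j++) {
      points.push([lat + (i - n / 2) * latStep, lon + (j - n / 2) * lonStep]);
    }
  }
  return points;
}

console.log(predictionGrid(48.137, 11.575).length); // 400 (20 x 20 at 500 m)
```

The classifier would then be queried once per grid point, and the whole grid recomputed only when the user leaves the square.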

goldbergtatyana commented 8 years ago

@gyachdav

gyachdav commented 8 years ago

@bensLine the features you're suggesting sound promising. However, at this point I would recommend simplifying things rather than overstretching yourself. Check out this simple tutorial first: http://tutorials.pluralsight.com/big-data/use-a-data-analytics-tool-to-predict-where-the-pokemon-are-going-to-appear

The only additional features beyond TLN are weather-related features and proximity to grass, water, and buildings.

I would suggest you:

  1. generate an arff file based on these features.
  2. use the S2 library to get a cell ID from the lon/lat pairs. you probably want to focus on a set of defined cells as a start.
  3. predict and benchmark your learner on this input set
  4. report your outcome to us.

Once we've mastered working with spatial data and predictions, we can move on to test other features. But let's get the basics right first!

bensLine commented 8 years ago

@gyachdav @goldbergtatyana thanks for the advice! We'll look into that.

goldbergtatyana commented 8 years ago

As a proof of principle, we can already train a prediction method now based on the dummy data. Task #1: upload a working arff file here

goldbergtatyana commented 8 years ago

Hey all,

Let's have the features summarized here again:

  1. Location
  2. Weather (e.g. temperature, humidity, atmospheric pressure, wind)
  3. Terrain (e.g. proximity to water, grass, buildings)
  4. Proximity to a gym (within 200 meters, 1 kilometer and 5 kilometers)
  5. Co-occurrence with each of the ~150 pokemons ever spotted in a radius of 200 meters, 1 kilometer and 5 kilometers
  6. Co-occurrence with each of the ~150 pokemons spotted within the last hour in a radius of 200 meters, 1 kilometer and 5 kilometers
  7. Population size

And their descriptions:

  1. I would make 3 features out of it (long, lat and the s2 ID)
  2. These will be nominal features
  3. These I imagine to be binary features
  4. Please provide here 3 binary features. Provide yes if the location is within 200 meters to a gym and no otherwise. Do the same also for the radius of 1 and 5 km.
  5. Altogether there are 151 pokemons, which should result in 453 features. Provide for each Pokemon a yes if a Pokemon has ever been spotted within 200 meters to a location and no otherwise. Do the same for the radius of 1 and 5 km.
  6. Similar to 5.
  7. Here anything can work: either the population size of the area, or the number of buildings, or just a classification of whether the area has >10 mio inhabitants, 1-10 mio, 200-500k, or <200k
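
Point 4 could be sketched as follows: haversine distance to the nearest gym, turned into one yes/no feature per radius. The gym coordinates below are made-up placeholders, not real data:

```javascript
// Sketch of feature 4: distance to the nearest gym (haversine formula),
// turned into one yes/no feature per radius. The gym list is a made-up
// placeholder, not real data.
function haversineM(lat1, lon1, lat2, lon2) {
  const R = 6371000; // mean earth radius in meters
  const toRad = x => x * Math.PI / 180;
  const dLat = toRad(lat2 - lat1), dLon = toRad(lon2 - lon1);
  const a = Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * R * Math.asin(Math.sqrt(a));
}

function gymFeatures(lat, lon, gyms, radiiM = [200, 1000, 5000]) {
  const nearest = Math.min(...gyms.map(([gLat, gLon]) => haversineM(lat, lon, gLat, gLon)));
  return radiiM.map(r => (nearest <= r ? 'yes' : 'no'));
}

const gyms = [[48.1400, 11.5800]]; // placeholder: one gym ~440 m away
console.log(gymFeatures(48.1374, 11.5755, gyms)); // [ 'no', 'yes', 'yes' ]
```

The same shape works for pokestops, and for features 5 and 6 with a per-pokemon sighting list instead of a gym list.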

Once we have all these features, we'll do feature selection to identify those that contribute most to the prediction. The goal now is to come up with as many candidate features as possible. Feel free to suggest more features and add them to the list.

Please assign each of the points we have so far in the list to a different member of your group. Reach out to us for help :)

gyachdav commented 8 years ago

this table might be of interest

http://pokemongohub.net/pokemon-go-spawn-rate/

bensLine commented 8 years ago

@gyachdav that's definitely interesting, thanks! Do you know how much time passes from one spawn to another? I'm not sure how to map the occurrence rate to time intervals. Probably not straightforward :p

@goldbergtatyana thanks for the list :) Features 5 and 6, however, are not yet clear. We discussed them a bit in our team, and it seems like they would require a classifier for each location (or at least each cell).

E.g. feature 5: if we add 'co-occurrence of pokemons ever spotted in a radius', we need to refer to a location. As a result, we end up with different data sets for all locations. And for feature 6 we would have to create a new classifier constantly, as we only consider sightings within the last hour relative to the request timestamp.

So it seems we would calculate features 5 and 6 for every request (consisting of a location and a time). Is this right? If so, we would have to recalculate features 5 and 6 for the entire training set and then retrain the SVM for every request, which would take some time I guess.. Is this feasible, or am I getting something wrong?

goldbergtatyana commented 8 years ago

Hi @bensLine! The idea is actually a much simpler one. :)

Let's discuss it with you and your team members tomorrow on Skype, as suggested, and summarize the tasks for features 5 & 6 tomorrow again.

gyachdav commented 8 years ago

@bensLine what we know is that they used 100M data points over a period of one week. So if Pidgey appeared 1598 times per 10k sightings, it should appear roughly 1.6M times in a 10M-point dataset, i.e. ~1.6M times in one week - roughly one every 1/3 of a second. The main problem is of course that spawns are not spread evenly over time (as the "spawn most active" column suggests). So now we need to figure out what the spawn distribution for Pidgey is. I'm pretty sure you can get this distribution from the API project A provides. Then, instead of dividing 1.6M by the ~600k seconds in a week, distribute those sightings across the number of seconds you back out of the distribution (make sure you scale this number to one week). That should give you your answer.
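
The back-of-envelope above, as a sketch (numbers taken from the discussion; the even-spread assumption is exactly the simplification pointed out there):

```javascript
// Back-of-envelope from the discussion: a per-10k-sightings rate, a
// dataset size, and a time span give an average spawn interval - under
// the (unrealistic) assumption that spawns are spread evenly over time.
function meanSpawnIntervalSec(ratePer10k, datasetSize, periodSec = 7 * 24 * 3600) {
  const expectedSightings = (ratePer10k / 10000) * datasetSize;
  return periodSec / expectedSightings;
}

// Pidgey: 1598 per 10k sightings in a 10M-point, one-week dataset.
console.log(meanSpawnIntervalSec(1598, 10e6).toFixed(2)); // 0.38
```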

semioniy commented 8 years ago

@goldbergtatyana, @gyachdav, if I understand right, these are all features to be implemented. Is that right?

  1. Location, in the form of:
     a. Lat/long
     b. S2 cell #5
     c. Timezone (for time parsing)
  2. Time: #8
     a. UTC
     b. Local time, as:
        i. Classical form
        ii. Minutes from midnight
        iii. Part of the day (morning, noon, afternoon, evening, etc.)
        iv. Time before/after sunrise/sunset
  3. Weather: #6
     a. Temperature
     b. Rain/sunny etc.
     c. Barometric pressure
     d. Humidity
     e. Wind (force, direction)
  4. Terrain specifications: #7
     a. Grass/asphalt/building
     b. Proximity to water
     c. Maybe altitude (is it a mountain?)
  5. Proximity to a gym:
     a. Source: map of gyms / Ingress map?
     b. Proximity in meters as a nominal attribute (100/200/300/500/750/1000m)
  6. Co-occurrence with each of ~150 pokemons ever spotted in a radius of 200/1000/5000m
  7. Co-occurrence with each of ~150 pokemons spotted within the last hour in a radius of 200/1000/5000m
  8. Population size (of the city)
  9. Population density in that area (if we find a source)
  10. Same for location popularity (network usage as an indicator?)
  11. Same for network coverage
  12. Split pokemons by their type (fire/grass)

gyachdav commented 8 years ago

You currently have no descriptive features of the Pokemon so I would add:

  1. average CP
  2. Average HP
  3. Weight
  4. Height
  5. Type of attacks (you can have individual binary feature for each attack type)
  6. Weakness, plus any other descriptive feature.

@goldbergtatyana what do you think? We can alternatively just use the pokemonId as a representation of all those descriptive features but maybe expanding the feature set is worthwhile here?

goldbergtatyana commented 8 years ago

Thanks @semioniy for the list! It looks complete to me. Can you please make a todo list out of it, where we can strike through items that are already completed? It would be very nice to have such a list as an overview of what is done.

@gyachdav: the descriptive features of the Pokemon we will predict won't be available (we won't know its height and weight), and if we limit ourselves to predicting the location of Pokemon A with height X, then we limit ourselves too much. Rather, we want to predict A of all heights. So I think these descriptive features are not of big help for our prediction task.

gyachdav commented 8 years ago

@goldbergtatyana not sure I understood that last part. The descriptive features of a pokemon are available to us as much as the pokemonID is available to us. If we predict a pokemon name to appear at a certain time and location, then we're also predicting what height, weight, avg CP, etc. it will have.

semioniy commented 8 years ago

//TODO

goldbergtatyana commented 8 years ago

We agreed with @gyachdav to have the features describing a Pokemon as a simple lookup. No need to implement height, weight and other descriptive features for our ml model.

juanmirocks commented 8 years ago

Gyms - possible additional features (as long as they can be extracted from the data API):

We know that gym values change often. However, the highest ranking of the last winning team can give us an estimate of the level of trainers in the area.

Pokestops - possible additional features (as long as they can be extracted from the data API):

Lures - possible additional features (as long as they can be extracted from the data API):

We know that lures disappear fast. Yet I think we could identify typical areas where lures are common == many pokemon trainers == many pokemons (related to pokestops).

Please, pokemon go experts :-) think of more possible features related to gyms, pokestops, and lures - or anything else you think is relevant to predicting the presence of pokemon.

MatthiasBaur commented 8 years ago

All features in the first list have been implemented. Pokestops have been implemented similarly to gyms, as binary features based on several distances. The next step is to cut down the number of features again. I will reference the corresponding issue shortly.

MatthiasBaur commented 8 years ago

62