Hi @bensLine @semioniy @Aurel-Roci ! How are you getting along with the machine learning tutorial? Is everything clear, and do you have a good feeling for how to operate Weka?
Hi @goldbergtatyana. Not really well. What you covered in the tutorial is more or less clear, but it's hard to go beyond that to a real understanding of what Weka does and how it does it. Still trying to figure it out, though.
@bensLine, @Aurel-Roci, what do you think, should we maybe consider the trainer's level as well? This data probably won't be hard to gather, and maybe it influences how common or rare the Pokemon one finds are?
@semioniy from what I have heard from friends who play the game, at least for them it depends on the area, not on the level they have. The only thing that differs depending on the level is the CP the Pokemon have.
Hi @goldbergtatyana, the tutorial was good, and since I was there I got a better understanding of it. What I'm having problems with now is using the dummy data provided here: https://github.com/gyachdav/pokemongo. How do I write it into an .arff file to test it in Weka? Any way I can get some help with that?
Hi @goldbergtatyana, sorry for the delay, I was on holiday. Thanks for the tutorial, it was a great refresher, as I have used Weka before.
However, the way I understood the project is that the TLN predictions will be query-based, and there are two use cases:
@Aurel-Roci I committed a script to parse the data into an .arff file (https://github.com/PokemonGoers/PredictPokemon-2/tree/feature/test_data). It's very basic and might not be good JS practice - I haven't used JS before :p I did not extract all fields from the JSON file, e.g. the id of an entry, as it is unique, or the user id, as I don't think it influences the appearance of Pokemon. However, feel free to adapt the script.
I also added one more feature: the distance to a reference point, for testing purposes. I wanted to extract time features too, but the timestamps in the dummy data are almost all within a 5-minute window, so we cannot really use them to find appropriate time intervals. However, we could slice the day into intervals, e.g. 3h slices, and use those as features too. Maybe some Pokemon are more likely to appear in the evening.
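In case you don't want to dig through the branch, the core of the script is roughly this (a simplified sketch; the field names of the dummy JSON are from memory and may differ from what the actual script uses):

```js
// simplified sketch: dummy sightings JSON -> Weka .arff
// (assumes entries look like { pokemonId, latitude, longitude })
var fs = require('fs');

var REF = { lat: 48.2620, lng: 11.6670 }; // arbitrary reference point

// haversine distance in km
function distanceKm(lat1, lng1, lat2, lng2) {
  function rad(x) { return x * Math.PI / 180; }
  var dLat = rad(lat2 - lat1), dLng = rad(lng2 - lng1);
  var a = Math.sin(dLat / 2) * Math.sin(dLat / 2) +
          Math.cos(rad(lat1)) * Math.cos(rad(lat2)) *
          Math.sin(dLng / 2) * Math.sin(dLng / 2);
  return 2 * 6371 * Math.asin(Math.sqrt(a));
}

var sightings = JSON.parse(fs.readFileSync('dummy.json', 'utf8'));

// unique pokemon ids form the nominal class attribute
var ids = sightings.map(function (s) { return s.pokemonId; })
                   .filter(function (id, i, arr) { return arr.indexOf(id) === i; });

var header = [
  '@relation pokemon',
  '@attribute latitude numeric',
  '@attribute longitude numeric',
  '@attribute refDistance numeric',
  '@attribute pokemonId {' + ids.join(',') + '}',
  '@data'
].join('\n');

var rows = sightings.map(function (s) {
  return [s.latitude, s.longitude,
          distanceKm(s.latitude, s.longitude, REF.lat, REF.lng).toFixed(3),
          s.pokemonId].join(',');
});

fs.writeFileSync('pokemon.arff', header + '\n' + rows.join('\n') + '\n');
```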
@semioniy concerning the trainer level, I would go with Aurel's answer
@sacdallago the hypothesis sounds good; we have to research whether we can find a data source to extract features for that. I haven't looked into it yet.
@bensLine you committed directly to develop. Not good. Please use the GitFlow/feature model!!! Read some of the first announcement emails!.. From now on, we deduct points from whoever commits without opening a PR.
Hi Machine Learners! Sorry, I was absent yesterday, but now I am back :wave:
@bensLine it is great how you summarized the ML problem. Really, good job!
At the moment I can also think of only these two queries. However, for the second one we can also set a time frame of, say, 5 minutes or half an hour. Then only the location will be unknown.
@Aurel-Roci et al.: did you get the arff file from the dummy data? Did you already try applying any of the ML algorithms to it?
Thanks @goldbergtatyana, it is nice to have the time frame for the second query.
Here are some ideas about features:
However, those features usually rely on an API to get the data, and most APIs only allow a limited number of requests per day. I assume those limits could be reached and our IP and/or API key would get blocked. Should we ignore this, or are there other ideas (several IPs/keys, ...)?
What do you think about the features: are they good, and should we integrate them?
So, what do you think about the different features/data sources? Which ones should we implement, and how should we deal with the query limits? One idea for the limits is sketched below.
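One idea to stay under the limits: cache responses per coarse grid cell, since weather and terrain data barely change within a few kilometers and a few hours. A rough sketch (`fetchWeather` is just a placeholder for whichever API we pick):

```js
// sketch: cache rate-limited API responses per coarse grid cell
var cache = {};
var CELL = 0.05;             // cell size in degrees (~5 km), just a guess
var TTL = 3 * 3600 * 1000;   // reuse a cached response for 3 hours

function cellKey(lat, lng) {
  return Math.floor(lat / CELL) + ':' + Math.floor(lng / CELL);
}

// fetchWeather(lat, lng) is a placeholder for the real API call;
// it is assumed to return a Promise
function getWeather(lat, lng, fetchWeather) {
  var key = cellKey(lat, lng);
  var hit = cache[key];
  if (hit && Date.now() - hit.time < TTL) {
    return Promise.resolve(hit.data); // no API request spent
  }
  return fetchWeather(lat, lng).then(function (data) {
    cache[key] = { data: data, time: Date.now() };
    return data;
  });
}
```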
Furthermore, should we actually use lon and lat directly as features? I read that people rasterize the earth's surface and use cell ids instead, e.g. with Google's S2 library, which can create cells as small as 1 cm². However, this would end up in a huge value range even with large cells. On the other hand, we would no longer have the continuous lat/lon values and could express the location as a single feature; S2 also preserves spatial locality. But again, I'm not sure if it's worth implementing. Is it a good idea? There are JS ports of the library; one is actually supposed to be used by Pokemon Go itself: https://github.com/Daplie/s2-geometry.js
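Going by the README of that port, usage would roughly look like this (coordinates and level are just examples):

```js
// sketch based on the s2-geometry.js README
var S2 = require('s2-geometry').S2;

var lat = 48.137;  // example coordinates (Munich)
var lng = 11.575;
var level = 15;    // cell edge of a few hundred meters; level 30 gives the ~1 cm cells

// Hilbert-curve key, e.g. '4/032212303102210'; nearby cells tend to share
// prefixes, which is the spatial-locality property mentioned above
var key = S2.latLngToKey(lat, lng, level);

// single numeric cell id that could serve as one location feature
var id = S2.keyToId(key);

console.log(key, id);
```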
And related to that: we consider the whole globe, right? Or can we focus on a smaller area, e.g. only Europe?
Great points @bensLine!!!
Let us get back to you shortly about the features you suggested.
As to Google's S2 library: it looks like exactly what we need! Using a grid of 1 cm² is probably too much for us. What we rather have in mind is the following:
Query 1: a user opens the app and we predict a Pokemon at every 200 m (or 500 m or 1 km; we'll need to see which one performs best with our ML tool) within a square of 10 x 10 km. So a user can zoom out and move the map within a 10 x 10 km square without our ML needing to redo the calculation. Only if the user moves out of this square do we need to run our ML method again, for a new 10 x 10 km square.
Query 2: we rasterize the earth's surface at a very coarse-grained level (e.g. in squares of 100 km in length, maybe even more) and then we predict Pokemon globe-wide.
For both queries we run predictions for the next 30 minutes or hour (again, depending on which performs best with the ML tool), and the pipeline should be the same. A rough sketch of the query 1 rasterization is below.
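In code, the rasterization for query 1 could look roughly like this (a sketch only; `predict` stands for whatever model we end up with, and the step size is to be tuned):

```js
// sketch: rasterize a 10 x 10 km square around the user,
// one prediction per grid point, valid for the next 30-60 min
var GRID_KM = 10;
var STEP_KM = 0.2; // 200 m; 500 m and 1 km to be tested too

function gridPoints(userLat, userLng) {
  var kmPerDegLat = 111.32;
  var kmPerDegLng = 111.32 * Math.cos(userLat * Math.PI / 180);
  var steps = Math.round(GRID_KM / STEP_KM); // 50 steps -> 51 x 51 points
  var points = [];
  for (var i = 0; i <= steps; i++) {
    for (var j = 0; j <= steps; j++) {
      points.push({
        lat: userLat + (i * STEP_KM - GRID_KM / 2) / kmPerDegLat,
        lng: userLng + (j * STEP_KM - GRID_KM / 2) / kmPerDegLng
      });
    }
  }
  return points; // 2601 points at a 200 m step
}

// predict(point, time) is the trained model, e.g.:
// gridPoints(48.137, 11.575).forEach(function (p) { predict(p, Date.now()); });
```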
@gyachdav
@bensLine the features you're suggesting sound promising. However, at this point I would recommend simplifying rather than overstretching yourself. Check out this simple tutorial first - http://tutorials.pluralsight.com/big-data/use-a-data-analytics-tool-to-predict-where-the-pokemon-are-going-to-appear
The only additional features beyond TLNs are weather-related features and proximity to grass, water, and buildings.
I would suggest you:
Once we've mastered working with spatial data and predictions, we can move on to testing other features. But let's get the basics down first!
@gyachdav @goldbergtatyana thanks for the advice! We'll look into that.
As a proof of principle, we can already train a prediction method on the dummy data. Task #1: upload a working arff file here
Hey all,
Let's have the features summarized here again:
And their descriptions:
Once we have all these features, we'll do feature selection to identify those that contribute most to the prediction. For now, the goal is to come up with as many candidate features as possible. Feel free to suggest more features and add them to the list.
Please assign each of the points we have so far in the list to a different member of your group. Reach out to us for help :)
this table might be of interest
@gyachdav that's definitely interesting, thanks! Do you know how much time passes from one spawn to the next? I'm not sure how to map the occurrence rate to time intervals. Probably not straightforward :p
@goldbergtatyana thanks for the list :) Features 5 and 6, however, are not yet clear to me. We discussed them a bit in our team, and it seems like they would require a classifier for each location (or at least for each cell).
E.g. feature 5: if we add 'co-occurrence of Pokemon ever spotted in a radius', we need to refer to a location. As a result, we end up with different data sets for all locations. And for feature 6 we would have to create a new classifier constantly, as we only consider sightings within the last hour before the request timestamp.
So it seems we would have to calculate features 5 and 6 for every request (consisting of a location and a time), as in the sketch below. Is this right? If so, we'd have to recompute features 5 and 6 for the entire training set and then retrain the SVM for every request. That would take some time I guess... is this feasible, or am I getting something wrong?
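To make that concrete, this is how I currently picture feature 5 being computed per request (just a sketch; `sightings` stands for the historic data set):

```js
// sketch: feature 5 as we currently read it - for a query location,
// one count per species ever sighted within a given radius
function haversineKm(lat1, lng1, lat2, lng2) {
  function rad(x) { return x * Math.PI / 180; }
  var dLat = rad(lat2 - lat1), dLng = rad(lng2 - lng1);
  var a = Math.sin(dLat / 2) * Math.sin(dLat / 2) +
          Math.cos(rad(lat1)) * Math.cos(rad(lat2)) *
          Math.sin(dLng / 2) * Math.sin(dLng / 2);
  return 2 * 6371 * Math.asin(Math.sqrt(a));
}

function coOccurrence(queryLat, queryLng, radiusKm, sightings) {
  var counts = {}; // pokemonId -> number of sightings within the radius
  sightings.forEach(function (s) {
    if (haversineKm(queryLat, queryLng, s.latitude, s.longitude) <= radiusKm) {
      counts[s.pokemonId] = (counts[s.pokemonId] || 0) + 1;
    }
  });
  return counts; // ~150 feature values, recomputed for every query location
}
```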
Hi @bensLine! The idea is actually a much simpler one. :)
Let's discuss it with you and your team members tomorrow on Skype, as suggested, and then summarize the tasks for features 5 & 6 here again.
@bensLine what we know is that they used 10M data points over a period of one week. So if Pidgey appeared 1598 times per 10k sightings, it should appear roughly ~1.6M times in the 10M dataset, i.e. ~1.6M sightings in one week, or roughly one every 1/3 of a second. The main problem is of course that spawns are not spread evenly over time (as the "spawn most active" column suggests). So now we need to figure out what the spawn distro for Pidgeys is. I'm pretty sure you can get this distro from the API project A provides. Then, instead of dividing 1.6M by the 600k seconds in a week, distribute those sightings across the number of seconds you back out of the distro (make sure you scale this number to one week). That should give you your answer.
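In numbers, the naive version (before correcting for the distro) is:

```js
// back-of-the-envelope, assuming spawns were spread evenly over the week
var datasetSize = 10e6;             // sightings collected in one week
var perTenK = 1598;                 // Pidgey sightings per 10k sightings
var secondsPerWeek = 7 * 24 * 3600; // 604800, the ~600k above

var pidgeyPerWeek = datasetSize * perTenK / 10000;   // 1,598,000 ~ 1.6M
var secondsBetween = secondsPerWeek / pidgeyPerWeek; // ~0.38 s, i.e. ~1/3 s
console.log(pidgeyPerWeek, secondsBetween);
```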
@goldbergtatyana, @gyachdav, if I understand right, these are all features to be implemented. Is that right?

- Location, in the form of:
  - a. Lat/long
  - b. S2 cell (cell size?)
  - c. Timezone (for time parsing)
- Time:
  - a. UTC
  - b. Local time, as:
    - i. Classical form
    - ii. Minutes from midnight
    - iii. Part of the day (morning, noon, afternoon, evening etc.)
    - iv. Time before/after sunrise/sunset
- Weather:
  - a. Temperature
  - b. Rain/sunny etc.
  - c. Barometer
  - d. Humidity
  - e. Wind (force, direction)
- Terrain specifications:
  - a. Grass/asphalt/building
  - b. Proximity to water
  - c. Maybe altitude (is it a mountain?)
- Proximity to a gym:
  - a. Source: map of gyms / Ingress map?
  - b. Proximity in meters as a nominal attribute (100/200/300/500/750/1000 m)
- Co-occurrence with each of the ~150 Pokemon ever spotted in a radius of 200/1000/5000 m
- Co-occurrence with each of the ~150 Pokemon spotted within the last hour in a radius of 200/1000/5000 m
- Population size (of the city)
- Population density in the area (if we find a source)
- Same for location popularity (network usage as an indicator?)
- Same for network coverage
- Split Pokemon by their type (fire/grass)
You currently have no descriptive features of the Pokemon themselves, so I would add:
@goldbergtatyana what do you think? We could alternatively just use the pokemonId as a representation of all those descriptive features, but maybe expanding the feature set is worthwhile here?
Thanks @semioniy for the list! It looks complete to me. Can you please make a todo list out of it, where we can strike through items that are already completed? It would be very nice to have such a list as an overview of what is done.
@gyachdav: the descriptive features of the Pokemon that we will predict won't be available (we won't know its height and weight), and if we limit ourselves to predicting the location of Pokemon A with height X, then we limit ourselves too much. Rather, we want to predict A of all heights. So I think these descriptive features are not of big help for our prediction task.
@goldbergtatyana not sure I understood that last part. The descriptive features of a Pokemon are available to us just as much as the pokemonId is. If we predict a Pokemon name to appear at a certain time and location, then we're also predicting what height, weight, avg CP etc. it will have.
We agreed with @gyachdav to have the features describing a Pokemon as a simple lookup. No need to implement height, weight, and other descriptive features for our ML model.
Gyms, possible additional features (as long as it's possible to extract them from the data API):
We know that gym values change often. However, the highest ranking of the last winning team can give us an estimate of the level of the trainers in the area.
Pokestops, possible additional features (as long as it's possible to extract them from the data API):
Lures, possible additional features (as long as it's possible to extract them from the data API):
We know that lures disappear fast. Yet, I think we could get typical areas where lures are common == many Pokemon trainers == many Pokemon (related to Pokemon numbers).
Please, Pokemon Go experts :-) think of more possible features related to gyms, pokestops, and lures -- or anything else you think is relevant for getting the presence of Pokemon.
All features in the first list have been implemented. Pokestops have been implemented similarly to gyms, as binary features based on several distances. The next step is to cut down the number of features again. I will reference the corresponding issue shortly.
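For reference, "binary features based on several distances" means one 0/1 attribute per distance threshold, along these lines (attribute names and the `nearestM` input are illustrative, not the exact code):

```js
// sketch: encode proximity to gyms/pokestops as one binary
// feature per distance threshold
var THRESHOLDS_M = [100, 200, 300, 500, 750, 1000];

// nearestM is the distance in meters to the closest gym/pokestop,
// e.g. computed with a haversine over the POI list
function proximityFeatures(prefix, nearestM) {
  var features = {};
  THRESHOLDS_M.forEach(function (t) {
    features[prefix + 'Within' + t + 'm'] = nearestM <= t ? 1 : 0;
  });
  return features;
}

// e.g. proximityFeatures('gym', 260)
// -> { gymWithin100m: 0, gymWithin200m: 0, gymWithin300m: 1,
//      gymWithin500m: 1, gymWithin750m: 1, gymWithin1000m: 1 }
```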