PokemonGoers / PokeData

In this project you will scrape as much data as you can get about *actual* sightings of Pokemon. As it turns out, players all around the world have started reporting sightings of Pokemon and are logging them into a central repository (i.e. a database). We want to get this data so we can train our machine learning models. You will of course need to come up with other data sources, not only for sightings but also for other relevant details that can be used later on as features for our machine learning algorithm (see Project B). Additional features could be the air temperature at the timestamp of a sighting, or whether the location is close to water, buildings or parks. Consult with a Pokemon Go expert if you have one around you and come up with as many features as possible that describe the place, time and name of a sighted Pokemon.

Another feature that you will implement is a Twitter listener: you will use the Twitter streaming API (https://dev.twitter.com/streaming/public) to listen on a specific topic (for example, the #foundPokemon hashtag). When a new tweet with that hashtag is written, an event will be fired in your application checking the details of the tweet, e.g. location, user, timestamp. Additionally, you will try to parse formatted text from the tweets to construct a new "seen" record that will then be added to the database. Some of the attributes of the record will be the Pokemon's name, location and timestamp.

Additional data sources (here is one: https://pkmngowiki.com/wiki/Pok%C3%A9mon) will also need to be integrated to give us more information about Pokemon, e.g. what they are, what their relationships are, what they can transform into, which attacks they can perform, etc.
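As a rough illustration of the tweet-parsing step, here is a minimal sketch in TypeScript. The tweet format (`#foundPokemon <name> @<lat>,<lon> <ISO timestamp>`), the `SightingRecord` shape and the function name are all assumptions made up for this sketch, not a spec agreed on by the project:

```typescript
// Minimal sketch of a tweet-to-record parser. The tweet format
// "#foundPokemon <name> @<lat>,<lon> <ISO timestamp>" is an assumption;
// the real listener would need to agree on a convention with players.
interface SightingRecord {
  pokemonName: string;
  location: { lat: number; lon: number };
  timestamp: Date;
  source: string; // e.g. the tweeting user's handle
}

const TWEET_PATTERN =
  /#foundPokemon\s+(\w+)\s+@(-?\d+(?:\.\d+)?),\s*(-?\d+(?:\.\d+)?)\s+(\S+)/i;

function parseSightingTweet(text: string, user: string): SightingRecord | null {
  const match = TWEET_PATTERN.exec(text);
  if (!match) return null; // tweet does not follow the expected format

  const timestamp = new Date(match[4]);
  if (isNaN(timestamp.getTime())) return null; // unparseable timestamp

  return {
    pokemonName: match[1],
    location: { lat: parseFloat(match[2]), lon: parseFloat(match[3]) },
    timestamp,
    source: user,
  };
}

// Example:
// parseSightingTweet("#foundPokemon Pikachu @48.1351,11.5820 2016-08-20T14:00:00Z", "ash")
// => { pokemonName: "Pikachu", location: { lat: 48.1351, lon: 11.582 }, ... }
```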
Apache License 2.0

Pagination for the API calls #119

Closed swathi-ssunder closed 8 years ago

swathi-ssunder commented 8 years ago

Currently the API responds with all the records, without any limit. As the data grows, the response will never be complete and the API will effectively become unusable. The only reason the APIs currently work is that the listener scripts are not running indefinitely, so there are only finitely many records in the DB.

So there needs to be a way to query the records in pages, e.g. records 1 to 1000, then 1001 to 2000, and so on.

Also see https://github.com/PokemonGoers/PokeData/issues/111#issue-174701397

sacdallago commented 8 years ago

It's an important issue. Because API calls are stateless, the pagination state should live in the client. A nice approach would be to set a hard limit (100 results?) and then add a query parameter defining the upper bound, in terms of timestamp, for that limit (because your records are time-bound). I was actually not the only one to have this idea (see the first answer): http://stackoverflow.com/questions/13872273/api-pagination-best-practices
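A minimal sketch of this timestamp-bounded approach, using an in-memory array in place of the real database (the `Sighting` shape, the `before` parameter and the limit of 100 are illustrative, not the actual PokeData API):

```typescript
// Sketch of timestamp-bounded "pagination": instead of page numbers, the
// client passes the timestamp of the oldest record it has already seen,
// and the server returns the next batch of records older than that bound.
interface Sighting {
  pokemonName: string;
  timestamp: Date;
}

const HARD_LIMIT = 100; // hard cap per response, as suggested above

function fetchSightings(all: Sighting[], before?: Date): Sighting[] {
  return all
    .filter((s) => !before || s.timestamp < before)
    .sort((a, b) => b.timestamp.getTime() - a.timestamp.getTime()) // newest first
    .slice(0, HARD_LIMIT);
}

// Client-side iteration: keep requesting with the oldest timestamp received.
// let batch = fetchSightings(db);
// while (batch.length > 0) {
//   process(batch);
//   batch = fetchSightings(db, batch[batch.length - 1].timestamp);
// }
```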

jonas-he commented 8 years ago

Guy's thoughts in #71:

... However, do not use pagination on the API; this is just bad practice. Instead, have a limited number of sightings by default. Then provide controls to limit the number of sightings (whether by count, area, period or payload size in KB).

vivek-sethia commented 8 years ago

Christian's thoughts:

... API calls are stateless, so the pagination state should live in the client. A nice approach would be to set a hard limit (100 results?) and then add a query parameter defining the upper bound, in terms of timestamp, for that limit (because your records are time-bound)

Guy's thoughts:

.. However, do not use pagination on the API; this is just bad practice. Instead, have a limited number of sightings by default. Then provide controls to limit the number of sightings (whether by count, area, period or payload size in KB).

Both thoughts are asking to avoid pagination in the API and to keep a default limit, but in that case how would the machine learning team get all of the sighting data? Would they query in iterations, keeping track of which set of records to fetch next (since we know the default size, assume 100 records)? Am I on the right track, or is there something else which needs to be done?

OR

by default always restricting Pokemon sightings data based on latitude, longitude, startTime and endTime, as suggested in issue #111
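For illustration, a request under that approach might look something like this (the path and parameter names here are made up, pending the actual API design):

```
GET /api/sightings?latitude=48.1351&longitude=11.5820&startTime=2016-08-20T00:00:00Z&endTime=2016-08-21T00:00:00Z
```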

@sacdallago

sacdallago commented 8 years ago

I'm not thinking about the ML guys for a moment; I'm thinking about the standards and the best approach. I like the #111 idea, but I would in any case implement a hard limit on the results, which can be "overwritten" using a time-based feature.

For the ML guys, worst case scenario, they get a dump of the DB to work with! Advantages of working directly with the people collecting the data :dancer: They are in any case only going to need it for training purposes, so they won't need to collect this data indefinitely, just once they have the best method, features, and so on.

@gyachdav @goldbergtatyana confirm?

gyachdav commented 8 years ago

@sacdallago - yes, confirmed.

Just to sum up:

  1. no pagination in the API
  2. sightings are limited by default (I suggest the last 2500)
  3. set location/time/size limits
  4. no special consideration should be made when designing the API for the ML effort (they may just get a data dump)
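
To make the summary concrete, here is a minimal sketch of what such an endpoint could look like, assuming an Express route and a Mongoose model; the route path, field names (`lat`, `lon`, `ts`) and query parameters are illustrative, not the actual PokeData API:

```typescript
import express from "express";
import mongoose from "mongoose";

// Illustrative schema; the real PokeData model will differ.
const Sighting = mongoose.model(
  "Sighting",
  new mongoose.Schema({ pokemonName: String, lat: Number, lon: Number, ts: Date })
);

const DEFAULT_LIMIT = 2500; // point 2: "last 2500" by default
const app = express();

app.get("/api/sightings", async (req, res) => {
  const query: any = {};
  const { startTime, endTime, minLat, maxLat, minLon, maxLon } = req.query;

  // Point 3: optional time window.
  if (startTime || endTime) {
    query.ts = {
      ...(startTime ? { $gte: new Date(String(startTime)) } : {}),
      ...(endTime ? { $lte: new Date(String(endTime)) } : {}),
    };
  }

  // Point 3: optional bounding box on location.
  if (minLat && maxLat) query.lat = { $gte: Number(minLat), $lte: Number(maxLat) };
  if (minLon && maxLon) query.lon = { $gte: Number(minLon), $lte: Number(maxLon) };

  // Points 1 and 2: no pagination, just a hard cap on the newest records.
  const results = await Sighting.find(query).sort({ ts: -1 }).limit(DEFAULT_LIMIT);
  res.json(results);
});

// mongoose.connect("mongodb://localhost/pokedata"); // connect before serving
// app.listen(3000);
```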