GetAllSightings: Sort buffer overflow

PokemonGoers / PokeData

In this project you will scrape as much data as you can get about the *actual* sightings of Pokemon. As it turns out, players all around the world have started reporting Pokemon sightings and logging them into a central repository (i.e. a database). We want to get this data so we can train our machine learning models. You will of course need to come up with other data sources, not only for sightings but also for other relevant details that can be used later on as features for our machine learning algorithm (see Project B). Additional features could be the air temperature at the timestamp of the sighting, or whether the location is close to water, buildings or parks. Consult a Pokemon Go expert if you have one around you, and come up with as many features as possible that describe the place, time and name of a sighted Pokemon. Another feature that you will implement is a Twitter listener: you will use the Twitter streaming API (https://dev.twitter.com/streaming/public) to listen on a specific topic (for example, the #foundPokemon hashtag). When a new tweet with that hashtag is written, an event will be fired in your application that checks the details of the tweet, e.g. location, user, time stamp. Additionally, you will try to parse formatted text from the tweets to construct a new “seen” record that will then be added to the database. Some of the attributes of the record will be the Pokemon's name, location and the time stamp. Additional data sources (here is one: https://pkmngowiki.com/wiki/Pok%C3%A9mon) will also need to be integrated to give us more information about Pokemon, e.g. what they are, how they are related, what they can transform into, which attacks they can perform etc.

Apache License 2.0


GetAllSightings: Sort buffer overflow #155

Closed by johartl 8 years ago

johartl commented 8 years ago

curl http://pokedata.c4e3f8c7.svc.dockerapp.io:65014/api/pokemon/sighting/

Error 404: Not Found
{
    "message": "Failure. No sighting details found!",
    "data": "getMore executor error: Overflow sort stage buffered data usage of 33554598 bytes exceeds internal limit of 33554432 bytes"
}
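
For reference, the limit quoted in the message is exactly 32 MiB, MongoDB's in-memory sort limit at the time, and the sort overshot it by only 166 bytes. A small sketch that extracts both numbers from the message text; the helper name `parseSortOverflow` is made up for illustration and is not part of the PokeData API:

```javascript
// Hypothetical helper (not part of PokeData): pull the two byte counts
// out of the "Overflow sort stage" error text quoted above.
function parseSortOverflow(message) {
  const match = /usage of (\d+) bytes exceeds internal limit of (\d+) bytes/.exec(message);
  if (match === null) {
    return null; // not a sort-overflow message
  }
  const used = Number(match[1]);
  const limit = Number(match[2]);
  return {
    used,
    limit,
    overBy: used - limit,             // bytes above the limit
    limitMiB: limit / (1024 * 1024),  // limit expressed in MiB
  };
}

const info = parseSortOverflow(
  "getMore executor error: Overflow sort stage buffered data usage of " +
  "33554598 bytes exceeds internal limit of 33554432 bytes"
);
console.log(info.overBy, info.limitMiB);
```

The sort fails as soon as the buffered documents cross the limit, so the small overshoot says nothing about how large the full result set would have been.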

jonas-he commented 8 years ago

@johartl thanks for the info, it's a known issue we are working on.

jonas-he commented 8 years ago

@sacdallago it seems the reason for this is that we don't have an index on the "time" attribute of our sightings, so the sort has to be done in memory for 2.5 million sightings, which overflows the sort buffer. We did define an index for "time", though, and we don't know why it is not there. If we try to create the index now, it fails because of duplicates, so we can either drop the duplicates or drop the whole collection and start from scratch (which would be the easiest solution).
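
The "drop duplicates" option could look roughly like the sketch below, run over an exported list of sightings before the unique index is created. It is an illustration only: the field names `appearedOn`, `pokemonId` and `location` are taken from the index definition given later in this thread, while the function names and the shape of `location` are assumptions.

```javascript
// Build a comparison key from the three fields that the unique index covers.
// Assumption: `location` is a plain object whose keys always appear in the
// same order; otherwise JSON.stringify would give different keys for equal values.
function sightingKey(sighting) {
  return JSON.stringify([
    new Date(sighting.appearedOn).getTime(),
    sighting.pokemonId,
    sighting.location,
  ]);
}

// Keep the first sighting for every key and drop all later duplicates.
function dropDuplicates(sightings) {
  const seen = new Set();
  const unique = [];
  for (const sighting of sightings) {
    const key = sightingKey(sighting);
    if (!seen.has(key)) {
      seen.add(key);
      unique.push(sighting);
    }
  }
  return unique;
}
```

For 2.5 million documents this keeps one key string per unique sighting in memory, so it is meant for an offline export rather than for the API process itself.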

gyachdav commented 8 years ago

Define start from scratch.

jonas-he commented 8 years ago

This is what I mean: drop the "pokemonsightings" collection, which has ~2.5 million sightings and was the result of running the scrapers for around 12 hours until we hit the storage limit on the mlab instance. Then restart the listening process so that the index for the time attribute gets created and there is no more buffer overflow when someone requests sightings from our API. Of course I already made a backup of the collection on my local machine in case someone cries for more data. As a temporary fix I created a new index that ignores duplicates, and the API requests seem to work now. But as soon as our DB migrates to a bigger instance (on some rostlab server), I would say we start with a new collection to make sure that we have the right index.

sacdallago commented 8 years ago

@jonas-he we can also export the data we have so far into a dump and start anew on the mlab servers. The rostlab server is still giving me all the issues it possibly can, and it's starting to become annoying.

jonas-he commented 8 years ago

@sacdallago yes, we could start anew on the mlab servers, but we would still have the storage limit. The only difference to the collection that we have now would be that we would have the index {"appearedOn": -1, "pokemonId": 1, "location": 1}, {"unique": true} instead of just {"appearedOn": -1} (which I created manually to fix the API error temporarily). So there is no big benefit apart from having no duplicates. I would suggest waiting for rostlab and then starting anew.
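
The two index definitions above could be written down as follows. This is a sketch, not the project's actual code: `ensureIndex` and the constant names are made up, and `collection` is assumed to be a MongoDB Node.js driver collection (or any object with a compatible `createIndex(keys, options)` method).

```javascript
// Temporary index that was created manually to stop the sort overflow.
const TEMPORARY_INDEX = {
  keys: { appearedOn: -1 },
  options: {},
};

// Intended index: the same sort order, plus uniqueness per sighting.
const INTENDED_INDEX = {
  keys: { appearedOn: -1, pokemonId: 1, location: 1 },
  options: { unique: true },
};

// Hypothetical helper: forwards one definition to the collection.
// With the real driver, createIndex returns a promise that rejects
// if existing documents violate the unique constraint.
function ensureIndex(collection, index) {
  return collection.createIndex(index.keys, index.options);
}
```

Both definitions start with `appearedOn` in descending order, so either one lets a sort on `appearedOn` be served from the index instead of the in-memory sort buffer; only the second one also rejects duplicates.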