Bungie-net / api

Resources for the Bungie.net API

Bulk data for analytics #352

Open bbickell opened 6 years ago

bbickell commented 6 years ago

Is there a better (faster, less painful for the API to deal with) way to get every post game carnage report? Right now I'm using the naive approach of walking them by activity ID, which is painfully slow relative to the speed at which they're being generated, of course.

If there were a bulk download, that'd be fantastic. I'd originally started out wanting only the PGCRs for PVP activity, but of course there's no filter to get just the activity IDs related to PVP modes.
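
For concreteness, the naive walk looks roughly like this (a simplified sketch in Python; the API key is a placeholder, and the client-side PVP filter on `activityDetails.modes` is my own workaround since there's no server-side mode filter on this endpoint):

```python
import time
import requests

API_KEY = "YOUR-API-KEY"  # placeholder; register an app on bungie.net for a real key
BASE = "https://www.bungie.net/Platform"
ALL_PVP_MODE = 5  # DestinyActivityModeType.AllPvP

def fetch_pgcr(session, activity_id):
    """Fetch one post game carnage report by activity (instance) id, or None."""
    resp = session.get(
        f"{BASE}/Destiny2/Stats/PostGameCarnageReport/{activity_id}/",
        headers={"X-API-Key": API_KEY},
        timeout=30,
    )
    body = resp.json()
    if body.get("ErrorCode") != 1:  # 1 == Success in the standard response envelope
        return None
    return body["Response"]

def walk_pvp_pgcrs(start_id, count):
    """Naively walk activity ids one at a time; simple, but painfully slow."""
    with requests.Session() as session:
        for activity_id in range(start_id, start_id + count):
            pgcr = fetch_pgcr(session, activity_id)
            if pgcr is None:
                continue  # missing ids and gaps do happen
            # No server-side mode filter exists, so filter client-side:
            # activityDetails.modes includes the parent mode families.
            if ALL_PVP_MODE in pgcr["activityDetails"].get("modes", []):
                yield pgcr
            time.sleep(0.1)  # crude pacing to stay under the throttle
```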

bladefist commented 6 years ago

@bbickell There is not. But what kind of analytics are you trying to do?

bbickell commented 6 years ago

@bladefist I work for a data analytics firm as my day job, and I'm always looking for interesting data sets for research, blog posts, and demos of high-performance databases. I'm also an avid Destiny 2 fan (of course), so this data set is especially interesting. After looking at the API for a while, I quickly figured out that the PGCRs were where the interesting bits for player analytics were (DAUs, MAUs, weapon balance/PVP meta, etc.).

vthornheart-bng commented 6 years ago

I always thought that would be fun to play around with as well - I know some sites, such as Destiny Trials Report, currently get that info by literally grabbing every PGCR. We don't have a way to expose that data in bulk at this time, but let's log this as an enhancement request for sometime in the future.

bbickell commented 6 years ago

Thanks for at least considering it, and also for letting me know that I haven't missed a more limit-friendly way to retrieve that data. I'm currently somewhere around activityId 8.6 million, so it's unlikely that with my current approach I'll ever catch up to recent history. Even if there were a way to walk only PVP events, it'd be something.

I've always wondered how Destiny Tracker does it, and my guess is that they simply retrieve the PGCRs per player once you give them a login name, and can then keep using those names to find new PGCRs and stay up to date.
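
If I had to guess at that per-player approach, it would look something like this sketch against the Destiny2.GetActivityHistory endpoint (the membership/character values are placeholders, and this is speculation about their pipeline, not how Destiny Tracker actually does it):

```python
import requests

API_KEY = "YOUR-API-KEY"  # placeholder
BASE = "https://www.bungie.net/Platform"

def activity_instance_ids(membership_type, membership_id, character_id,
                          mode=5, page_limit=10):
    """Page through a character's activity history and yield PGCR instance ids.

    mode=5 (AllPvP) asks the server to return only PVP activities, which is
    exactly the filter that doesn't exist when walking raw activity ids.
    """
    with requests.Session() as session:
        for page in range(page_limit):
            resp = session.get(
                f"{BASE}/Destiny2/{membership_type}/Account/{membership_id}"
                f"/Character/{character_id}/Stats/Activities/",
                params={"mode": mode, "count": 100, "page": page},
                headers={"X-API-Key": API_KEY},
                timeout=30,
            )
            body = resp.json()
            activities = body.get("Response", {}).get("activities", [])
            if not activities:
                break  # ran out of history for this character
            for activity in activities:
                yield activity["activityDetails"]["instanceId"]
```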

Thanks for the response!

vthornheart-bng commented 6 years ago

Yeah - indeed, the most common scenario for using stats data with the API centers on looking at a single character, and for that the API works well. For larger-scale analysis it's definitely lacking, and on a personal level I'd love to see that made easier!

You should get in touch with the Trials Report folks: I've heard that they've built up an AWS cluster to mine it (being careful about throttling on each of the individual nodes), and that's how they keep up. You might be able to collaborate with them and get access to their data if they happen to have it already accessible in a more bulk-oriented form!

bbickell commented 6 years ago

That makes sense. I can definitely get enough compute for this, and I know AWS well enough. My understanding was that the throttle limit was tied to my API key rather than to individual compute nodes/IPs? Even with basic asynchronous code I can hit the throttle pretty easily on a single host.

Or, do you think they've registered multiple applications? I hadn't gone that way yet because I wasn't sure it was in the spirit of the API and I was trying to be a good API citizen!
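
For context, the "basic asynchronous stuff" I mentioned is roughly this shape (an illustrative sketch assuming aiohttp; the concurrency cap and delay are arbitrary guesses on my part, not documented limits):

```python
import asyncio
import aiohttp

API_KEY = "YOUR-API-KEY"  # placeholder
BASE = "https://www.bungie.net/Platform"
MAX_IN_FLIGHT = 5         # guessed ceiling, not a documented limit
MIN_DELAY_SECONDS = 0.05  # spacing between request starts

async def fetch_pgcr(session, semaphore, activity_id):
    """Fetch one PGCR while capping in-flight requests from this host."""
    async with semaphore:
        await asyncio.sleep(MIN_DELAY_SECONDS)  # crude pacing between requests
        async with session.get(
            f"{BASE}/Destiny2/Stats/PostGameCarnageReport/{activity_id}/",
            headers={"X-API-Key": API_KEY},
        ) as resp:
            body = await resp.json()
            return body.get("Response")

async def fetch_many(activity_ids):
    semaphore = asyncio.Semaphore(MAX_IN_FLIGHT)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *(fetch_pgcr(session, semaphore, i) for i in activity_ids)
        )

# Example: asyncio.run(fetch_many(range(8_600_000, 8_600_100)))
```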

vthornheart-bng commented 6 years ago

Oh, nay - fortunately in this case the throttle limit is per IP. Admittedly it's a little bit outside the spirit of the API, in that when the first version of these endpoints was created, way back in Destiny 1, we never really pictured anyone doing this with them. But as long as you keep yourself well-managed, throttling yourself appropriately per server, we don't mind, particularly if you have a mechanism to quickly disable the crawl if need be, and/or exponential backoff if you start getting request failures.
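
Something like the following is the shape I mean, purely as an unofficial sketch (the ErrorCode/ThrottleSeconds handling assumes the standard response envelope, and the kill switch here is only illustrative):

```python
import time
import requests

API_KEY = "YOUR-API-KEY"  # placeholder
BASE = "https://www.bungie.net/Platform"
KILL_SWITCH = False  # flip this (or read it from a file/flag) to stop crawling fast

def fetch_with_backoff(session, activity_id, max_attempts=6):
    """Fetch one PGCR, backing off exponentially on failures or throttling."""
    delay = 1.0
    for attempt in range(max_attempts):
        if KILL_SWITCH:
            raise RuntimeError("crawler disabled by kill switch")
        resp = session.get(
            f"{BASE}/Destiny2/Stats/PostGameCarnageReport/{activity_id}/",
            headers={"X-API-Key": API_KEY},
            timeout=30,
        )
        if resp.status_code == 200:
            body = resp.json()
            if body.get("ErrorCode") == 1:  # Success
                return body["Response"]
            # The envelope may suggest how long to wait before retrying.
            delay = max(delay, body.get("ThrottleSeconds", 0) or delay)
        # Anything else: back off and try again.
        time.sleep(delay)
        delay *= 2
    return None
```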

We may send you a message and ask you to check out your algorithm if it ends up causing so many requests that it negatively impacts us (this happened a couple of times with Trials Report, because at one point they had too many instances hitting us with too many requests simultaneously), but as long as you keep communication open we're okay with it.

Hit up the Trials Report guys before you jump in, though: you might be able to work collaboratively and share the data instead of each doing that processing separately! I don't know if that's feasible or if they're set up to do that, but it could be worth asking!

They'll also likely have a great deal of advice about the oddities you may encounter when pulling PGCR data, and some of the tricks they've used to make themselves robust against those issues. For instance: how do you know when you're "caught up" with current time? What if you see a skip in PGCR IDs? What if there was a bug in the API causing incorrect data to be returned over some period of time, so those PGCRs need to be reprocessed? These are all problems the Trials Report folks must have had to solve one way or another to get their system up and running.
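
Purely to illustrate the kind of bookkeeping I mean (and not how Trials Report actually does it), a crawl cursor that tracks gaps and a rough "caught up" heuristic might look like:

```python
import json

class CrawlCursor:
    """Track crawl position, missing ids, and a rough 'caught up' signal.

    Purely illustrative bookkeeping, not anyone's actual implementation.
    """

    def __init__(self, next_id):
        self.next_id = next_id
        self.gaps = []               # ids that returned nothing; revisit later
        self.consecutive_misses = 0  # current run length of missing ids

    def record(self, activity_id, found):
        if found:
            self.consecutive_misses = 0
        else:
            self.consecutive_misses += 1
            self.gaps.append(activity_id)
        self.next_id = max(self.next_id, activity_id + 1)

    def probably_caught_up(self, miss_threshold=1000):
        # Heuristic: a long unbroken run of missing ids more likely means
        # we've passed the newest published PGCR than hit a real gap.
        return self.consecutive_misses >= miss_threshold

    def save(self, path):
        # Persist state so reprocessing (e.g. after a bad-data window) can
        # target specific id ranges instead of starting over.
        with open(path, "w") as f:
            json.dump({"next_id": self.next_id, "gaps": self.gaps}, f)
```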