BSData / bsdata

BattleScribe data file hosting platform
http://battlescribedata.appspot.com/
Bandwidth and release system drastic improvement #112

Closed amis92 closed 5 years ago

amis92 commented 8 years ago

So, the idea here is essentially to reduce the workload handled by the Appspot app.

The main point is to make GitHub's release management feature an even more important part of data distribution. Currently releases are used only as tags. The potential lies in the .zip that GitHub attaches automatically to each release. It contains the snapshot at that tag, already zipped. If we could redirect people (BattleScribe, actually) to download that file, the workload on our webapp would be down to just .bsi creation and distribution.

An even braver idea would be to create a repo-managing web app for us, the data admins. Currently we have 49 repositories. It isn't viable to administer all of them to a common standard. So, the main part would be a release agent (not the one used in resin miniature manufacture :P ) which would follow all repositories and aggregate important data, i.e. whether there are any unmerged pull requests, how old they are, when the latest release was, whether there are any new unreleased changes, etc. For repos that are scarcely maintained, where maintainers forgot to release something, it could automatically create a release after a threshold of, say, 7 days.
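
A minimal sketch of the "release after a threshold" rule described above. The `RepoStatus` record and its field names are hypothetical, made up for illustration; only the 7-day default comes from the proposal.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


# Hypothetical summary of one data repo; the field names are illustrative only.
@dataclass
class RepoStatus:
    name: str
    last_release_at: Optional[datetime]  # None if the repo has never been released
    last_commit_at: datetime
    open_pull_requests: int


def needs_auto_release(status: RepoStatus, now: datetime,
                       threshold: timedelta = timedelta(days=7)) -> bool:
    """True when the repo has unreleased changes at least `threshold` old."""
    has_unreleased = (status.last_release_at is None
                      or status.last_commit_at > status.last_release_at)
    return has_unreleased and (now - status.last_commit_at) >= threshold
```

The agent would call something like this once per repository on each pass and only create a release when it returns `True`.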

But the pivotal role of that agent would be to automatically create releases which would include index.bsi.

That way, bsi origin address could be for example https://github.com/bsdata/wh40k/releases/latest/index.bsi

And that way, the only thing we would have to do on appspot website is to aggregate and redirect users to GitHub links. Zero bandwidth required. GitHub manages all the fuss, and data admins have a much better overview on the global situation.

I can create the release agent webapp some time in summer, but it'd require BattleScribe to support downloading .zip files as a kind-of .bsr format (but without that format file extension, as we'd be using GitHub's in-built feature). Then, BattleScribe would download the whole repo and just extract only the updated files.

Jonskichov commented 8 years ago

So, generally I totally agree :)

We've had some rate limit issues over time, which is why I ended up taking the "quick and dirty" action of increasing the appspot cache refresh timer to 2 hours now... Which is pretty bad. FYI, here's GitHub's rate limit documentation: https://developer.github.com/v3/#rate-limiting

I'm gonna bullet-point my thoughts on your post, in no particular order... ;)

1) The Appspot site is (of course) open source (and the comments are OK) so anyone is free to chip in. The site really does need some love, but (at least right now) I can't guarantee being able to put the time in myself. If you're happy to take on some of this then that's awesome! I can also Tweet/Facebook/Whatever to try and drum up some contributors if needed.

2) I made some changes... oooh a long time ago... and totally screwed up the caching mechanism. It's pretty shitty right now to be honest. I'm not proud of it, but at the time it worked and I had to move on to other things. Fixing this would probably help a lot, at least when it comes to GitHub's request limit. This might be an angle for a quick fix to avoid GitHub's rate limit.

3) With regard to taking the load off of Appspot, I actually disagree. I'm happy to foot the bill for this, and it is entirely manageable. Appspot doesn't limit resources, as long as said resources are paid for, but GitHub has a limit. Feel free to do any computational/bandwidth related tasks on Appspot - in fact I feel it's better to offload such tasks to Appspot rather than GitHub. Put the load onto Appspot.

4) With regard to no.3 above, the Appspot site is serving between 5 and 10 GB of data PER DAY. Yes, this is... wow... But Appspot's rates are very reasonable and the real issue here is keeping GitHub happy. The goal should be to minimise the Appspot-to-GitHub bandwidth/requests, and instead shift it to App-to-Appspot bandwidth/requests.

5) So... The GitHub release zip files. You're right, here lies the key. Right now, when the Appspot file cache expires, it checks for new releases in the GitHub repos. If a new release exists, it downloads all the files, caches them and generates a .bsi for them. This hits the GitHub request limit real hard. Instead it could download the release zip, extract it, cache the files contained within, then build the .bsi. This would solve GitHub's request limit (at least until we had 5000+ repos!).

6) Probably best to keep things split up into the .bsi file and all the separate .catz/.gstz files. BattleScribe is smart enough to download just the files it needs (based on the information in the .bsi), rather than the whole lot, saving bandwidth and requests. Downloading full .bsr or .zip files for each user is going to hit bandwidth harder than necessary. But letting Appspot download the zip, then distribute the contents to users would be fine.
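
The flow in point 5 (fetch one release zip, extract it, then index the files) could be sketched roughly like this. It is only an illustration: the returned dictionary is a stand-in for the real .bsi, whose format isn't described in this thread, and the list of data file extensions is an assumption. The code also assumes the archive wraps everything in a single top-level folder, as GitHub's source zips do.

```python
import hashlib
import io
import zipfile

# Assumed set of data file extensions; adjust to whatever the repos actually hold.
DATA_SUFFIXES = (".cat", ".gst", ".catz", ".gstz")


def index_release_zip(zip_bytes: bytes) -> dict:
    """Map each data file in a release archive to its size and SHA-256 digest."""
    entries = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir() or not info.filename.lower().endswith(DATA_SUFFIXES):
                continue
            data = archive.read(info)
            # Drop the single top-level folder that wraps the archive contents.
            name = info.filename.split("/", 1)[-1]
            entries[name] = {
                "size": len(data),
                "sha256": hashlib.sha256(data).hexdigest(),
            }
    return entries
```

One download per release replaces one request per file, which is the part that matters for the request limit.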

OK, hope that's all useful :) The idea of building a more capable author/repo management system is brilliant though! I hope all my crap provides some useful insight, but do know I'm 100% behind the idea.

TL;DR: If you want to go ahead and build a better Appspot site, with better author tools, release tools and whatnot, then I'm all for it! Sounds amazing :D Anything to make you guys' lives easier. Don't be concerned with Appspot limits, in fact go ahead and leverage Appspot as much as you need. The primary concern is to help data authors.

amis92 commented 8 years ago

Hmm. I'm not sure, but when it comes to GitHub, I can't find any mention of bandwidth limits. Instead, there is Request Limiting - but mind you: for unauthenticated requests, this would be 60 per hour per user IP. So, any app user could make up to 60 requests per hour. Which, in case we'd go with the idea of BattleScribe being able to handle release zip files, would mean any user could possibly download up to 60 updates per hour. Which in all honesty, is not going to happen.
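
For reference, GitHub reports the remaining request budget in the `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers of its REST API. A small sketch of how a client could read them, assuming the headers are passed in as a plain dictionary with exactly that capitalisation (the function name is made up):

```python
import time
from typing import Mapping, Optional


def seconds_until_allowed(headers: Mapping[str, str],
                          now: Optional[float] = None) -> float:
    """How long to wait before the next API call, based on rate-limit headers."""
    now = time.time() if now is None else now
    # If the header is missing, assume the response was not rate limited.
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0
    # X-RateLimit-Reset is a Unix timestamp in seconds.
    reset_at = float(headers.get("X-RateLimit-Reset", now))
    return max(0.0, reset_at - now)
```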

So, the new version of my idea would be like so:

Huh? So, we have backwards compatibility, but BattleScribe 2.0 could leverage the new mechanism.

Again, if you find there's a bandwidth limit on GitHub, please tell me, I'm still searching for it. Other than that, I think direct traffic between GitHub and the BattleScribe install is the simplest and safest option, and the lightest for AppSpot, while still being within reasonable limits.

And I fully understand that you're engaged with BattleScribe 2.0 right now, and I'm most interested for you to stay so :D I can handle the webapp myself, I think.

So, the actual request now is: could index.bsi be updated to optionally include a URL for a zipped file containing a bunch of catalogues? And BS2.0 would have some policy (maybe user-triggered) to use that, either optionally or by default?

amis92 commented 8 years ago

Hm. Now I've read some documentation again, I'm not sure - but it's possible we could even completely keep BS app users out of GitHub API limits:

https://developer.github.com/v3/repos/releases/#get-a-single-release-asset

There the response provides two links: one for the API, and one called "browser_download_url". I wonder, since the latter makes no mention of the API and the URL itself doesn't look API-related, would using that URL count against API rate limiting? I think it doesn't. After making a few such requests and inspecting the HTTP traffic, it seems completely unrelated to the API. No rate-limit headers, no API URLs. I think it's safe to say that it's the best content delivery method we can get.
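
A minimal sketch of picking that link out of a release object. It assumes the release JSON has already been fetched and parsed into a dictionary with an `assets` list whose entries carry `name` and `browser_download_url`; the helper name is made up.

```python
from typing import Optional


def asset_download_url(release: dict, asset_name: str) -> Optional[str]:
    """Find the direct download URL of a named asset in a release object."""
    for asset in release.get("assets", []):
        if asset.get("name") == asset_name:
            return asset.get("browser_download_url")
    return None  # the release has no asset with that name
```

The appspot site would make the single API call and hand only the resulting URL to BattleScribe, so app users would never touch the API themselves.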

Space0din commented 8 years ago

Hey, Amadeusz sent me here to inquire about a separate but equal optional repository for homebrew codices, so they could be used alongside regular ones.

I had a thought--why not create a subcategory within Force Types that would allow access to homebrew codices?

E.g., "CAD - Homebrew"; "Formation Detachment (HB)"

Surely that would be less troublesome than a separate repository?

amis92 commented 8 years ago

The actual problem we'd have persists: people downloading the 40k repo will also unintentionally download these homebrew cats. That's what we're trying to avoid and separate repo is the only way to go, I think.

aetherith commented 7 years ago

Sorry to necro this thread but I got curious and started playing with the new v4 GitHub API. Below is a GraphQL query that provides the following info:

All of that data can get pulled for 2 rate points out of what looks like an hourly quota of 5000. I haven't looked into the mutations that the API allows but I bet with a little fussing you could have it auto close PRs and create releases on a timer of some sort. If y'all are still interested in a dashboard of some sort and/or have any suggestions for platform I'd be curious to explore it further. Here's the query and an example of the return. You can check it out yourself here...

EDIT: Changed to PasteBin links per request.

Query: GitHub GraphQL Query for Repo Data

Response: GitHub GraphQL Example Response
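
For readers without access to the linked pastes, here is a rough sketch of what a query of this kind can look like. It is not the query from the link: the selection of fields is a guess at what a dashboard might want (open pull requests and the latest release per repo), and the function name is made up. The request body would be POSTed to `https://api.github.com/graphql` with an access token.

```python
import json

# Field names follow GitHub's public GraphQL schema; the selection itself is an
# assumption about what a dashboard needs, not the query linked above.
DASHBOARD_QUERY = """
query($org: String!, $count: Int!) {
  rateLimit { cost remaining }
  organization(login: $org) {
    repositories(first: $count) {
      nodes {
        name
        pullRequests(states: OPEN) { totalCount }
        releases(last: 1) { nodes { tagName publishedAt } }
      }
    }
  }
}
"""


def build_request_body(org: str, count: int = 50) -> str:
    """Serialize the query and its variables into a JSON request body."""
    return json.dumps({
        "query": DASHBOARD_QUERY,
        "variables": {"org": org, "count": count},
    })
```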

amis92 commented 7 years ago

Looks great, but please paste it into some pastebin or other place (and replace text in issue with the link), it makes reading the issue very hard. I can't even write a sensible response because I don't see your comment when writing it. :D

aetherith commented 7 years ago

Updated. My bad, got excited and didn't even consider that it was a readability nightmare.

Two additional things I noticed while going through documentation:

amis92 commented 7 years ago

Okay, so: there might be an effort to move the webapp over to .NET Core when talking about platform. Keep Angular front-end and switch backend to .NET.

I believe using webhooks will be much more efficient time-wise and cpu-wise.
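
A rough sketch of the webhook side, in Python for brevity rather than .NET. GitHub signs each delivery with an HMAC of the request body and sends it in the `X-Hub-Signature-256` header, and the payload for repository events carries `repository.full_name`. The function names and the choice of which events trigger a refresh are assumptions.

```python
import hashlib
import hmac
from typing import Optional


def signature_is_valid(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Check the X-Hub-Signature-256 value against the raw request body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header)


def repo_to_refresh(event: str, payload: dict) -> Optional[str]:
    """Return the repo whose cache should be rebuilt, or None if irrelevant."""
    # Assumed policy: only pushes and releases invalidate cached data.
    if event not in ("push", "release"):
        return None
    return payload.get("repository", {}).get("full_name")
```

With this in place the site reacts only when something actually changes, instead of polling every repo on a timer.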

We are most definitely interested in a dashboard of some kind. Would you be able to make frontend for it?

aetherith commented 7 years ago

OK, so .NET Core instead of Java, and webhooks triggered on push instead of polling the GitHub API for repo status? As for doing a front end dashboard... I could give it a shot. To be honest my skills are much more back end oriented, but doing a responsive front end would be a good learning experience. Were you thinking of using Angular there as well, or some other framework?

amis92 commented 7 years ago

Well, it's Angular now, so sure, I assumed Angular going forward too, but if you have other preferences... I honestly have no idea what else is out there and what's good. I've heard of React and many others, but the world of JS is still unknown to me.

It'd be good to choose something near the top in popularity, as there's a greater chance someone can help us in the future.

As for the concept, yeah. That's about right.

NistrumCain commented 5 years ago

OK, @amis92 (this is me on my non-work account), I'm going to say I'll commit some time to this and see if I can help you guys out. The new API for GitHub is GraphQL-based, I think (based on a very quick look), so if I am lucky I might well be able to handle this really nicely... will have to wait and see.

I'm a React (a JS framework) dev by day, so while this isn't exactly my sort of code I should be able to get what we need out of it.

amis92 commented 5 years ago

Oh maaan. Would you also oh so love to rewrite the frontend as well sometime in future? It's really just two views (list and details). 😇

I can help with the backend any way you want.

NistrumCain commented 5 years ago

Yeah, what do you want from the frontend that you don't get now? Also, do you know what endpoints the API is calling? I was going to use Postman to run some tests.

amis92 commented 5 years ago

Let's move the frontend discussion somewhere else.

Do you mean which GitHub API? I'll investigate.

NistrumCain commented 5 years ago

Where do you want to move it to? And yeah, I just want to be able to query GitHub the way the site does and see if I can replicate the problem.

amis92 commented 5 years ago

Let's open a new issue: #228 (frontend)

For the APIs it'll be this one for release listing: https://developer.github.com/v3/repos/releases/

amis92 commented 5 years ago

Closing in favor of #243