KSP-SpaceDock / SpaceDock-Project

The transitional hub to migrate KerbalStuff.com

Data Mirroring #2

Open ghost opened 8 years ago

ghost commented 8 years ago
bookt-jacob commented 8 years ago

S3 for storage w/ CloudFront as the CDN? A long expiration time in the CDN should be very effective. Having the data at rest in S3 reduces HTTP requests to the backend servers and should prove much simpler than mirrors.
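
To make that concrete, here's a rough sketch of what the upload side might look like with boto3 (the bucket name, key layout, and cache lifetime are just placeholders, not anything SpaceDock actually uses):

```python
# Sketch only: assumes boto3 is installed and AWS credentials are configured.
# The bucket name and key layout are hypothetical.
import boto3

s3 = boto3.client("s3")

def publish_mod_file(local_path, key):
    """Upload a released mod zip to S3 with a long max-age so CloudFront
    can cache it at the edge and rarely touch the origin."""
    s3.upload_file(
        local_path,
        "spacedock-mod-files",   # hypothetical bucket
        key,                     # e.g. "mods/1234/MyMod-1.0.zip"
        ExtraArgs={
            "ContentType": "application/zip",
            "CacheControl": "public, max-age=31536000, immutable",  # ~1 year
        },
    )

# publish_mod_file("/tmp/MyMod-1.0.zip", "mods/1234/MyMod-1.0.zip")
```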

ghost commented 8 years ago

The upside of using mirrors, though, is that we can defray hosting costs.

Our budget so far is $0.

GenPage commented 8 years ago

I can provide rudimentary mirrors through DigitalOcean in all regions at no cost.

Vekseid commented 8 years ago

I can throw several TB/month into the pool.

SpaceTeph commented 8 years ago

Linux distros use rsync for mirror synchronization - here is how Arch Linux does it. Something similar to their process would let just about anyone willing, who has an HTTP server running somewhere, donate bandwidth and storage, which is sometimes easier than donating money.
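
On the mirror side that could be as small as a periodic pull from cron. A rough sketch, assuming a made-up rsync endpoint and local path:

```python
#!/usr/bin/env python3
# Sketch of a mirror-side sync job (run from cron every few hours).
# The rsync endpoint and destination directory are hypothetical.
import subprocess
import sys

SOURCE = "rsync://mirror.spacedock.example/spacedock/"
DEST = "/srv/spacedock-mirror/"

def sync():
    result = subprocess.run([
        "rsync",
        "--archive",     # preserve times/permissions so re-runs stay cheap
        "--delete",      # drop files that were removed upstream
        "--partial",     # resume interrupted transfers
        "--timeout=600",
        SOURCE,
        DEST,
    ])
    return result.returncode

if __name__ == "__main__":
    sys.exit(sync())
```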

pjf commented 8 years ago

I have to run and give a talk, but the Internet Archive is happy to host freely distributable content on their servers, which includes all FOSS/CC licensed KSP mods. They have an S3-alike API that's described here.

Yes, the Internet Archive is super-awesome. <3
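
For the freely licensed mods, the IA's Python client keeps the upload side small. A rough sketch (the item identifier, file path, and metadata below are invented for illustration):

```python
# Sketch using the Internet Archive's Python client (`pip install internetarchive`).
# Item identifier, file path, and metadata are hypothetical examples.
from internetarchive import upload

responses = upload(
    "spacedock-mymod-1-0",                        # hypothetical IA item identifier
    files=["/srv/spacedock/mods/MyMod-1.0.zip"],
    metadata={
        "title": "MyMod 1.0 (KSP mod)",
        "mediatype": "software",
        "licenseurl": "https://creativecommons.org/licenses/by-sa/4.0/",
    },
)
print(responses[0].status_code)  # 200 on success
```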

ghost commented 8 years ago

I can donate some of my bandwidth for a mirror. I can't promise much speed. But I am willing to help.

NecroBones commented 8 years ago

Internet Archive might be really great for this.

For our own mirroring options, I can spare some bandwidth too. Not on the order of what KerbalStuff was using by itself, but I have unused quota each month on my hosting, since my own websites use less than 5% of what I'm permitted, last I looked. If we get enough mirrors involved, each one's bandwidth requirement would be fairly small.

Ristellise commented 8 years ago

GitHub Pages could store it well. EDIT: I will be hosting all the OLD downloads on GitHub, so stay tuned.

NecroBones commented 8 years ago

I took a look at my Linode account, and I'd have to increase my plan to have enough disk storage for all of the current file data (since there's 62 GB of it). In terms of monthly bandwidth allowance, I have tons of room to spare. It's the disk that's really tight. I'll hold off from doing anything until we know whether we need the mirrors.

dries007 commented 8 years ago

I'm offering up part of my unlimited 250 Mbit dedicated server in Europe (it's in Roubaix, France). It only has 3x 110 GB SSDs, so I can't provide a full mirror (there is other stuff running, mostly Minecraft servers), but it might still come in handy.

brandonwamboldt commented 8 years ago

I have 500 GB of hard drive space and a 1 Gbps uplink on a dedicated server (with CloudFlare as a CDN in front of it). Would love to help out as well.

sebneira commented 8 years ago

I'll be working on the best solution for this case, as it seems that we have lots of people willing to offer mirrors + the Internet Archive.

@phmayo have you come to any conclusions?

ghost commented 8 years ago

Using the IA requires work on the backend: either on the website or in an uploading cron job. @ThomasKerman and VITAS have plenty to do as it is for the moment.

So, much as I want this, we need to focus on getting an easy way for mirrors to be activated and made available first, preferably without any impact on CKAN at all. If that isn't possible, well, we'll deal with that when it's time. Making SD resilient to failure is our top priority, so we don't get another outage like on Monday.

rsync would be the low-hanging fruit. Maybe something like Syncthing, which requires manual intervention but goes a little easier on the bandwidth.

For a mirror, 100 GB of storage and 10 TB of transfer should be plenty to get us started.

sebneira commented 8 years ago

@phmayo I just had a conversation with VITAS and we arrived at a pretty nice design for the solution, plus agreed that it's not a priority. I'll put together a workflow over the next few days.

Will look into Syncthing, thanks for sharing!

NecroBones commented 8 years ago

rsync is pretty easy, of course. Another possibility is to roll our own process to push out whole files when new ones are added, plus a nightly (or every 48 hours, or whatever) rsync to catch anything that was missed or dropped.
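
Roughly what I have in mind for the push part, as a sketch (mirror hosts and paths are invented; the periodic full rsync stays as the safety net):

```python
# Sketch of a push-on-upload hook. Mirror hosts and paths are hypothetical;
# any mirror that fails here is simply caught by the next full rsync.
import subprocess

MIRRORS = [
    "mirror1.example.net:/srv/spacedock-mirror/",
    "mirror2.example.org:/srv/spacedock-mirror/",
]

def push_new_file(local_path):
    """Copy a freshly uploaded mod file out to each mirror over SSH."""
    failures = []
    for mirror in MIRRORS:
        result = subprocess.run(["rsync", "--archive", local_path, mirror])
        if result.returncode != 0:
            failures.append(mirror)
    return failures
```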

oliverde8 commented 8 years ago

Hi, you might wish to check https://about.maniacdn.net/. It's a community-driven CDN built for another game, and the sources are public. It uses rsync to sync the files, and anyone can contribute with their own server.

dries007 commented 8 years ago

How about web caching? It's a solution that requires no control over the mirror server and no cron jobs or daemons.

sebneira commented 8 years ago

The main concern here is that anyone could easily poison files, as they are not signed.

@dries007 I don't believe that would be a solution, as it wouldn't help with transfer bandwidth or with having the data distributed in case of failure.

dries007 commented 8 years ago

Well, you are setting up a deliberate man-in-the-middle structure, but how hard would it be to add hashes to the page so that at least the people who want to can check them? That is, if you are only serving the download files out of the CDN and not also the main page.
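
Computing and storing a checksum at upload time is cheap. Something like this, roughly:

```python
# Sketch: compute a checksum when a file is uploaded, store it with the mod
# version, and print it next to the download link so anyone can verify what a
# mirror or CDN actually served.
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# e.g. sha256_of("/srv/spacedock/mods/MyMod-1.0.zip")
```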

@Sikian I disagree. I'll use nginx as an example here, but I'm pretty sure you could apply this to most web server software: you can configure nginx to serve stale content on a timeout or an HTTP 5xx error, so cached content gets served when the origin fails, and if the cache is configured properly that will be the most-requested content, which would keep you going. If you enabled caching on /static/ and on the mod files, you'd take most of the bandwidth issues away right there. Or am I wrong? (I've not implemented this at any larger scale.)
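
To spell the idea out, here is a toy version of that serve-stale behaviour, written in Python purely for readability; a real deployment would use nginx's proxy_cache and proxy_cache_use_stale, and the origin URL and cache path below are made up:

```python
# Toy illustration of "serve stale content on timeout or 5xx", not production code.
import hashlib
import os
import requests
from flask import Flask, Response

ORIGIN = "https://spacedock.example"       # hypothetical origin server
CACHE_DIR = "/var/cache/spacedock-mirror"  # hypothetical cache location
app = Flask(__name__)

def cache_file(path):
    return os.path.join(CACHE_DIR, hashlib.sha256(path.encode()).hexdigest())

@app.route("/<path:path>")
def proxy(path):
    cached = cache_file(path)
    try:
        upstream = requests.get(f"{ORIGIN}/{path}", timeout=10)
        if upstream.status_code < 500:
            # Origin answered: refresh the cache and pass the response through.
            os.makedirs(CACHE_DIR, exist_ok=True)
            with open(cached, "wb") as f:
                f.write(upstream.content)
            return Response(upstream.content, status=upstream.status_code)
    except requests.RequestException:
        pass  # timeout or connection error
    # Origin timed out or returned a 5xx: serve the last cached copy, if any.
    if os.path.exists(cached):
        with open(cached, "rb") as f:
            return Response(f.read(), status=200)
    return Response("origin unavailable", status=502)
```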

oliverde8 commented 8 years ago

@dries007 Not really; the bandwidth issue is about the network. Your server would still need to answer the same number of requests and send the same amount of data.

@Sikian I understand; I am a trusting person, but I can see where that would be going.

dries007 commented 8 years ago

@oliverde8 If multiple people have caches running, you can distribute the load, and the main server would only have to supply the user-specific data and any new files the caches don't have yet. This is basically what CloudFlare does, right? Except that we know better which files are long-term cacheable and which ones are session- or page-view-specific.

brandonwamboldt commented 8 years ago

@dries007 FYI, you can configure caching rules in CloudFlare to tell it what to store short-term or long-term (it will respect standard Cache-Control headers).
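
For example, the application could emit those headers itself so any cache in front (CloudFlare, nginx, a volunteer mirror) knows what it may keep. A sketch, using a hypothetical route and path rather than SpaceDock's actual code:

```python
# Sketch: long-lived caching for immutable mod zips, no caching for pages.
from flask import Flask, send_from_directory

app = Flask(__name__)

@app.route("/content/<path:filename>")
def mod_download(filename):
    # Released zips never change once uploaded, so caches may hold them for a year.
    response = send_from_directory("/srv/spacedock/mods", filename)
    response.headers["Cache-Control"] = "public, max-age=31536000, immutable"
    return response

@app.after_request
def default_no_cache(response):
    # Anything that didn't set its own policy (HTML, user-specific views) stays uncached.
    response.headers.setdefault("Cache-Control", "no-cache")
    return response
```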

dries007 commented 8 years ago

@brandonwamboldt I thought that was premium only, good to know.

brandonwamboldt commented 8 years ago

@dries007 There is a limit for the free account (although it will always follow cache headers so you can just set it up via Nginx/Apache). However, if SpaceDock goes with CF I've volunteered to sponsor the premium plan.