RadiusNetworks / doc_repo

MIT License
0 stars 0 forks source link

Improve ability to scale requests #6

Closed cupakromer closed 7 years ago

cupakromer commented 7 years ago

Background

The default Github v3 rate limit for unauthenticated APIs is about 60 requests per hour. After this all requests return 403 Forbidden; this in turn causes a DocRepo::NotFound (which is a misleading error) to be raised:

If you exceed the rate limit, an error response returns:

HTTP/1.1 403 Forbidden
Date: Tue, 20 Aug 2013 14:50:41 GMT
Status: 403 Forbidden
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1377013266

{
   "message": "API rate limit exceeded for xxx.xxx.xxx.xxx. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)",
   "documentation_url": "https://developer.github.com/v3/#rate-limiting"
}

Github gives us several solutions / workarounds for this:

Given this low default rate limit is easy for an app to hit this periodically. The current doc repo design doesn't provide a way for an app to implement any of the work arounds or configure things to use an authenticated request with a higher rate limit.

Proposal

Add the following configuration settings to allow for app level configuration of authenticated requests, caching API responses, and/or conditional GET support:

DocRepo.configure do |c|
  c.client_id = "OAuth2 Client Id"
  c.client_secret = "OAuth2 Client Secret"

  # Arbitrary cache store, with optional options, which responds to:
  #   - `fetch(name, options = nil)`
  #   - `write(name, value, options = nil)`
  c.cache_store = Rails.cache
  c.cache_options = {
    expires_in: (Integer(ENV['DOC_REPO_CACHE_EXPIRY']) rescue 24.hours),
    race_condition_ttyl: 2.minutes,
    namespace: :docs,
  }
end

The cache will be used to both store API responses and for internal conditional GET support without additional configuration.

Additionally, some sort of change should be made to DocRepo.respond_with or Repository#respond to support passing in the external app's conditional GET request headers. Or the design should be re-worked in such a way as to make that available through an alternative call system.

csexton commented 7 years ago

I :heart: the caching, and think it would likely eliminate the 403's from Github's API.

Do you have any specific headers in mind that you want to be able to pass into respond_with, I am all for adding that ability but wanted to make sure it solves a problem before adding codes.

cupakromer commented 7 years ago

I was focusing on cache related things with that so the normal conditional GET headers: If-None-Match (E-Tag) and If-Modified-Since. Though I did state possibly changing the object structure / message passing protocol as an alternative to passing those in.

The reason being that some modern browsers, I'm looking at you Chrome, provide one of both of these headers on a page "refresh" or "restore from previous session" when the page is in local cache. So it would be good to be able to reply to them with a 304 Not Modified - also it would potentially be faster b/c no rendering of the response.

cupakromer commented 7 years ago

I did a bit of testing today to try out some different redesign ideas. During this process I ended up using curl to request an image through the current redirect url:

curl -Is "https://github.com/RadiusNetworks/proximitykit-documentation/raw/master/docs/pk-cloud.png"
HTTP/1.1 302 Found
Server: GitHub.com
Date: Thu, 22 Jun 2017 18:31:55 GMT
Content-Type: text/html; charset=utf-8
Status: 302 Found
Cache-Control: no-cache
Vary: X-PJAX, Accept-Encoding
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 100
Access-Control-Allow-Origin: https://render.githubusercontent.com
Location: https://raw.githubusercontent.com/RadiusNetworks/proximitykit-documentation/master/docs/pk-cloud.png
X-UA-Compatible: IE=Edge,chrome=1
X-Runtime: 0.017536
Content-Security-Policy: default-src 'none'; base-uri 'self'; block-all-mixed-content; child-src render.githubusercontent.com; connect-src 'self' uploads.github.com status.github.com collector.githubapp.com api.github.com www.google-analytics.com github-cloud.s3.amazonaws.com github-production-repository-file-5c1aeb.s3.amazonaws.com github-production-upload-manifest-file-7fdce7.s3.amazonaws.com github-production-user-asset-6210df.s3.amazonaws.com wss://live.github.com; font-src assets-cdn.github.com; form-action 'self' github.com gist.github.com; frame-ancestors 'none'; img-src 'self' data: assets-cdn.github.com identicons.github.com collector.githubapp.com github-cloud.s3.amazonaws.com *.githubusercontent.com; media-src 'none'; script-src assets-cdn.github.com; style-src 'unsafe-inline' assets-cdn.github.com
Strict-Transport-Security: max-age=31536000; includeSubdomains; preload
Public-Key-Pins: max-age=5184000; pin-sha256="WoiWRyIOVNa9ihaBciRSC7XHjliYS9VwUGOIud4PB18="; pin-sha256="RRM1dGqnDFsCJXBTHky16vi1obOlCgFFn/yOhI/y+ho="; pin-sha256="k2v657xBsOVe1PQRwOsHsw3bsGT2VzIqz5K+59sNQws="; pin-sha256="K87oWBWM9UZfyddvDfoxL+8lpNyoUB2ptGtn0fv6G2Q="; pin-sha256="IQBnNBEiFuhj+8x6X8XLgh01V9Ic5/V3IRQLNFFc7v4="; pin-sha256="iie1VXtL7HzAMF+/PVPR9xzT80kQxdZeJ+zduCB3uj0="; pin-sha256="LvRiGEjRqfzurezaWuj8Wie2gyHMrW5Q06LspMnox7A="; includeSubDomains
X-Content-Type-Options: nosniff
X-Frame-Options: deny
X-XSS-Protection: 1; mode=block
X-Runtime-rack: 0.020754
X-Served-By: 7d2a2d05162492046d9960cdbc326796
Age: 0
X-GitHub-Request-Id: D5DB:4C9C:2CFB8EA:4161CAD:594C0D1B

That's when I noticed this was another redirect. Following this redirect yields:

curl -Is "https://raw.githubusercontent.com/RadiusNetworks/proximitykit-documentation/master/docs/pk-cloud.png"
HTTP/1.1 200 OK
Content-Security-Policy: default-src 'none'; style-src 'unsafe-inline'
Strict-Transport-Security: max-age=31536000
X-Content-Type-Options: nosniff
X-Frame-Options: deny
X-XSS-Protection: 1; mode=block
ETag: "fd51ed6a84d1ee36c3b0af994dea8c32c785bb29"
Content-Type: image/png
Cache-Control: max-age=300
X-Geo-Block-List:
X-GitHub-Request-Id: AB1E:1776:F1BF31:F95F97:594C0D02
Content-Length: 18411
Accept-Ranges: bytes
Date: Thu, 22 Jun 2017 18:34:09 GMT
Via: 1.1 varnish
Connection: keep-alive
X-Served-By: cache-iad2125-IAD
X-Cache: HIT
X-Cache-Hits: 1
X-Timer: S1498156449.322911,VS0,VE1
Vary: Authorization,Accept-Encoding
Access-Control-Allow-Origin: *
X-Fastly-Request-ID: a440cac380cd0e67b5ebdc40c5d8a4185be58bf1
Expires: Thu, 22 Jun 2017 18:39:09 GMT
Source-Age: 159

Note that the initial URL we redirect to does include rate limits:

X-RateLimit-Limit: 100
X-RateLimit-Remaining: 100

However, the final URL we are redirected to does NOT have rate limits!

That's when I noticed that we use two different URLs: one for redirects and one for reading files. The former uses the domain github.com while the latter uses api.github.com.

From what I can tell the rate limits break down as follows (I retested with a markdown page as well to verify behavior was consistent across file types):

Domain Limit Reset
github.com 100 N/A
api.github.com 60 1 hour
raw.githubusercontent.com None N/A

It should be noted that github.com always redirects cURL requests to the raw.githubusercontent.com domain.

I'm not sure why we're using two different URL domains. Both were introduced in the same commit 6efa102 but no details differentiating their use. @csexton and thoughts on this?

csexton commented 7 years ago

The lack or rate limits on raw.githubusercontent.com is very interesting, and would probably be better for this library.

I knew about that domain when I was originally working on this, however didn't use it since it seemed like an implementation detail on how github works. I opted to go with what seemed like a more public (less likely to change?) endpoint.

I suspect I intended to use api.github.com for everything and the fact that DocRepo::GithubFile#redirect_url uses github.com was just an oversight. Despite this oversight I like the idea of building the urls for assets to use raw.githubusercontent.com for now.

Less redirects, and no rate limits = win.