ampproject / amphtml

The AMP web component framework.
https://amp.dev
Apache License 2.0
14.89k stars 3.89k forks source link

I2I: AMP Runtime Self-Hosting #25873

Open sebastianbenz opened 4 years ago

sebastianbenz commented 4 years ago

Summary

Make it possible for publishers to self-host the AMP runtime, without breaking AMP validity.

Design document

Motivation

The benefits are:

Launch tracker

/cc @ampproject/wg-approvers @mattwomple @honeybadgerdontcare

ssantosms commented 4 years ago

I can't add comments to the design doc. I would like to comment that Bing likes this and I wanted to add our section with the Google cache that we will do the same rewrite to bing-amp.com that we do today. We already rewrite from ampproject, so, we would have to add that re-write.

ssantosms commented 4 years ago

Another question is around build dist. Will it change? We currently override the cache and 3p addresses so we can host ourselves. Is there any plan to change that?

honeybadgerdontcare commented 4 years ago

I can't add comments to the design doc. I would like to comment that Bing likes this and I wanted to add our section with the Google cache that we will do the same rewrite to bing-amp.com that we do today. We already rewrite from ampproject, so, we would have to add that re-write.

That is great to hear! I've changed permissions so anyone can comment on the design doc.

dvoytenko commented 4 years ago

Couple of notes:

  1. With the OpenJS transition, will we actually be serving JS resources from non-Google infra?
  2. For non-CDN scripts, I'd strongly suggest we require resource integrity. This way we can always fully validate this in our validator and optimize in Cache. And we could always implement the version locking of sorts this way.
honeybadgerdontcare commented 4 years ago

In regards to 1: The plan is for caches to rewrite origin URLs to serve their own version of the Runtime. So yes and no. If it's Google's cache then the JavaScript will come from Google infra. Otherwise if the Runtime is self hosted then it may not be what Google has served in the past.

In regards to 2: Wouldn't this necessitate publishers to republish ALL their AMP documents with each version change?

kristoferbaxter commented 4 years ago

On 1 -- What does the transition change about this? For 2 -- Agree about requiring resource integrity.

kristoferbaxter commented 4 years ago

In regards to 2: Wouldn't this necessitate publishers to republish ALL their AMP documents with each version change?

It wouldn't if the url used in the document contained the version number of AMP used on the document.

honeybadgerdontcare commented 4 years ago

But then if they want to move forward to another version, all documents still need to be updated. It's not sufficient for them to just update whatever was at https://example.com/amp/v0.js.

I guess I don't understand the "and optimize in Cache". The cache can always rewrite it.

If we're worried about a version being too old, there may be other options. One that was suggested was for each Runtime release to have an expiry function. If it's passed a certain time it fetches from cdn.ampproject.org or some other "failure."

TBH, I'm conflicted. Publishers should be able to use whatever they want on origin. If we're concerned that these are served by caches with a newer version then maybe there's another solution. Caches could choose not to accept origins that do this, inspect the runtime that is being served and decide not to accept them, or require some kind of assurance from the domain that it isn't X old.

kristoferbaxter commented 4 years ago

The resource integrity suggestion is one way to ensure the AMP runtime on the page doesn't meaningfully change from the cache's (and thus the document wouldn't be guaranteed to execute on the cache correctly).

I also like your idea of Caches choosing to not support versions beyond a threshold (and using the content of the self-hosted files to ensure a document is running a compliant version).

@dvoytenko – What do you think about @honeybadgerdontcare's suggestion here?

dvoytenko commented 4 years ago

The caches should definitely reject the "too old" versions. But somehow the self-hosted scripts should indicate what RTV they belong to. Resource integrity can do that, or even a simple attribute amp-rtv="12345678" can do that too. The only exception of this is when we expect the self-hosted scripts to be most recent version, but that partially defeats the reason why someone would want to self-host.

Let's look at a basic premise: a publisher wants to self-host to have their own upgrade cycle. I.e. to avoid any regressions on their site when AMP goes from V to V+1. If they don't specify V anywhere, our cache would simply "optimize" the pages to use V+1, which defeats the publisher's reasoning. So we need them to tell us the V. I don't think updating pages to go from version to version is a big deal.

One note aside: subresource integrity would not work with amp-geo extension since it's a dynamic extension.

kristoferbaxter commented 4 years ago

@dvoytenko This i2i could also improve performance on self hosted documents.

  1. For browsers who employ double or triple key caching, caching is invalidated across document domains.
  2. If the canonical domain has already established a H2/H3 connection for the document, it can reuse this connection for script payloads.
dvoytenko commented 4 years ago

@kristoferbaxter yeah, I was wondering if somehow we could improve the case for the self-hosting for the purpose of only avoiding CDN. Maybe the scripts can state that (as much as possible) they are relying on the most recent releases of AMP. We could even alerts web masters when we notice that the actual served RTV is too far away from AMP's PROD?

kristoferbaxter commented 4 years ago

I agree we should do something here, it would be a shame for documents to not get served by caches because their version of components get too outdated.

@sebastianbenz and @honeybadgerdontcare do you have some ideas about how we could address this? Perhaps as a second step?

dvoytenko commented 4 years ago

But I think we're so far finding two clear use cases, right?

  1. Want to serve a RTV of my choice to interop better with my QA cycle. I.e. the desire is to lock to a specific RTV.
  2. Want to serve the as-much-as-possible latest RTV (WEEKLY or LTS) to avoid 3rd-party CDN when possible. I.e. the desire is to follow AMP's schedule, just not the CDN.
sebastianbenz commented 4 years ago

I like the idea of encoding the AMP runtime version into the markup. This way we could use Search Console to automatically warn publishers if:

This would make hosting slightly more complicated for publishers though, because they not only need to update the JS, but also all their HTML.

+1 to @dvoytenko's summary.

dreamofabear commented 4 years ago

@Gregable suggested adding some runtime code that checks (current_time - RTV) <= max_skew and reloads from AMP CDN if false. Basically, version locking for v0.js.

Pubs could recompile and remove, though most probably won't go through the trouble. One concern is breaking everything if their CSP is misconfigured i.e. doesn't allow cdn.ampproject.org.

honeybadgerdontcare commented 4 years ago

In a previous design review there was a discussion around what RTVs are "green" and what RTVS may be "bad". Do we still have plans on publishing thatj?

I agree we should have some mechanism to say this "RTV" is bad and not supported including saying we are no longer going to rewrite certain RTVs to green. That said, the Validator does no know what RTV a host is serving, nor does Search Console. We'd have to add that ability if we want to restrict what is served/rewritten.

mdmower commented 4 years ago

@sebastianbenz I have an idea of how we could handle the difficulties amp-geo presents when self hosting a runtime.

If a cache is not able to support dynamically generated amp-geo, then fallback to fetching the country when amp-geo loads. Given that /v0/amp-geo-0.1.js is an asynchronously loaded script that appends a body class late in the page load process, this fallback isn't terrible and will only be utilized by caches that can't support the faster option of dynamically altering amp-geo. The clear advantage to this feature is that there would be no need to disable version locking or to create a "stable" version of amp-geo-0.1.js that is many-version compatible.

To support this feature, there needs to be an API from which the country code could be fetched. The AMP project could offer this feature at the CDN level (with cache modification). I suggest this be made part of the release process (not part of the build process), perhaps something like https://cdn.ampproject.org/geo-api-0.1.json returns:

{
  "country": "de"
}

Then, when amp-geo detects unpatched value {{AMP_ISO_COUNTRY_HOTPATCH}}, it queries the API instead of defaulting to "unknown".

sebastianbenz commented 4 years ago

@mdmower I like this idea!

@Gregable would speak anything against providing such an API endpoint?

Gregable commented 4 years ago

@jridgewell and @zhouyx might have more state than me.

@mdmower I'm not sure I understand what returning JSON from https://cdn.ampproject.org/geo-api-0.1.json improves over returning JS from https://cdn.ampproject.org/v0/amp-geo-0.1.js. It does not appear to change anything with regards to @sebastianbenz 's comments about independence, control, or performance.

mdmower commented 4 years ago

@Gregable From the start of my involvement in trying to find a solution for self-hosting, I've been advocating for publishers who would like to self-host the AMP runtime but do not have fancy CDNs that can perform modifications to components based on country (see Guidelines: Adding a new cache to the AMP ecosystem). The solution to this problem so far has been to self-host the entire runtime except for amp-geo, which could be fetched from cdn.ampproject.org. This raises plenty of concern: "can we still use version locking?" and "do we need to offer an 'API stable' version of amp-geo outside of the normal release process?" See sections Version Locking and Dynamic components: amp-geo in the design doc.

The suggestion I am making lets publishers self-host the entire runtime, including amp-geo, and if they do not have the capability to modify amp-geo-0.1.js at the CDN level, the component will fall-back to querying an API to determine the country. The advantage to this method is that it is far easier to make a small JSON file like { "country": "de" } API stable across many runtime versions than it is to hope that amp-geo-0.1.js from any single runtime version will work with many other runtime versions. The level of effort is also very low for the AMP project to help these publishers. The capability to modify a cached resource by country is already implemented for amp-geo-0.1.js and it simply needs to be extended to a new JSON file.

I argue that this hits both of the independence and control categories. The reason is that it broadens the self-hosting audience to include small publishers who can then modify the runtime locally (control) or host with archiving in mind [in case certain rtvs disappear from cdn.ampproject.org] (independence).

mdmower commented 4 years ago

My suggested changes to amp-geo are in draft PR #26407.

mdmower commented 4 years ago

@Gregable suggested adding some runtime code that checks (current_time - RTV) <= max_skew and reloads from AMP CDN if false. Basically, version locking for v0.js.

Pubs could recompile and remove, though most probably won't go through the trouble. One concern is breaking everything if their CSP is misconfigured i.e. doesn't allow cdn.ampproject.org.

If I understand this suggestion correctly, I think it should only be implemented for AMP pages served from the Google AMP cache. Otherwise, you're telling publishers they are not allowed to take snapshots in history to record the way in which a page was rendered by a specific version of the runtime.

Gregable commented 4 years ago

@mdmower Hopefully @jridgewell or someone else on this thread can address version locking and the possible need to upgrade the client-side implementation over time.

From my read of the cache code, I don't see why patching the country into the .json document would be any different than patching it into the .js document. The cache implementation would probably be straightforward.

Otherwise, I'm not sure I have enough context to compare the options.

cramforce commented 4 years ago

Just very generally: We'd like to avoid providing an easily usable general purpose geo location API that folks would easily starting to rely on outside of amp-geo. This is for numerous reasons.

So, I'd strongly prefer a solution that is directly tried to amp-geo in some fashion.

mattwomple commented 4 years ago

Just very generally: We'd like to avoid providing an easily usable general purpose geo location API that folks would easily starting to rely on outside of amp-geo. This is for numerous reasons.

So, I'd strongly prefer a solution that is directly tried to amp-geo in some fashion.

To play devil's advocate, doesn't the API already exist for outsiders?

(await fetch('https://cdn.ampproject.org/v0/amp-geo-0.1.js').then(r => r.text()).then(t => /"([a-z]{2}) {26}"/.exec(t)))[1]
cramforce commented 4 years ago

@mattwomple I think this is a good example of the line. If we break that regex and somebody files a bug here, then I would not in any way feel bad about just closing it.

jridgewell commented 4 years ago

I cannot make a decision on providing a amp-geo JSON API. That's way above my authority.

someone else on this thread can address version locking

There's a very real risk that the locally compiled v0.js will not be compatible with the amp-geo-0.1.js file served by the AMP JS CDN. That's one of the reasons we version lock to begin with.

But what are the chances the locally compiled version will match one of the official versions (say, /rtv/123)? Is the locally compiled version going to have the RTV "123"? Unless you're compiling from the exact same SHA, then no. So the only way you can get Runtime to request /rtv/123/amp-geo-0.1.js is if you're compiling the same SHA.

And if you have any uncommitted local modifications, then you'll get a different local RTV number. And there's no guarantee the property mangling will match (we've observed this before).

If you request the generic /v0/amp-geo-0.1.js, then we're getting really risky.

The only way local serving works without any issues is if you are capable of serving every extension. Providing a a amp–geo JSON API would solve that, but again, I can't make that decision.

mdmower commented 4 years ago

Separate topic from amp-geo:

I have been revisiting my original suggestion to use a <meta> tag to update config.urls.cdn when an AMP document is loaded in singleDoc mode. See PR #25028 and section "Full self-hosting" in the design doc. This may not be the best path forward. Consider this...

Several AMP components import urls, e.g. import {urls} from '../../../src/config';. Since imports are written at compile time, these components have the urls object hardcoded into them. As #25028 current exists, these components do not recognize changes to urls.cdn when an ampdoc instance is created.

I can see a few options to move forward, but I'm not sure which is preferred:

  1. AMP pages include the following script which is automatically stripped by Google AMP Cache optimizer:
    <script id="amp-config">
       window.AMP_CONFIG = window.AMP_CONFIG || {};
       window.AMP_CONFIG.cdnUrl = 'https://example.com/myruntime';
    </script>

    This is by far the easiest solution, as config.js already has support for initializing certain urls members via environment (AMP_CONFIG) variables. The question is whether the AMP page could still be considered valid. Is it possible to ensure the script only writes to AMP_CONFIG and does nothing else?

  2. config.js gets new methods to propagate urls.cdn changes when they are detected. It's important to note that this propagation could occur later than some components need, or would require significant rewrites to change the order of operations in all affected components to "wait for urls promise, then proceed".
  3. AMP pages include the following JSON which is automatically stripped by the Google AMP Cache optimizer:
    <script id="amp-config" type="application/json">
     {
       "urls": {
         "cdn": "https://example.com/myruntime"
       }
     }
    </script>

    config.js gets a new method to check for document and then for #amp-config. If detected, then when urls is defined, JSON properties in #amp-config are merged into the final object. The caveat is that every file that includes import {urls} from 'path/to/src/config would get a bit bulkier and would be running the exact same check (check for element, decode JSON, write to urls object).

sebastianbenz commented 4 years ago

@Gregable @mdmower @zhouyx @jridgewell and I just met and discussed the the proposal. Here is what we came up with:

  1. Publishers need to serve the runtime using the same RTV pattern used by AMP Caches (unversioned URLs will not be supported by the validator).
  2. The runtime host needs to be encoded in the document via meta tag (as described in the original design doc):
    <meta name="runtime-host" content="https://example.com/optional/path/prefix">
  3. amp-geo: we will provide an unpatched version of amp-geo.js that supports calling a publisher hosted geo endpoint to retrieve the current country. The AMP Cache will not provide such an endpoint. Instead, publishers need to configure a geo endpoint via meta tag. amp-geo.js then calls the endpoint on load to retrieve the location:

    <meta name="amp-geo-api" content="https://example.com/location.json">

I'll update the design doc with more details in the next days. Feedback is welcome :-)

zhouyx commented 4 years ago

To add to the amp-geo approach

  1. The amp-geo.js will be updated to make another roundtrip request when the country code is not presented. There will still only be one version of this component.
  2. We are aware of the delay introduced by the roundtrip. Including content further blocked by <amp-consent> waiting for the geolocation, element with amp-geo-pending further blocked, and higher chances of UI shifting caused by appending the CSS with further delay.
  3. Subresourse integrity will not be applied, so that publisher can choose to patch the country code to the amp-geo.js if possible.
  4. New features added to amp-geo.js may require changes to the json response format as well. We will notice publishers in advance if such changes are required.
mdmower commented 4 years ago
  1. Publishers need to serve the runtime using the same RTV pattern used by AMP Caches (unversioned URLs will not be supported by the validator).

Is this correct? I agree with with the rtv pattern requirement, but can't publishers still use unversioned URLs? I think they just need to follow the same logic that cdn.ampproject.org uses to fetch versioned components, e.g. if the AMP page includes script https://example.com/v0.js, and the runtime determines its own version is 123, then version locking results in components downloading from https://example.com/rtv/123/v0/amp-component-0.1.js. Briefly, the requirements are:

jridgewell commented 4 years ago

I think @mdmower is correct. I don't think it's necessary for validator to require versioned URLs in the <script> srcs. It's only required that the publisher respond to https://example.com/rtv/${RTV}/* requests so that runtime may dynamically requests scripts (either due to version locking rejecting a mismatched version, or a missing script, etc).

sebastianbenz commented 4 years ago

It's only required that the publisher respond to https://example.com/rtv/${RTV}/*

This means simple static sites cannot self-host AMP, which is one of the use cases I'd like to support. Is there no way around this?

Also: doesn't the runtime check the version again after downloading or does it assume that it has the correct version if downloading from an RTV endpoint?

mdmower commented 4 years ago

This means simple static sites cannot self-host AMP

I'm not sure how you're reaching this conclusion. A publisher can opt to use RTV specific URLs, in which case they only need to host the runtime at example.com/rtv/123/, or they can use unversioned URLs, in which case they need to host identical runtimes at example.com/ and example.com/rtv/123/.

sebastianbenz commented 4 years ago

Got it - thanks for the clarification. In this case, I'd actually prefer to enforce RTV specific URLs via the validator to keep things simple.

mdmower commented 4 years ago

Got it - thanks for the clarification. In this case, I'd actually prefer to enforce RTV specific URLs via the validator to keep things simple.

Sorry to drag this out, but I'm confused. Why? If a publisher would like to follow the same model as the AMP project where AMP pages reference a rolling AMP release, shouldn't they be able to do so?

sebastianbenz commented 4 years ago

Maybe I'm missing something. What are the advantages for a publisher in having a rolling AMP release if they need to provide an RTV'd endpoint as well?

The only reason I can see is that publishers can encode the import scripts in their templates and don't need to update it with every release. However, as self-hosting is only supported for transformed AMP, this should be done in the transformer anyway.

mdmower commented 4 years ago

@sebastianbenz The RTV paths are still necessary because dynamically loaded components use the RTV paths when version locking is applied.

I don't think there's any reason to stipulate transformation by amp-toolbox's optimizer for self-hosting to be allowed. If one of the goals of this I2I is "independence", then a publisher ought to be able to simply swap out cdn.ampproject.org with runtime.example.org in their AMP documents, add a couple of meta tags, and the runtime does the rest. None of this requires any extra work by the runtime than the changes we've discussed to read the meta tags. I don't even think the validator requires additional work to support rolling release self-hosting URLs. This all matches the AMP project so closely that support is essentially the same.

ithinkihaveacat commented 4 years ago

The self-hosting experience seems a bit bumpy, especially if the developer was expecting a self-hosted "hello, world" to involve unzip and a text editor, or minor search-and-replace surgery of an existing AMP:

I assume it's a non-goal that self-hosting is easy, perhaps make this explicit?

mdmower commented 4 years ago

@ithinkihaveacat I think that's a fair assessment. Version locking is pretty much the reason for the complex JS hosting requirements. I think a simpler hello world example could be possible without version locking, but I'm not sure anyone is onboard with that idea. I largely don't care anymore as long as we can get some self hosting support out the door in the near future. Long term, sure, I would like to see the AMP project loosen it's grip on how folks host the runtime.

mdmower commented 4 years ago

amphtml PR: https://github.com/ampproject/amphtml/pull/26829 amp-toolbox PR: https://github.com/ampproject/amp-toolbox/pull/622 Self-hosted runtime sample project: https://github.com/mdmower/amp-self-host-demo/

jpettitt commented 4 years ago

Re the amp-geo API - when you write the docs it's probably worth pointing out that there are accuracy/performance cache tradeoff considerations on the endpoint. Longer cache window means faster 2nd+ page loads but increases the risk of a wrong country. amp-geo uses 30 min if I recall correctly.

mdmower commented 4 years ago

While it might benefit from some reorganization, I've been preparing a "hosting howto": https://gist.github.com/mdmower/b56e94f0dc36beafb825b0c5e31fceef . Is this something that should be included in #26829 or should it have its own PR where it can be independently scrutinized?

sebastianbenz commented 4 years ago

@mdmower that guide is amazing! Thanks a lot for starting this. Please create a different PR for this and assign @CrystalOnScript as a reviewer.

mdmower commented 4 years ago

PR for AMP framework hosting guide: https://github.com/ampproject/amphtml/pull/27100 (in draft status until #26829 is approved)

mdmower commented 4 years ago

One of the open questions at the end of the original design doc is "Do we want to ship a zip containing all AMP runtime assets with every AMP release, e.g. via NPM?" I would like to see this happen but I don't think there's any way I can contribute (pretty sure only Googlers with access to internal release code/scripts can work on this). Does this task need a new Issue created?

sebastianbenz commented 4 years ago

New issue would be great! Thanks!

westonruter commented 3 years ago

Can we allow for cache-busting query parameters to be added to the versioned scripts? This would make it much easier to juggle writing the scripts to the filesystem and not having to worry about keeping old versions around. In other words, instead of:

<script async src="https://example.com/amp/rtv/012002290616360/v0.js"></script>

Could we support:

<script async src="https://example.com/amp/v0.js?rtv=012002290616360"></script>

For the WordPress plugin, we'd probably be writing the scripts to the filesystem in a directory like wp-content/amp-scripts, so that would look like:

<script async src="https://example.com/wp-content/amp-scripts/v0.js?rtv=012002290616360"></script>

And whenever our scheduled tasks runs to update the AMP scripts, we'd just have to update that one wp-content/amp-scripts/v0.js file as opposed to adding a new wp-content/amp-scripts/rtv/012002290616360, writing v0.js into it, and then deciding which previous rtv directory can be safely deleted based on page cache times.

jridgewell commented 3 years ago

We can, but it requires two changes:

  1. Extension parsing now needs to handle ?rtv= param: https://github.com/ampproject/amphtml/blob/0a4f28343650bb2d0d9627652df52b2de4526c4c/src/service/extension-script.js#L96-L110
  2. Extension insertion needs to request with ?rtv=: https://github.com/ampproject/amphtml/blob/0a4f28343650bb2d0d9627652df52b2de4526c4c/src/service/extension-script.js#L52-L63