ampproject / amphtml

The AMP web component framework.
https://amp.dev
Apache License 2.0
14.9k stars 3.89k forks source link

Self hosting: update validation rules #27546

Open sebastianbenz opened 4 years ago

sebastianbenz commented 4 years ago

See #25873 for context.

For supporting AMP runtime self-hosting the validation rules need to be adjusted:

<!doctype html>
<html ⚡>
<head>
  <meta name="runtime-host" content="https://example.com">
  <meta name="amp-geo-api" content="https://example.com/geo">
  <script async custom-element="amp-experiment" src="https://example.com/rtv/001515617716922/v0/amp-experiment-0.1.js"></script>
  <script async src="https://example.com/rtv/001515617716922/v0.js"></script>
  ...
</head>
<body></body>
</html>

This requires the following updates to the existing validation rules:

  1. meta name=runtime-host

    1. The runtime host declaration needs to be placed directly after the <head>:

      <head>
      <meta name="runtime-host" content="https://example.com">
      ...
    2. content needs to be a valid URL using the HTTPS protocol.

  2. meta name=amp-geo-api

    1. The amp-geo API endpoint declaration needs to be placed directly after the meta name=runtime-host:

      <head>
       <meta name="runtime-host" content="https://example.com">
       <meta name="amp-geo-api" content="https://example.com/geo">
      ...

      Note: the order between runtime-host and amp-geo-api does not matter, I'm not sure if that could be easily expressed with the current validation rule semantics.

    2. content needs to be a valid URL using the HTTPS protocol.

  3. v0.js, custom-element and custom-template

    All AMP script includes can be imported from a different host by replacing https://cdn.ampproject.org with a custom host URL https://example.com/amp. Everything after https://cdn.ampproject.org needs to stay the same.

     <script async src="$HOST/v0.js"></script>
     <script async custom-element="amp-list" src="$HOST/v0/amp-list-0.1.js"></script>
     <script async custom-template="amp-mustache" src="$HOST/v0/amp-mustache-0.2.js"></script>

    The following rules apply:

    1. $HOST must always be the same for all script imports.
    2. $HOST must be a valid URL using the HTTPS protocol.
    3. $HOST must be equal to <meta name="runtime-host" content="$HOST">
    4. Script paths for v0.js, custom-element and custom-template must stay the same:

      <script async src="$HOST/v0.js"></script>
      <script async custom-element="$ELEMENT_NAME" src="$HOST/v0/$ELEMENT_NAME-$VERSION.js"></script>
      <script async custom-template="$TEMPLATE_NAME" src="$HOST/v0/$TEMPLATE_NAME-$VERSION.js"></script>

Open question: should runtime self-hosting only be supported for transformed AMP? See also the discussion here.

There are two valid view points:

  1. Yes: self-hosting is mostly relevant for AMP first sites. AMP First sites should publish transformed AMP because of the greatly improved loading performance. Tying self-hosting to transformed AMP further emphasizes the importance of serving transformed AMP.
  2. No: self-hosting is completely independent of transformed AMP and hence should not make it harder for publishers to self-host the runtime by requiring them to also serving optimized AMP.

//cc @Gregable @honeybadgerdontcare @mdmower @ampproject/wg-caching @ampproject/wg-runtime

mdmower commented 4 years ago

The runtime host declaration needs to be placed directly after the <head>

The amp-geo API endpoint declaration needs to be placed directly after the meta name=runtime-host

The only requirement imposed by https://github.com/ampproject/amphtml/pull/26829 is that <meta name="runtime-host"> and <meta name="amp-geo-api"> must appear before the first <script> in the <head>. This puts these meta tags in the "metaOther" group according to the ReorderHeadTransformer in amp-toolbox's optimizer:

https://github.com/ampproject/amp-toolbox/blob/9315bada990758789df251d64a37487cae29591a/packages/optimizer/lib/transformers/ReorderHeadTransformer.js#L61-L66

I'm only just starting to look at validator rules, so I'm not sure whether this kind of rule is possible, but wanted to at least point it out.

v0.js, custom-element and custom-template

This looks correct.

should runtime self-hosting only be supported for transformed AMP

I'm in the "No" category, here. This raises the barrier to entry. We shouldn't force feed optimizer, but rather encourage its use.

mdmower commented 4 years ago

This is about as far as I got:

# Alternate runtime host metadata, name=runtime-host
# https://github.com/ampproject/amphtml/pull/26829
tags: {
  html_format: AMP
  tag_name: "META"
  spec_name: "meta name=runtime-host"
  mandatory_parent: "HEAD"
  attrs: {
    name: "content"
    mandatory: true
    value_url: {
      protocol: "https"
    }
  }
  attrs: {
    name: "name"
    mandatory: true
    value: "runtime-host"
    dispatch_key: NAME_VALUE_DISPATCH
  }
  # TODO: spec_url: "<url to amp hosting guide>"
}
# amp-geo API metadata, name=amp-geo-api
# https://github.com/ampproject/amphtml/pull/26829
tags: {
  html_format: AMP
  tag_name: "META"
  spec_name: "meta name=amp-geo-api"
  mandatory_parent: "HEAD"
  attrs: {
    name: "content"
    mandatory: true
    value_url: {
      protocol: "https"
    }
  }
  attrs: {
    name: "name"
    mandatory: true
    value: "amp-geo-api"
    dispatch_key: NAME_VALUE_DISPATCH
  }
  # TODO: spec_url: "<url to amp hosting guide>"
}

# ...and blacklist
# "runtime-host|"
# in spec "meta name= and content="

I thought the validator already had support for rtv-specific script URLs like https://cdn.ampproject.org/rtv/012004030010070/v0.js, but alas, I remembered incorrectly:

I'm not really sure how to enforce tag ordering or host equivalence between multiple tags. This task might be better suited for a more experienced validation developer (or if someone really wants to hold my hand through this, I'm still more than happy to keep working on it).

sebastianbenz commented 4 years ago

@Gregable can you please take a look whether the proposed rules are feasible with the current validation rule semantics?

Gregable commented 4 years ago

Apologies for the sluggishness in responding.

Ordering There is currently no mechanism for the validator to enforce child element ordering that are suitable to the task.

The similar ones are:

We'd need to implement something. Something like mandatory_first_child, similar to the above could work pretty well except for:

<meta charset="utf-8">

Really should be the first child of <head>. Until the charset is encountered, the browser doesn't know how to handle multi-byte characters, which this host name could potentially include unless we restrict it to punycode domains, which would not be great.

(Side note: <html ⚡> is technically a problem here).

This is partially tangential at the moment, because I doubt we could add the charset ordering requirement now anyway: that ship has sailed. The important part is that requiring something other than charset to be first is problematic. We would thus need a rule something like "if runtime-host is present, then charset must be first and runtime-host must be second". or "runtime-host can either be first or 2nd behind charset". This is a messy rule to write however.

On the other hand, a rule of "runtime-host if present must be before all <script> tags", is a far less messy rule and error message. It too would require some new validation mechanism, but that's not a problem.

runtime-host Are we certain we need the runtime-host parameter at all? The first script tag will implicitly include it, and it should be easy to parse out with some introspection, no?

src requirements The requirements that allow us to do more complex parsing of src attributes for script tags will need some new validation mechanism, but I don't see any major problems with implementing such.

P2 How urgent is this change?

mdmower commented 4 years ago

On the other hand, a rule of "runtime-host if present must be before all <script> tags", is a far less messy rule and error message. It too would require some new validation mechanism, but that's not a problem.

This sounds promising and worthwhile to pursue to me (if only I was the one signing your paychecks...)

runtime-host Are we certain we need the runtime-host parameter at all? The first script tag will implicitly include it, and it should be easy to parse out with some introspection, no?

This would probably increase the size of config.js, which is imported into several files at build time: find first script (not allowed to assume head is fully parsed), get URL part that precedes v0/ or v0.js, hope that dev didn't define their paths like example.com/v0/amp/v0.js, and then set url prefix in config. I'm not against the idea, per-say, I'm just wary of the resistance I could face in such a PR.

Gregable commented 4 years ago

@jridgewell could you help determine if such an approach (runtime determining the host parameter) would be reasonable or not before we invest in the approach?

jridgewell commented 4 years ago

A few technical points:

The runtime host declaration needs to be placed directly after the <head>:

I agree with @Gregable, we should actually prioritize the <meta charset="utf-8"> as the first child with runtime/geo being directly after. I believe certain browsers will only scan the first few kb for this meta, before giving up.

We can loosen this to just "before the first <script>" tag, though. I actually like this approach, giving CMS the most flexibility.

runtime-host Are we certain we need the runtime-host parameter at all? The first script tag will implicitly include it, and it should be easy to parse out with some introspection, no?

This would probably increase the size of config.js, which is imported into several files at build time:

I think he's actually suggesting that this isn't in config.js URLs at all. We would query for script[src$="/v0.js"], and then parse out the domain and dirname from the src. This should only affect the few scripts the directly import src/service/extension-location.js: v0.js, analytics, and amp-script. Maybe 1 or 2 others?

I like this approach, too. It means we no longer need to strip the <meta runtime> element when serving from cache, since the cache's should be rewriting the v0.js script src already.

mdmower commented 4 years ago

I think he's actually suggesting that this isn't in config.js URLs at all. We would query for script[src$="/v0.js"], and then parse out the domain and dirname from the src. This should only affect the few scripts the directly import src/service/extension-location.js: v0.js, analytics, and amp-script. Maybe 1 or 2 others?

I'm wary of this suggestion because it leaves urls.cdn at its default value, which other scripts could then interpret to mean resources (additional css for example) should come from cdn.ampproject.org instead of example.com. I think this needs to be done in config.js, not just in extension-location.js for extension URLs generation.

mdmower commented 4 years ago

@Gregable @jridgewell @sebastianbenz I don't think changes are necessary to our current self-hosting pattern (meta tags before scripts) and I suggest we tone-down the change request for the validator.

The validator is concerned with making sure AMP pages are ready to be cached. Since the validator could expect that the AMP cache transformer would perform certain cleanup, fewer adjustments to validation are necessary. Notably, the validator could completely ignore the meta tags:

  1. Transformer should strip <meta name="runtime-host"> and <meta name="amp-geo-api">.
  2. Even if these meta tags are not stripped, they are ineffective when an AMP page is served from the cache. This is by design.

Then, the only adjustment necessary in the validator is to ensure all script URLs can be rewritten to cdn.ampproject.org. The validator could use <script src="https://example.com/v0.js"> to identify the base AMP runtime URL and then if any scripts are detected that do not match script[src^="https://example.com/"], validation would fail.

Using the validator to ensure <meta name="runtime-host"> and <meta name="amp-geo-api"> appear before any scripts would be a "nice to have" (maybe a warning instead of an error?), but I don't see it as the validator's responsibility to make sure publishers are self-hosting correctly. It just needs to make sure an AMP page can be served from the cache.

PlumbersNearMe commented 4 years ago

I have implemented on my production website, so far it seems to work great. Although I am getting some errors in the console, it affects the best practices score in lighthouse as well. I am getting about 6 errors of the same type each page load. They are all similar to this:

Should never attempt to set cookie on proxy origin.

Are these something I am able to remedy on my end? I built the amp run time and implemented according to the awesome guide from mdmower. I am excited to see how it performs over time and how google serps will treat it.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

westonruter commented 2 years ago

Still needed.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.