SkygearIO / features

Feature Tracking Repo for Skygear
Apache License 2.0

prerender options for deployments for SSR to solve SEO issues #310

Open chpapa opened 5 years ago

chpapa commented 5 years ago

Description

Deployment:

deployments:
  static-index:
   type: static
   src: index.html
   path: /
   seo-prerender: true
   seo-prerender-expiry: 3600s
   seo-prerender-capture-timeout: 20s 
   seo-prerender-user-agent: googlebot
   seo-prerender-url:
     - /
     - /unreachable-from-root
   seo-prerender-config: prerender-common.yaml

golang:

chromedp to generate dom

Details:

Scenario

When the content of a SPA needs SEO

API Design

Add an attribute use-prerender-for-seo: bool to existing CF configuration (Storage format of config doesn't seem to have been specified in api.md so I am specifying type here anyways)

Potential additional attributes:

  • seo-prerender-config: text: Path to a YAML file that groups the other attributes. Or accept a string so that people can pipe to it? What is the convention in skygear here?
  • seo-prerender-expiry: text: The amount of time a cached DOM is kept. Support postfix time format? Default to 1 day (to align with prerender.io)? Do we have a convention for this elsewhere in skygear?
  • seo-prerender-capture-timeout: number: The amount of time the service waits before it captures the DOM. Defaults to 0, which means the service captures after the initial batch of JavaScript has run.
  • seo-prerender-seo-agent: text: A comma-separated list of user agents that trigger prerender. (Or do we have a better format?)

Support configuration overriding, i.e attributes in deployments override those which are specified in prerender-config?
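To make the overriding question concrete, here is a minimal Go sketch of one possible answer: attributes set on the deployment item win over those loaded from the shared config file. The type `PrerenderOptions`, its fields, and `Merge` are hypothetical names for illustration only; nothing here is an existing skygear API.

```go
package main

import "fmt"

// PrerenderOptions is a hypothetical in-memory form of the prerender
// attributes. Pointer fields distinguish "not set" from a zero value.
type PrerenderOptions struct {
	ExpirySeconds         *int
	CaptureTimeoutSeconds *int
	UserAgents            []string
}

// Merge returns base with every attribute that override sets replaced.
// base would come from the shared config file, override from the item.
func Merge(base, override PrerenderOptions) PrerenderOptions {
	out := base
	if override.ExpirySeconds != nil {
		out.ExpirySeconds = override.ExpirySeconds
	}
	if override.CaptureTimeoutSeconds != nil {
		out.CaptureTimeoutSeconds = override.CaptureTimeoutSeconds
	}
	if override.UserAgents != nil {
		out.UserAgents = override.UserAgents
	}
	return out
}

func intPtr(v int) *int { return &v }

func main() {
	common := PrerenderOptions{ExpirySeconds: intPtr(86400), UserAgents: []string{"googlebot"}}
	item := PrerenderOptions{ExpirySeconds: intPtr(3600)}
	merged := Merge(common, item)
	fmt.Println(*merged.ExpirySeconds, merged.UserAgents) // prints: 3600 [googlebot]
}
```

Using pointers for the numeric fields keeps an explicit `0` (for example a capture timeout of 0) distinguishable from an attribute that was simply left out.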

Open Questions:

  1. Do we implement the re-routing part by hard-coding configuration inspection of the particular relevant attributes, or do we build a route customizer to allow re-route by inspection of configuration attributes and http request headers, then implement the re-routing using said customizer?
  2. Where do we store the cache and the relevant logic? Handling it in skygear directly means we either have to create another http-service as a middleman that stores results from the prerender service, or make caching a first-class citizen of skygear somehow. Handling it in the prerender service means it is indeed a standalone microservice, but I am not sure if that fits skygear's feature set model (i.e. monolithic vs micro).

Related Issues

roxk commented 5 years ago

Description

Deployment:

deployments:
  static-index:
   type: static
   src: index.html
   path: /
   use-prerender-for-seo: true
   prerender-config: prerender-common.yaml
   prerender-expiry: 3600s
   prerender-capture-timeout: 20s 
   prerender-seo-agent: googlebot

golang:

chromedp to generate dom

Details:

Scenario

When the content of a SPA needs SEO

API Design

Add an attribute use-prerender-for-seo: bool to existing CF configuration (Storage format of config doesn't seem to have been specified in api.md so I am specifying type here anyways)

Potential additional attributes:

  • prerender-config: text: Path to a YAML file that groups the other attributes. Or accept a string so that people can pipe to it? What is the convention in skygear here?
  • prerender-expiry: text: The amount of time a cached DOM is kept. Support postfix time format? Default to 1 day (to align with prerender.io)? Do we have a convention for this elsewhere in skygear?
  • prerender-capture-timeout: number: The amount of time the service waits before it captures the DOM. Defaults to 0, which means the service captures after the initial batch of JavaScript has run.
  • prerender-seo-agent: text: A comma-separated list of user agents that trigger prerender. (Or do we have a better format?)

Support configuration overriding, i.e attributes in deployments override those which are specified in prerender-config?

Open Questions:

  1. Do we implement the re-routing part by hard-coding configuration inspection of the particular relevant attributes, or do we build a route customizer to allow re-route by inspection of configuration attributes and http request headers, then implement the re-routing using said customizer?
  2. Where do we store the cache and the relevant logic? Handling it in skygear directly means we either have to create another http-service as a middleman that stores results from the prerender service, or make caching a first-class citizen of skygear somehow. Handling it in the prerender service means it is indeed a standalone microservice, but I am not sure if that fits skygear's feature set model (i.e. monolithic vs micro).
chpapa commented 5 years ago

@roxk Thanks! Some comments:

use-prerender-for-seo: true
prerender-config: prerender-common.yaml
prerender-expiry: 3600s
prerender-capture-timeout: 20s
prerender-seo-agent: googlebot

So it could be something like:

seo-prerender: true
seo-prerender-expiry: 3600s
seo-prerender-capture-timeout: 20s 
seo-prerender-user-agent: googlebot
  • Reserve a path for skygear's internal use (is it a thing in skygear?)
  • Register a http-service internally (as if by manual configuration) using said path

Probably need @carmenlau @louischan-oursky input, my thought:

  • Router re-route path whose type is static and with use-prerender-for-seo: true to said path if user-agents of the request matches a list configured by developer

I think there is a wrong assumption: pre-render is not just for static content. It is useful for cloud functions or microservices too.

  1. Do we implement the re-routing part by hard-coding configuration inspection of the particular relevant attributes, or do we build a route customizer to allow re-route by inspection of configuration attributes and http request headers, then implement the re-routing using said customizer?

As mentioned above, I think this likely needs to be done by @carmenlau or @louischan-oursky. On your question, I think it is possible to build the route based on the UA header... but I'm open to hard-coding if it is much easier and performs better.

2. Where do we store the cache and the relevant logic? Handling it in skygear directly means we either have to create another http-service as a middleman that stores results from the prerender service, or make caching a first-class citizen of skygear somehow. Handling it in the prerender service means it is indeed a standalone microservice, but I am not sure if that fits skygear's feature set model (i.e. monolithic vs micro).

Standalone microservices (we call it "Gear" in Skygear, like "Auth Gear")

roxk commented 5 years ago

seo-prerender: true
seo-prerender-expiry: 3600s
seo-prerender-capture-timeout: 20s
seo-prerender-user-agent: googlebot

My concern is just that there might be other triggers for prerender in the future, then seo-prerender-expiry might be misleading...I'm not familiar with the demand of prerender so this concern might be groundless and can be safely ignored.

I think there is a wrong assumption: pre-render is not just for static content. It is useful for cloud functions or microservices too.

Got it.

Standalone microservices (we call it "Gear" in Skygear, like "Auth Gear")

If I'm reading it correctly it means that you think the prerender gear should handle prerender+caching, as if this gear is an exact clone of prerender.io without their auth part. Am I correct?

Off topic, I'm quite confused about the distinction between http-service and gear in terms of both the concept and actual implementation details, since from my point of view they are both "just" a microservice. Does it have to do with load-balancing/scheduling etc? Or is it about access rights such that gear is completely private from the developer? Where can I read more about that? I tried searching gear in the feature repo but it only yields limited result briefly mentioning auth gear. Thanks.

chpapa commented 5 years ago

My concern is just that there might be other triggers for prerender in the future, then seo-prerender-expiry might be misleading...I'm not familiar with the demand of prerender so this concern might be groundless and can be safely ignored.

True, but I guess let's just tackle it later when we have the use case (one lesson learnt on developing Skygear is if we try to imagine too much on future features we will make things too complicated and difficult to use...)

If I'm reading it correctly it means that you think the prerender gear should handle prerender+caching, as if this gear is an exact clone of prerender.io without their auth part. Am I correct?

Off topic, I'm quite confused about the distinction between http-service and gear in terms of both the concept and actual implementation details, since from my point of view they are both "just" a microservice. Does it have to do with load-balancing/scheduling etc? Or is it about access rights such that gear is completely private from the developer? Where can I read more about that? I tried searching gear in the feature repo but it only yields limited result briefly mentioning auth gear. Thanks.

Yeah, sorry, we don't have enough documentation for now. Simply put, http-services are microservices developed by users of Skygear. They run on the Skygear Cloud Platform (based on fission), and each Skygear app has its own microservices.

Gears run on their own and are multi-tenant (so one gear daemon can serve multiple apps/tenants). Skygear Auth is the only gear in Skygear 2.0 for now, but we will have more soon, and we expect Skygear developers to be able to develop new gears relatively easily.

roxk commented 5 years ago

So http-service is per app and gear is per deployment. Got it. I'm trying to write the prototype of the gear now. Now I know I missed some information: I need to specify the gear's input. I think I will copy prerender.io for now and see where it goes.

louischan-oursky commented 5 years ago

Just had an in-person chat with @roxk and @carmenlau.

Here is the summary.


Roxk will continue to work on the prototype of the prerender gear.


The prerender gear is an HTTP server with the following API:

GET /?u=<url>

The cache storage is assumed to be Redis.


If the content type is not text/html, the simplest handling is to return it as is and not cache anything. We may need further discussion on this.


We need to teach the gateway to route a particular path to the prerender gear if the path is prerender enabled and the HTTP user-agent header qualifies as a web crawler.
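The gateway decision described above can be sketched in a few lines of Go. This is only an illustration under stated assumptions: the function names are hypothetical, and case-insensitive substring matching against the configured tokens (such as `googlebot`) is one possible reading of "qualifies as a web crawler", not a decided rule.

```go
package main

import (
	"fmt"
	"strings"
)

// isCrawler reports whether userAgent contains any configured token.
// Case-insensitive substring matching is an assumption of this sketch.
func isCrawler(userAgent string, tokens []string) bool {
	ua := strings.ToLower(userAgent)
	for _, t := range tokens {
		t = strings.ToLower(strings.TrimSpace(t))
		if t != "" && strings.Contains(ua, t) {
			return true
		}
	}
	return false
}

// shouldPrerender combines the two conditions from the summary: the path
// is prerender enabled and the request comes from a configured crawler.
func shouldPrerender(prerenderEnabled bool, userAgent string, tokens []string) bool {
	return prerenderEnabled && isCrawler(userAgent, tokens)
}

func main() {
	tokens := []string{"googlebot"}
	bot := "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
	browser := "Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0"
	fmt.Println(shouldPrerender(true, bot, tokens))     // true
	fmt.Println(shouldPrerender(true, browser, tokens)) // false
	fmt.Println(shouldPrerender(false, bot, tokens))    // false
}
```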

chpapa commented 5 years ago
  • Otherwise, it initiates a HTTP request to the url.

Ideally, the gear also comes with a scheduler that regularly caches the pages, so most results come from the cache instead of being rendered on demand. This matters because the feature is for SEO purposes, and response speed is a factor in Google page ranking.

(But that could be 2nd phase)

carmenlau commented 5 years ago

Yes, you are right. So in the first version we plan to separate the cache store, so that a scheduler can be added on top to regularly update it.
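A rough Go sketch of what "separating the cache store" could look like: the render path only talks to a small store interface, so a later scheduler could write to the same store. The `CacheStore` interface, the in-memory `memoryStore` (standing in for Redis, which the thread assumes), and `Prerender` are all hypothetical names, not the gear's actual code.

```go
package main

import (
	"fmt"
	"time"
)

// CacheStore is a hypothetical boundary between rendering and caching.
// A scheduler could call Set on the same store to refresh pages.
type CacheStore interface {
	Get(url string) (html string, ok bool)
	Set(url, html string, expirySeconds int64)
}

type entry struct {
	html      string
	expiresAt int64 // Unix seconds
}

// memoryStore is an in-memory stand-in for Redis, used only in this sketch.
type memoryStore struct {
	now   func() int64
	items map[string]entry
}

func (m *memoryStore) Get(url string) (string, bool) {
	e, ok := m.items[url]
	if !ok || m.now() >= e.expiresAt {
		return "", false
	}
	return e.html, true
}

func (m *memoryStore) Set(url, html string, expirySeconds int64) {
	m.items[url] = entry{html: html, expiresAt: m.now() + expirySeconds}
}

// Prerender returns the cached DOM, or renders and stores it on a miss.
func Prerender(store CacheStore, render func(string) (string, error), url string, expirySeconds int64) (string, error) {
	if html, ok := store.Get(url); ok {
		return html, nil
	}
	html, err := render(url)
	if err != nil {
		return "", err
	}
	store.Set(url, html, expirySeconds)
	return html, nil
}

func main() {
	store := &memoryStore{now: func() int64 { return time.Now().Unix() }, items: map[string]entry{}}
	calls := 0
	render := func(u string) (string, error) { // placeholder for the real renderer
		calls++
		return "<html>" + u + "</html>", nil
	}
	Prerender(store, render, "https://abc.com/", 3600)
	Prerender(store, render, "https://abc.com/", 3600)
	fmt.Println(calls) // prints: 1
}
```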

carmenlau commented 5 years ago

Btw, I found an interesting feature of prerender.io. Instead of using a fixed timeout, user can also define when the page is ready for capture. See Is your page only partially rendered? in https://prerender.io/documentation/test-it.

roxk commented 5 years ago

Yeah, but that is essentially making the site aware of the presence of prerender.io, which I'm not sure is a good thing™...

chpapa commented 5 years ago

Btw, I found an interesting feature of prerender.io. Instead of using a fixed timeout, user can also define when the page is ready for capture. See Is your page only partially rendered? in https://prerender.io/documentation/test-it.

I can see it being useful in some edge cases...

roxk commented 5 years ago

I just had a brief discussion with @carmenlau on how prerender service should "read" its configuration. Summary:

  1. The service does not read configuration directly. Instead, it accepts parameters (as HTTP headers) in addition to the URL to prerender. The parameters are:
    • App id
    • Item id
    • Expiry time
    • Wait time until render

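Following the existing naming convention, converting them to headers yields, respectively:

    x-skygear-prerender-appid
    x-skygear-prerender-itemid
    x-skygear-prerender-expiry-seconds
    x-skygear-prerender-render-wait-seconds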

Since each appid/itemid can potentially render the same url with different expiry configuration, the key of each cached DOM needs to contain appid/itemid for unique identification.

Expiry time is expected to be in seconds. Support for custom time literals such as "3600s" or "1d" is delegated to the deployment mechanism/configuration parser.

Expiry time is only applied upon a cache miss. In the future, when a periodic prerender job is introduced, an optional x-skygear-prerender-force-render can be added to invalidate the cache and prerender in a single request.

Render wait time is also expected to be in seconds.

  2. For the sake of consistency, the previous GET /?u=<url> is deprecated. An extra header, x-skygear-prerender-url, will be used to provide the URL to prerender instead. Together with the headers introduced in (1), we now have a total of 5 headers for the prerender service:

    x-skygear-prerender-appid
    x-skygear-prerender-itemid
    x-skygear-prerender-url
    x-skygear-prerender-expiry-seconds
    x-skygear-prerender-render-wait-seconds
  3. A cache-invalidation mechanism should be provided. A typical use case is when users change the expiry time of their deployment items: upon re-deployment, the configuration change is detected and the caches in the prerender service are invalidated accordingly, so that the new configuration is applied on the next prerender request or the next scheduled periodic prerender job.

A new end point is proposed for this feature:

POST /invalidate

The request body is expected to be JSON specifying which app ids and item ids to invalidate. Example:

[
    {
        "appid":"appOne",
        "itemids":["MyAwesomeHttpService", "MyCloudFunction"]
    },
    {
        "appid":"appTwo",
        "itemids":["MyCloudFunction"]
    }
]
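As a sketch of how the service might consume this body, the Go snippet below decodes the example payload and turns it into cache-key prefixes to delete. The `appid/itemid` key layout and the names `invalidateEntry` and `keyPrefixes` are assumptions made for illustration; the proposal only says that the key needs to contain both ids.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// invalidateEntry mirrors one element of the example payload above.
type invalidateEntry struct {
	AppID   string   `json:"appid"`
	ItemIDs []string `json:"itemids"`
}

// keyPrefixes decodes the request body and returns one "appid/itemid"
// prefix per pair. The prefix layout is an assumption of this sketch.
func keyPrefixes(body []byte) ([]string, error) {
	var entries []invalidateEntry
	if err := json.Unmarshal(body, &entries); err != nil {
		return nil, err
	}
	var prefixes []string
	for _, e := range entries {
		for _, item := range e.ItemIDs {
			prefixes = append(prefixes, e.AppID+"/"+item)
		}
	}
	return prefixes, nil
}

func main() {
	body := []byte(`[
		{"appid":"appOne","itemids":["MyAwesomeHttpService","MyCloudFunction"]},
		{"appid":"appTwo","itemids":["MyCloudFunction"]}
	]`)
	prefixes, err := keyPrefixes(body)
	fmt.Println(prefixes, err)
}
```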

Open questions/Observations:

  1. As I was typing this out, I noticed that each cached DOM is identified by app id and item id, which is leaking skygear's internal architecture to prerender service. Any changes to the appid/itemid scheme would propagate to prerender service. Should we accept a more general x-skygear-prerender-requestid such that prerender is more resistant to change? Deployment mechanism/gateway should be able to merge appid/itemid as a single requestid on their own.
louischan-oursky commented 5 years ago

@roxk Could you update the description above per our offline discussion? Thanks!

roxk commented 5 years ago

Summary:

  1. Headers are no longer used for parameters. Query parameters are used instead.
  2. Prerender service no longer has knowledge of appid/itemid. It invalidates caches based on URLs alone. This is possible because
    • Each app has a unique domain in url, so app id is not needed.
    • Although each item has multiple paths, gateway/other services would provide all such sub paths
  3. New api:

    • Prerender

    • list of all query parameters

      u=<url>
      expiry=<seconds> (optional)
      renderWait=<seconds> (optional)

      Default value for expiry is 3600. Default value for render wait is 0.

    • Cache invalidation

    • JSON request body: a list of domains to invalidate. The prerender service invalidates all caches whose key contains any entry specified in the list.

      ["http://abc.com/", "http://edf.com/"]
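For illustration, parsing the prerender query parameters with the stated defaults (expiry 3600, render wait 0) could look like the Go sketch below. The names `prerenderParams`, `parseParams`, and `intParam` are hypothetical, and rejecting negative values is an assumption the thread does not cover.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// prerenderParams holds the three query parameters listed above.
type prerenderParams struct {
	URL               string
	ExpirySeconds     int
	RenderWaitSeconds int
}

// intParam reads an optional non-negative integer, falling back to def.
func intParam(q url.Values, name string, def int) (int, error) {
	v := q.Get(name)
	if v == "" {
		return def, nil
	}
	n, err := strconv.Atoi(v)
	if err != nil || n < 0 {
		return 0, fmt.Errorf("invalid %s: %q", name, v)
	}
	return n, nil
}

// parseParams applies the defaults from the summary: expiry 3600, renderWait 0.
func parseParams(rawQuery string) (prerenderParams, error) {
	q, err := url.ParseQuery(rawQuery)
	if err != nil {
		return prerenderParams{}, err
	}
	p := prerenderParams{URL: q.Get("u")}
	if p.URL == "" {
		return prerenderParams{}, fmt.Errorf("missing required query parameter u")
	}
	if p.ExpirySeconds, err = intParam(q, "expiry", 3600); err != nil {
		return prerenderParams{}, err
	}
	if p.RenderWaitSeconds, err = intParam(q, "renderWait", 0); err != nil {
		return prerenderParams{}, err
	}
	return p, nil
}

func main() {
	p, err := parseParams("u=https%3A%2F%2Fabc.com%2F&renderWait=20")
	fmt.Printf("%+v %v\n", p, err)
}
```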
louischan-oursky commented 5 years ago

It would be great if we support multiple paths in cache invalidation

Given a payload like

{
  "origins": [
    { "origin": "https://abc.com", "paths": ["/a", "/b"] },
    { "origin": "https://def.com", "paths": ["/foo", "/bar"] }
  ]
}

This should invalidate https://abc.com/a, https://abc.com/b, https://def.com/foo and https://def.com/bar.
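A small Go sketch of the expansion just described, decoding the payload and joining each origin with each of its paths. The struct and function names are hypothetical, and how an empty or missing `paths` should behave is deliberately left out here because it is still an open point in the thread.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// originEntry and invalidateRequest mirror the payload shape shown above.
type originEntry struct {
	Origin string   `json:"origin"`
	Paths  []string `json:"paths"`
}

type invalidateRequest struct {
	Origins []originEntry `json:"origins"`
}

// expandURLs returns origin+path for every listed pair, in payload order.
func expandURLs(body []byte) ([]string, error) {
	var req invalidateRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return nil, err
	}
	var urls []string
	for _, o := range req.Origins {
		for _, p := range o.Paths {
			urls = append(urls, o.Origin+p)
		}
	}
	return urls, nil
}

func main() {
	body := []byte(`{
	  "origins": [
	    { "origin": "https://abc.com", "paths": ["/a", "/b"] },
	    { "origin": "https://def.com", "paths": ["/foo", "/bar"] }
	  ]
	}`)
	urls, err := expandURLs(body)
	fmt.Println(urls, err)
}
```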

roxk commented 5 years ago

Sure. Does it support invalidating all paths of a given host? What should be the paths? Empty array or null?

roxk commented 5 years ago

Already updated to the latest API for prerender and implemented cache invalidation. Empty, null, or missing paths are interpreted as using the origin as the URL.

louischan-oursky commented 5 years ago

Sure. Does it support invalidating all paths of a given host? What should be the paths? Empty array or null?

In that case the payload should be

{
  "origins": [
    { "origin": "https://abc.com", "paths": ["/"] }
  ]
}

Sorry if I didn't explain the meaning of path: it should be treated as a prefix to match the actual path against. So / will invalidate everything for a given origin. Similarly, /a will invalidate /a, /a/, /a/b, etc. Whether /a should invalidate /apple is a good question. Ideally /apple should not be invalidated by /a. But if that is difficult to implement efficiently, /a can also invalidate /apple.
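The "ideal" variant, where /a does not invalidate /apple, amounts to matching on a path-segment boundary. A minimal Go sketch follows; `matchesPrefix` is a hypothetical helper, and treating an empty prefix like "/" is an assumption borrowed from the earlier "empty means whole origin" behaviour.

```go
package main

import (
	"fmt"
	"strings"
)

// matchesPrefix reports whether path falls under prefix, where a match
// must end on a segment boundary: "/a" covers "/a", "/a/" and "/a/b",
// but not "/apple". An empty prefix is treated like "/" (assumption).
func matchesPrefix(prefix, path string) bool {
	if prefix == "" || prefix == "/" {
		return true
	}
	prefix = strings.TrimSuffix(prefix, "/")
	return path == prefix || strings.HasPrefix(path, prefix+"/")
}

func main() {
	for _, p := range []string{"/a", "/a/", "/a/b", "/apple"} {
		fmt.Println(p, matchesPrefix("/a", p))
	}
	fmt.Println("/anything", matchesPrefix("/", "/anything"))
}
```

This check runs per cached key, so it is cheap for a single comparison; whether it is efficient overall depends on how the cache store lets keys be enumerated, which the thread does not settle.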

roxk commented 5 years ago

I think if we want to invalidate everything for a particular origin, Empty/null/Missing paths is good enough. I will update to also support "paths":["/"]. In particular, the service does not force the array to have only one element. It would merely check the presence of "/" in paths.

One thing to add to the API spec: duplicated entries in paths are simply ignored.

What is skygear's policy for unexpected input which is correctable? e.g. if https://abc.com/ is passed for origin, do I remove the last / or return 400?

louischan-oursky commented 5 years ago

What is skygear's policy for unexpected input which is correctable? e.g. if https://abc.com/ is passed for origin, do I remove the last / or return 400

Be lenient. As long as the url is parsable and is of scheme HTTP, take the hostname only and ignore the path.
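The lenient handling could be sketched in Go as follows. `normalizeOrigin` is a hypothetical helper; accepting https as well as http, and keeping an explicit port as part of the host, are assumptions of this sketch rather than something stated above.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// normalizeOrigin parses raw and keeps only scheme and host, dropping any
// path, query, or fragment. Accepting https in addition to http, and
// keeping a port if present, are assumptions of this sketch.
func normalizeOrigin(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return "", errors.New("origin must use the http or https scheme")
	}
	if u.Host == "" {
		return "", errors.New("origin has no host")
	}
	return u.Scheme + "://" + u.Host, nil
}

func main() {
	for _, raw := range []string{"https://abc.com/", "https://abc.com/a/b?x=1", "ftp://abc.com"} {
		origin, err := normalizeOrigin(raw)
		fmt.Printf("%q -> %q, err=%v\n", raw, origin, err)
	}
}
```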