Make usage of external rendering services configurable

trieloff commented 4 years ago

Copying from https://github.com/adobe/project-helix/issues/457 (thanks @MarquiseRosier for starting the issue, I think we can have this in helix-home, but I cannot transfer the issue.

From the helix-chat discussion about visual validation testing, if we could figure out a neat way to configure helix's rendering chain in a config file based manner; that would be a great step towards global test coverage, so that we can test service changes in a live manner!

Posting discussion from Slack Below:

@tripodsan:

the problem is, that we want to request any site, eg: theblog--adobe.project-helix.pages against such a testing configuration so I think the logic from setting a configuration has to be built into all components. e.g. the pipeline (or in the future, the content-proxy_) needs to invoke a different version of word2md. for helix-pages, this can be solved using strains, maybe. so we deploy a strain for aa specific test configuration. we could create sequences for all services with that strain name. so eg: we create a strain staging-xyz123 and we create sequences for dispatch@staging-xyz123, word2md@staging-xyz123, etc. and all actions that invoke other actions check if the current strain is a stagning strain, and use the rspective version instead.

@kptdobe:

Really complicated yes I do not think those test belong to the pipeline. Also, for simplicity, I would not mix the 2 problems here: create UI tests to validate that we do not break something. I think this belongs to helix-pages, where we already have something "similar" (test the DOM) - https://github.com/adobe/helix-pages/blob/master/.circleci/config.yml#L259-L287 "WIP service" testing: when you create a branch in one of the service (word2md...), the overall integration tests are triggered and run against that "WIP service". We have something "similar" ("on commit" testing) but only for module dependencies of helix-cli (+ the pipeline) - https://github.com/adobe/helix-continuous/blob/master/.circleci/config.yml#L196-L197 I think 2 is much more important than 1. And the existing part of 1 is not yet connected of the existing part of 2: helix-pages tests are not part of the "on commit" testing: adobe/helix-continuous#85 (did not have time to move forward - and still secretly hoping that CircleCi will provide something to help this multiple-repo workflows)

@kptdobe:

I think I am saying the opposite. Making sure we test the services in the whole story is more important than UI comparison.

@tripodsan:

when I talked to @davidnuescheler recently, he underlined again the importance of global test coverage. i.e. that we compare the HTML of all (semi) production sites before (or at least after) any new commit. turning a byte in word2md or dispatch can affect all of the rendering, even with no new pages or pipeline update. if we could hard-wire the service versions we use through out a rendering chain for helix-pages, for example. we could also ensure that every service update needs a helix-pages update before a new versions gets used.

trieloff commented 4 years ago

We are about to re-invent dependency injection.

If we do so, we should make sure we don't create another tangled mess of forced indirections and inscrutable configuration files and stay close the heart of the (web) platform.

At the most basic level, what we need is:

a way for the caller of a service to replace a symbolic name (e.g. embed@v1) with a concrete implementation (e.g. embed@ci2917)
a way to provide a mapping table to the caller of a service either at runtime or at deploy time
a way to propagate the mapping table downstream from the top level service to intermediate services

1 and 3 depend on 2, so we should look at this first.

Assume that (2) is a JSON document with keys being the symbolic name and values being the concrete implementation like:

{
  "embed@v1": "embed@ci2917",
  "static@v2": "static@ci15321"
}

Approach A: fetching the lookup table before making any service calls

The caller would store the lookup JSON at a public location that can be identified by the URL, e.g. in a GitHub repository, as a helix-lock.json (because it looks like a package-lock.json file)

The caller would then include the public URL of helix-lock.json in an HTTP header or Cookie x-helix-lock when making requests to the Helix Site.

helix.vcl would be changed to Vary: x-helix-lock, to avoid getting stale versions.

The called service would include a wrap(helixLock) step that:

fetches the file specified by x-helix-lock
~~stores the value in __ow_dependencies with a structure similar to __ow_headers~~
stores a __ow_lookup function in the params.
adds the Vary: x-helix-lock response header to any response

In the called service, all references to ow.action.invoke and fetch would wrap the function name in __ow_lookup("embed@v1"), which would either be resolved to the specified implementation or return the original value in case nothing has been provided.

Also, all invocations, using either ow.action.invoke or fetch would forward the x-helix-lock header value, solving (3)

This approach would be relatively simple to implement, but quite invasive to the code of the service and bear the risk of forgetting to either lookup the correct value or forward the lookup definition.

Approach B: A dynamic API gateway

The caller would use the lookup JSON to create a new, disposable API gateway, using a new micro service that:

calculates the hash of all keys and values in the mapping table, e.g. 34da4c9
in the Fastly service config for *.helix-services.io creates a lookup key 34da4c9:embed@v1 and value embed@ci2917
returns 34da4c9 .helix-services.io as a result

The caller would then include 34da4c9 .helix-services.io in an HTTP header or Cookie x-api-gateway when making requests to the Helix Site.

helix.vcl would be changed to Vary: x-api-gateway, to avoid getting stale versions.

The called service would include a wrap(apiGateway) step that:

stores the value 34da4c9 .helix-services.io in __ow_api_gateway
adds the Vary: x-api-gateway response header to any response

In the called service, all references to ow.action.invoke and fetch would replace the host name with __ow_api_gateway.

The Fastly service at 34da4c9 .helix-services.io would perform a lookup of the function name in the edge dictionary and adjust the URL accordingly before forwarding the request to adobeioruntime.net

Also, all invocations, using either ow.action.invoke or fetch would forward the x-api-gateway header value, solving (3)

This approach would be relatively a bit harder to implement, marginally less invasive to the code of the service and still bear the risk of forgetting to either replace the hostname or forward the header.

kptdobe commented 4 years ago

As expected, this looks really complicated :)

First, what are concretely the non-hypothetical use cases we want to address here ?

I see only one (maybe there are more, just to be on the same page): be able to test the full helix-pages chain of service calls where one of the service is a "custom one" / not the production one, i.e the one with some new code in a branch of that service.

Second, for that use case, I think we only need the capability to specify one custom service in the full chain. The case we need more than one custom service is an edge case: this is usually when we break backward compatibility. This happens in our current setup but really rarely and we can handle it manually. This does not fundamentally change what you describe since we still need provide the info across the whole chain (a single entry or a list - same complexity).

Third, this looks really over-engineered. If this is the single use case above, why we cannot just specify one header in the initial request (could be protected like the x-debug header) with the required info (could be the json of your helix-lock) ? We only need to make sure this header is propagated in the whole chain and that services consumes this header accordingly, i.e. they read the header before calling another service to know the version they have to use for that service ?

trieloff commented 4 years ago

Don't ask a German if you can't handle a bit of over-engineering 😉. I like the idea of just passing it as a header, and we could just use URL encoded key-value pairs instead of JSON.

x-helix-dependencies: serviceid=…&embed@v1=embed@ci2917&static@v2=static@ci15321

Fastly takes care of validating the serviceid, but we would still need the wrappers and diligence on the side of the calling services.

tripodsan commented 4 years ago

I like the over-engineered version. and I disagree that it is only 1 service to be tested. there might be a time, when you have several updates to services that implement a new feature.

for example for the new content-proxy:

change pipeline to use content-proxy
change dispatch to remove fallbacks (now handled in content-proxy)
change embed to use the content-proxy

(arguably, this can be done iteratively :-).

having a stable config file (helix-lock.json) is also very cool, because it can be part of the pull request and should serve a source for the x-helix-lock header (I don't like x-helix-dependencies because it is more than just that).

as for the approaches:

A As long as we don't try to patch the ow/fetch/request API, but add explicit calls to __ow_lookup(), or wrap the APIs explicitly, I'm fine.

so either:

const res = await ow.actions.invoke(__ow_lookup(name));
.
.
const ret = await fetch(__ow_lookup(url), ... )
.
.

or

const ow = lookup_wrap(openwhisk());
.
.
.
const fetch = lookup_wrap(require('helix-fetch'));

B It doesn't solve the problem that you need to add extra code for every service and the complexity is way bigger, as it needs extra magic in fastly.

kptdobe commented 4 years ago

having a stable config file (helix-lock.json) is also very cool, because it can be part of the pull request

Which PR ? The one in pipeline, embed and / or dispatch ? How does the lookup resolution look like ? Where is stored the helix-lock.json ? For the single service test use case, does it mean we should "patch and commit" this file on the fly before sending the request ? To support the case "one change = 3 PR in 3 different repos" and run the tests accordingly, we have to change more than this.

trieloff commented 4 years ago

const ow = lookup_wrap(openwhisk());
.
.
.
const fetch = lookup_wrap(require('helix-fetch'));

I like that.

The header-based approach would still work with multiple dependencies.

tripodsan commented 4 years ago

Which PR ? The one in pipeline, embed and / or dispatch ? How does the lookup resolution look like ? Where is stored the helix-lock.json ?

choose one. :-) in this scenario, one PR can be selected as 'main' one. as soon as the PR has a helix-lock.json, the ci will use those to configure the smoke-tests. otherwise it will just use the current @ciXYZ override for the current service.

tripodsan commented 4 years ago

or maybe simpler, have a special comment in the PR:

# CI HEADERS OVERWRITE
x-helix-lock: serviceid=…&embed@v1=embed@ci2917&static@v2=static@ci15321
x-my-test-header: toby_was_here

and the CI looks at the PR comments and automatically configures the smoke-test!

MarquiseRosier commented 4 years ago

I'm willing to start working on this once we all agree on a design! :)

@tripodsan how do you think setting the x-helix-lock in a PR comment should propagate to the wrapper?

I guess CI environment vars?

tripodsan commented 4 years ago

@tripodsan how do you think setting the x-helix-lock in a PR comment should propagate to the wrapper?

fetching the PR and scanning for the comments is probably not so easy just in bash, and we probably need to add a nodejs script in helix-testutils.

but basically you need $CIRCLE_PULL_REQUEST and a $GITHUB_TOKEN (which should already be in the environment), then fetch the PR comment and look for the comment.

and then you need to pass the x-helix-lock parameter to the helix-pages smoke test somehow... @kptdobe know probably how to do this.

see https://circleci.com/docs/2.0/env-vars/#built-in-environment-variables

kptdobe commented 4 years ago

Here is the code that start a remote workflow passing parameters and wait for its execution: https://github.com/adobe/helix-continuous/blob/master/.circleci/orbs/helix-smoke-tests/orb.yml#L61-L117 On the other side, helix-pages would receive the params and handle it accordingly: https://github.com/adobe/helix-pages/blob/master/.circleci/config.yml#L72-L88 - here we compute the DOMAIN and the SERVICE - another job might get the x-helix-lock and make it available to the test scripts or make sure it is leveraged (like here https://github.com/adobe/helix-pages/blob/master/.circleci/config.yml#L182)

MarquiseRosier commented 4 years ago

^^ I don't think I should take this; the task seems a bit outside of my experience and would take a while longer than someone who knows this stuff

tripodsan commented 4 years ago

^^ I don't think I should take this; the task seems a bit outside of my experience and would take a while longer than someone who knows this stuff

I think when we break up the task in smaller chunks, it gets easier. Also, we can split the work. I see the following building blocks (assuming we follow @trieloff 's suggestion A):

wrapper

create the wrapper for openwhisk and fetch that takes a a x-ow-version-lock (more generic name?) string and tweaks the invocations to the actions and http requests.

I would add this to openwhisk-action-utils as a generic utility. this might be useful outside helix. this is a rather self containing task, which shouldn't be too difficult.

updating all actions

once we have the wrapper, we need to update all the actions. this is not really difficult, but depending on the action (eg pipeline) the change might need some refactoring. afaics, we don't have many delegating actions: pipeline, dispatch, (content-proxy)

adding support in fastly

some of the actions (like dispatch) are invoked from fastly. so the VCL needs to be adjusted to understand the x-ow-version-lock header and invoke dispatch differently. I guess, this is rather difficult, and needs an VLC expert :-)

retrieving the x-ow-version-lock information

There are many examples of how to process pull requests. to implementing the PR scanner is rather simple - agreeing of a comment format might be more difficult :-) I would place this code in helix-test-utils. then circleci can npm install it, if the current project not already contains it.

updating the smoke tests

the smoke tests helix-pages need to be adjusted, so that they can receive the x-ow-version-lock parameter and use it when requesting the pages. this shouldn't be too difficult either.

updating the CI to delegate the `x-ow-version-lock` parameter

the smoke-test invoker need to be extended by:

extracting the x-ow-version-lock header from github.
passing the header along to the target tests.

@MarquiseRosier WDYT?

MarquiseRosier commented 4 years ago

@tripodsan well when you put it that way lol; thanks for breaking it down; yes, i'm glad, I definitely want to help on this; it's a really cool idea 💯:)

tripodsan commented 4 years ago

@MarquiseRosier maybe a good way to start is to create issue in the respective repositories for the steps I mentioned above.

tripodsan commented 4 years ago

I spoke with @davidnuescheler today about how customers would prepare for breaking changes. Ideally, we have a way for them to run against a helix-pages that has the currently planned breaking changes enabled.

Let's say we have a september-changes branch in helix-pages. which contains those changes in the helix-pages code that might break compatibility (js/css). We could also have the aforementioned helix-lock.json that would record the service versions that influence page rendering that would change, eg: gdocs2md, word2md, dispatch, data-embed.

Then we would need a easy mechanism so that the customers can:

run the hlx cli for local development against the changes
run their staging branch of the site against the changes. maybe having the same branch name.

tripodsan commented 4 years ago

I started with https://github.com/adobe/openwhisk-action-utils/issues/109

tripodsan commented 4 years ago

also see https://github.com/adobe/helix-pages/pull/459 as a possible approach to stage a set of changes.

tripodsan commented 4 years ago

By using strains to select the staging branch, it can more or less easily be solved with the existing mechanisms. however, because the strains configuration needs to be available during publish, it is not possible (convenient) to use a separate branch. but a separate branch is needed if a different pipeline is to be used.

Just to re-iterate: the goal would be, that a customer can run whatever-staging-branch--repo--owner.hlx.page to get a version of helix with the selected services. or: run repo--owner.hlx.page with a strain cookie, e.g. for selecting a different setup for smoke tests.

but maybe, the strains are not suitable for this task?

@trieloff WDYT?

trieloff commented 4 years ago

I'm having some trouble parsing whatever-staging-branch--repo--owner.hlx.page

whatever-staging-branch would be a branch name in helix-pages (which is what I think you mean) or in the customer's repo (which is what it is).

If we want users to select the strain using the hlx.page subdomain, then we need something unambiguous like: strain---ref--repo--owner or strain---repo-owner.

Using strains to select/pin service versions sounds good. We'd need to do these things:

beef up helix-publish so that it can set headers inside the strain selection code (i.e. https://github.com/adobe/helix-pages/pull/459/files#diff-94e7dff6fa4a14a5224018c57a5723feR25-R28)
find a good way of getting the new strain from a helix-pages branch into master. The stuff that we built with @filmaj and @stevengill for the dev site solves exactly that problem, but I need to look up the details

tripodsan commented 4 years ago

yes. so the idea of @davidnuescheler was, that we stage upcoming breaking changes, e.g. one per month. so we would have a september-changes set of altered services + pipeline + pages code.

we would register those changes via a branch in helix-pages, named: september-changes, and setup the respective strains. now a customer can test how the breaking changes will affect his code, by creating a branch september-changes in his repo, and surf the site via that branch. he can then also make adjustments to his client-side code, to fix potential bugs related to the new changes.

tripodsan commented 3 years ago

more or less in place, using the version-set framework: https://github.com/adobe/helix-home/blob/main/docs/version-locking.md

examples:

word2md invokes the smoke tests with a version lock on the deployed CI version.

adobe / helix-home