envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Envoy WASM extensions in the present and its future (Proxy-Wasm) #35420

Open marc-barry opened 4 months ago

marc-barry commented 4 months ago

Title: Envoy WASM extensions in the present and its future (Proxy-Wasm)

Description:

Envoy currently supports WASM extensions via the WASM filter. I am aware of the following warning:

The Wasm filter is experimental and is currently under active development. Capabilities will be expanded over time and the configuration structures are likely to change.

The documentation for the feature is fairly terse, and I largely relied on articles like https://tetrate.io/blog/wasm-modules-and-envoy-extensibility-explained-part-1/. There was also a Google Document proposal that I read a while back, but I'm now unable to find it. We have hit a number of issues with the documentation, the extension development process, and getting a clear understanding of which ABI spec is adhered to and what the plans for it are.

There are many more references across the Internet to WASM extensions for things that use Envoy under the hood or that have also decided to adopt it as their ABI. But as I mentioned above, the spec isn't really even defined in the public space, and all work on it appears abandoned or stalled.

What I'm trying to determine is the following:

Relevant Links:

PiotrSikora commented 1 month ago

Could you name some situations in which people need to write trusted extensions? Usually, we trust the developer but not the plugin itself. For example, most of the plugins we run are developed by our teammates. So the technology doesn't need to be fully sandboxed - ensuring our teammate is sane (and code review) is enough. Unless we are lending the Envoy cluster to run the customer's plugins... (maybe that is the Google's use case?)

To add to @martijneken's answer, the isolation provided by Wasm is IMHO quite important and useful even when dealing with plugins authored by trusted developers, since it limits the blast radius in case of non-malicious bugs that could otherwise crash the proxy.

Notably, the rate-limited restart logic is not currently implemented in Envoy, so a buggy plugin might still render the proxy unhealthy, but that's not a limitation of Proxy-Wasm or Wasm in general.
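
To make "rate-limited restart logic" concrete, here is a minimal sketch of one possible policy. It is not Envoy code (as noted above, Envoy does not implement this yet); the `RestartLimiter` name, its parameters, and the sliding-window approach are all assumptions made for illustration.

```python
from collections import deque


class RestartLimiter:
    """Hypothetical policy: allow at most `max_restarts` VM restarts per `window_s` seconds."""

    def __init__(self, max_restarts: int, window_s: float):
        self.max_restarts = max_restarts
        self.window_s = window_s
        self._restarts = deque()  # timestamps of restarts still inside the window

    def try_restart(self, now: float) -> bool:
        # Forget restarts that have fallen out of the sliding window.
        while self._restarts and now - self._restarts[0] >= self.window_s:
            self._restarts.popleft()
        if len(self._restarts) >= self.max_restarts:
            # Budget exhausted: leave the plugin in its failed state for now.
            return False
        self._restarts.append(now)
        return True


# A plugin that crashes at t = 0 s, 1 s, 2 s and 61.5 s, with a budget of 2 restarts per minute.
limiter = RestartLimiter(max_restarts=2, window_s=60.0)
results = [limiter.try_restart(t) for t in (0.0, 1.0, 2.0, 61.5)]
print(results)
```

With a policy like this, a plugin that crashes in a tight loop is restarted only a bounded number of times per window, so the crash loop cannot consume the proxy's resources; what the proxy does with traffic while the plugin stays failed (fail open or fail closed) is a separate decision.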

jcchavezs commented 1 month ago

Thanks @martijneken for holding your tongue, it is better to stay quiet when in lack of information.

I've been holding my tongue but this is the opposite of reality. AIUI http-wasm.io was developed by a single ex-Tetrate employee, but it is not maintained and I don't know of any users. It has a fancy website, but quoting someone more familiar than me: "It is safe to assume it’s a dead project".

http-wasm was developed by a group of Tetrate employees (of course there was a leader who contributed the most) as a replacement for proxy-wasm, because we were tired of the gatekeeping on the project, the poor leadership, and the lack of connection with reality and use cases.

http-wasm landed in dapr (see https://docs.dapr.io/reference/components-reference/supported-middleware/middleware-wasm/) and lately traefik went for it as the way to leverage wasm extensions inside the proxy (https://traefik.io/blog/traefik-3-deep-dive-into-wasm-support-with-coraza-waf-plugin/).

The project is now building a community, and it is a fact that we are mainly focused on Go, because wazero had http-wasm in mind as a use case. There were some efforts to port http-wasm to Envoy, but I'm not sure what their status is.

Is http-wasm the replacement for proxy-wasm? I don't think so. http-wasm was designed from the lessons learnt from proxy-wasm, and we specifically tried to keep a narrow API focused on the request/response case. Wazero allows you to combine http-wasm with other ABIs to leverage stuff like distributed tracing or socket connections.

I hope there is movement, that proxy-wasm gets proper leadership, and that it becomes maintainable. I don't know the status of the project right now, but as someone who worked full time on building wasm plugins, I hope it gets into good shape.

martijneken commented 1 month ago

Thanks for the info @jcchavezs. I'm glad to be wrong about http-wasm, more wasm adoption is good for everyone. My reaction was much more about the claim that ProxyWasm is "abandoned" -- it absolutely is not. We are actively working on it, and in response to the commentary on this thread, we have a draft roadmap nearing publication and will be setting up community meetings to solicit input.

wbpcode commented 1 month ago

There is no doubt that wasm extensions are still necessary even once we have dynamic modules done.

Dynamic modules provide a great way to implement a dynamic extension with native performance. But it's hard to feel comfortable running third-party dynamic modules on Envoy in a public product.

So the two can basically be used for different scenarios.

But one of the core goals of proxy-wasm is to be proxy-independent. That means it's hard to support Envoy-specific features or optimizations (I think multiple people here have noticed this point?). And it also means proxy-wasm may have to keep a limited feature set to stay compatible with different proxies? (Constructing a perfect abstraction that adapts to different proxies is much more complex work.)

Rather than the performance problem (actually, in most cases, if users/developers choose a wasm extension, I think performance is not their first goal), the more painful thing is that it's still hard to develop a complex extension like external authz with a request body. (Note that 5 years have passed since proxy-wasm was created.)

We built our product on Envoy. From my personal perspective, I would like to fork proxy-wasm and develop it in Envoy's way: fix the issues, resolve the problems seen in practice, and use it more widely. Then we can discuss the standard or spec.

mpwarres commented 1 month ago

WRT @wbpcode's comment:

We built our product on Envoy. From my personal perspective, I would like to fork proxy-wasm and develop it in Envoy's way: fix the issues, resolve the problems seen in practice, and use it more widely. Then we can discuss the standard or spec.

I think there are two related but separate considerations: (1) ease/speed of updating Envoy WasmFilter with or without having to manage the external dependency on proxy-wasm-cpp-host, and (2) ability to add Envoy-specific functionality. For (2), there is already precedent and mechanism in source/extensions/common/wasm/ext for adding Envoy-specific hostcalls. That can also be a good place to "try out" a more general hostcall before adding to the standard proxy-wasm ABI.

For (1), I understand the appeal but am worried about divergence from other host implementations, and also (in the reverse direction) missing out on any bugfixes in Envoy that could also benefit other proxy-wasm-cpp-host users. I think that in practice, most Envoy-specific fixes tend to be in the Envoy-side WasmFilter anyways--if there's a need to change proxy-wasm-cpp-host code, chances are that it's a more general issue.

wbpcode commented 1 month ago

@mpwarres thanks for the response.

I think that even in the core ABI, we may also expect some Envoy-specific things like stop iteration. (I think this is a common thing, but it seems proxy-wasm doesn't think so.)

And, considering the existence of Envoy-specific features, Envoy still needs to fork the language SDKs to provide these Envoy-specific features to end developers. Very few people care about the spec; most extension developers only care about the development framework or SDK. They develop their extensions based on the SDK rather than the spec. A well-designed SDK, with tools and docs for it, is much more important than the spec for end extension developers.

Also, Envoy needs to provide related docs based on the Envoy-specific SDK.

If we want Envoy's wasm support to be production ready, all of this work is unavoidable.

PiotrSikora commented 1 month ago

That means it's hard to support Envoy-specific features or optimizations (I think multiple people here have noticed this point?).

Do you have anything specific in mind?

I think that even in the core ABI, we may also expect some Envoy-specific things like stop iteration. (I think this is a common thing, but it seems proxy-wasm doesn't think so.)

There is nothing Envoy-specific about buffering requests, and the support for StopIteration was removed because Envoy was crashing when using it (see: https://github.com/proxy-wasm/proxy-wasm-cpp-host/pull/95#issuecomment-725788291), not because of other proxies.

I believe I've mentioned this elsewhere, but we're working on adding support for buffering complete requests in the upcoming ABI update.

And, considering the existence of Envoy-specific features, Envoy still needs to fork the language SDKs to provide these Envoy-specific features to end developers.

Proxy-Wasm supports custom hostcalls and callbacks (e.g. https://github.com/envoyproxy/envoy/pull/32127) and the existing SDKs support calling those without the need to fork them.

Very few people care about the spec; most extension developers only care about the development framework or SDK. They develop their extensions based on the SDK rather than the spec. A well-designed SDK, with tools and docs for it, is much more important than the spec for end extension developers.

I 100% agree with you, but there is a vocal group that kept blaming all the issues with Proxy-Wasm on the missing specification, so unfortunately that took the priority...

johnlanni commented 1 month ago

There is nothing Envoy-specific about buffering requests, and the support for StopIteration was removed because Envoy was crashing when using it (see: https://github.com/proxy-wasm/proxy-wasm-cpp-host/pull/95#issuecomment-725788291), not because of other proxies.

@PiotrSikora I believe that instead of removing support for StopIteration due to this crash, the issue can be addressed through checks within the SDK or on the Host side to prevent developers from writing erroneous code that leads to Envoy crashes. StopIteration plays a significant role, and its removal would hinder the implementation of many functionalities. Consequently, we had no choice but to fork the repository in Higress, adjusting the ABI to support returning more value types.

wbpcode commented 1 month ago

There is nothing Envoy-specific about buffering requests, and the support for StopIteration was removed because Envoy was crashing when using it (see: proxy-wasm/proxy-wasm-cpp-host#95 (comment)), not because of other proxies.

@PiotrSikora I believe that instead of removing support for StopIteration due to this crash, the issue can be addressed through checks within the SDK or on the Host side to prevent developers from writing erroneous code that leads to Envoy crashes. StopIteration plays a significant role, and its removal would hinder the implementation of many functionalities. Consequently, we had no choice but to fork the repository in Higress, adjusting the ABI to support returning more value types.

This is why I think we could fork it in Envoy directly and iterate more quickly. Proxy-wasm has already been forked in some ways, for example by Higress from Alibaba Cloud, which I think has a big influence on the adoption of wasm extensions in the Chinese cloud market.

johnlanni commented 1 month ago

Yes, the Higress community has 40+ wasm plugins, most of which are compatible with official Envoy, but over 10 are not due to the use of the StopIteration feature (like the AI Proxy plugin). Having these plugins locked to Higress is not our intention; Higress's focus is on extending Envoy based on Wasm, and we hope for more non-Higress Envoy vendors to join in building these plugins.

wbpcode commented 1 month ago

There is nothing Envoy-specific about buffering requests, and the support for StopIteration was removed because Envoy was crashing when using it (see: https://github.com/proxy-wasm/proxy-wasm-cpp-host/pull/95#issuecomment-725788291), not because of other proxies.

I believe I've mentioned this elsewhere, but we're working on adding support for buffering complete requests in the upcoming ABI update.

I think we should treat it as a bug and fix it. There is no way to completely forbid an extension from doing some harmful operations.

Do you have anything specific in mind?

filter state, dynamic metadata, same route cache control with Envoy, modification of route, route specific configuration, etc.

I believe I've mentioned this elsewhere, but we're working on adding support for buffering complete requests in the upcoming ABI update.

I personally think supporting stop iteration and creating a beginner-friendly wrapper in the SDK would be the better way to do this.

Proxy-Wasm supports custom hostcalls and callbacks (e.g. https://github.com/envoyproxy/envoy/pull/32127) and the existing SDKs support calling those without the need to fork them.

I know we can call it with proxy_call_foreign_function.

But I am not sure it's a good choice to make end developers call CallForeignFunction and handle the parameters' serialization themselves. For example:

CallForeignFunction("set_envoy_filter_state", <serialized_proto>);

This just sacrifices the experience of the end extension developers. I think the requirements and experience of the end extension developers are the most important. They create the actual value of wasm extensions. We are just a provider of tools. If we cannot provide a good experience or cannot address their requirements, then they will choose other tools, like Lua, dynamic modules, third-party forks, etc.

wbpcode commented 1 month ago

Yes, the Higress community has 40+ wasm plugins, most of which are compatible with official Envoy, but over 10 are not due to the use of the StopIteration feature (like the AI Proxy plugin). Having these plugins locked to Higress is not our intention; Higress's focus is on extending Envoy based on Wasm, and we hope for more non-Higress Envoy vendors to join in building these plugins.

I think the current route cache control and body modification are also annoyances for you 🤣

johnlanni commented 1 month ago

@wbpcode Yes, to achieve this, we also made some minor hacks to Envoy, but they are not ABI-incompatible changes. I am willing to create an issue later to outline the capabilities we have implemented that are not currently satisfied by the official repo, so everyone can discuss which ones are worth being officially implemented.

wbpcode commented 1 month ago

Theoretically, get/set property plus foreign functions could do almost everything without needing to change the ABI functions' signatures. But the key isn't only the signatures; it is the feature set that is exposed to end developers.

When we fork and hack the SDK and host (Envoy) to provide some specific features, the intermediate bridge (proxy-wasm host lib) and the ABI no longer represent the actual feature set, so compatibility has actually been broken.

PiotrSikora commented 1 month ago

Yes, the Higress community has 40+ wasm plugins,

That's awesome!

I personally think supporting stop iteration and creating a beginner-friendly wrapper in the SDK would be the better way to do this.

But you need a new ABI version for that, otherwise you'll have new plugins returning StopIteration that are deployed on older versions of Envoy/Istio/X, where they behave differently than expected.
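
The version-gating argument above can be sketched as a toy model. Everything here is hypothetical: the ABI names `abi_old` and `abi_new`, the numeric return values, and the `load_plugin` / `interpret_return` helpers are invented for illustration and do not correspond to real Proxy-Wasm identifiers.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = 0
    PAUSE = 1
    STOP_ITERATION = 2  # hypothetical: only defined by the newer ABI below


# Hypothetical ABI names and return-value tables, invented for this sketch.
RETURN_VALUES = {
    "abi_old": {0: Action.CONTINUE, 1: Action.PAUSE},
    "abi_new": {0: Action.CONTINUE, 1: Action.PAUSE, 2: Action.STOP_ITERATION},
}


def load_plugin(host_abis, plugin_abi):
    """Refuse, at load time, a plugin built against an ABI the host lacks."""
    if plugin_abi not in host_abis:
        raise RuntimeError(f"host does not implement {plugin_abi}")
    return plugin_abi


def interpret_return(abi, raw):
    """Decode a callback's return value using the table of the plugin's ABI."""
    table = RETURN_VALUES[abi]
    if raw not in table:
        raise ValueError(f"return value {raw} is undefined in {abi}")
    return table[raw]


old_host = {"abi_old"}
new_host = {"abi_old", "abi_new"}

# A new plugin on a new host: the new return value is understood.
abi = load_plugin(new_host, "abi_new")
decoded = interpret_return(abi, 2)
print(decoded)

# The same plugin on an older host is rejected up front instead of misbehaving at runtime.
try:
    load_plugin(old_host, "abi_new")
    rejected = False
except RuntimeError:
    rejected = True
print(rejected)
```

In this model, the version declared by the plugin is what lets an older host fail loudly at load time; without it, the older host would receive a return value it has no defined meaning for.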

I know we can call it with proxy_call_foreign_function.

But I am not sure it's a good choice to make end developers call CallForeignFunction and handle the parameters' serialization themselves. For example:

CallForeignFunction("set_envoy_filter_state", <serialized_proto>);

This just sacrifices the experience of the end extension developers.

Right, but you can easily add first-class wrappers to the SDKs for those, so that end-users won't be able to tell whether it's a standardized ABI call or a foreign function, and once the interface is proven, then it can be included in the next ABI version.

Fast iteration and prototyping is exactly what this interface was designed for, and it doesn't require forking anything.
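
As a rough illustration of such a wrapper, here is a self-contained sketch. `FakeHost` is a stand-in that only models foreign-function dispatch, and the SDK-level helper `set_filter_state` is hypothetical. The real hostcall takes a serialized protobuf; JSON and the `path` / `value` field names are used here only as assumptions to keep the sketch free of dependencies.

```python
import json
from typing import Callable, Dict


class FakeHost:
    """Stand-in for a Proxy-Wasm host; models only foreign-function dispatch."""

    def __init__(self):
        self.filter_state: Dict[str, bytes] = {}
        self._foreign: Dict[str, Callable[[bytes], bytes]] = {
            "set_envoy_filter_state": self._set_filter_state,
        }

    def _set_filter_state(self, payload: bytes) -> bytes:
        # Assumption: JSON stands in for the serialized proto arguments.
        args = json.loads(payload)
        self.filter_state[args["path"]] = args["value"].encode()
        return b""

    def call_foreign_function(self, name: str, payload: bytes) -> bytes:
        if name not in self._foreign:
            raise LookupError(f"foreign function not found: {name}")
        return self._foreign[name](payload)


def set_filter_state(host: FakeHost, path: str, value: str) -> None:
    """Hypothetical SDK wrapper: hides the function name and the serialization."""
    payload = json.dumps({"path": path, "value": value}).encode()
    host.call_foreign_function("set_envoy_filter_state", payload)


host = FakeHost()
# The plugin author sees an ordinary typed call, not a foreign-function invocation.
set_filter_state(host, "my.plugin.user", "alice")
print(host.filter_state)
```

From the plugin author's side, only `set_filter_state` is visible, so the SDK could later switch it to a standardized ABI call without changing plugin code.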

PiotrSikora commented 1 month ago

filter state

This is already supported as a custom set_envoy_filter_state hostcall.

dynamic metadata

Is this also used in presence of filter state? I thought only one of those was supposed to be used going forward? See: https://github.com/envoyproxy/envoy/issues/4929

same route cache control with Envoy

This indeed is Envoy-specific, but we already have a dedicated issue for it: https://github.com/proxy-wasm/proxy-wasm-cpp-host/issues/421

modification of route

Do you mean selection of the upstream server/cluster or something else?

route specific configuration

Do you mean per-route plugin configuration? Isn't this supported in Envoy by composite filter? In any case, Proxy-Wasm already supports running many instances of the same plugin with different configurations inside the WasmVM, so this seems more of a host implementation issue.

wbpcode commented 1 month ago

This is already supported as a custom set_envoy_filter_state hostcall.

Yeah. But we still need to wrap it to avoid exposing it to end developers in its original form.

Is this also used in presence of filter state? I thought only one of those was supposed to be used going forward? See: https://github.com/envoyproxy/envoy/issues/4929

AFAIK, some auth filters will prefer the dynamic metadata to store json-like data.

Do you mean selection of the upstream server/cluster or something else?

I mean allowing the filter to change the upstream cluster by rewriting the route.

Do you mean per-route plugin configuration?

Yeah.

Isn't this supported in Envoy by composite filter?

The composite filter actually makes things more complex and harder to use for users.

In any case, Proxy-Wasm already supports running many instances of the same plugin with different configurations inside the WasmVM, so this seems more of a host implementation issue.

Considering the overhead of a WasmVM, it's hard to say that running lots of different VM instances is a good solution.

But anyway, yeah, maybe we also need to rethink this problem on the host side. Lots of problems are hard to resolve from a single side.

wbpcode commented 1 month ago

Right, but you can easily add first-class wrappers to the SDKs for those, so that end-users won't be able to tell whether it's a standardized ABI call or a foreign function, and once the interface is proven, then it can be included in the next ABI version. Fast iteration and prototyping is exactly what this interface was designed for, and it doesn't require forking anything.

I don't get it. I think the SDK maintained by proxy-wasm should also be proxy-independent? Otherwise it just confuses end developers: they use the SDK to develop an extension on their platform, then find that lots of interfaces actually don't work?

botengyao commented 1 month ago

Yes, the Higress community has 40+ wasm plugins, most of which are compatible with official Envoy, but over 10 are not due to the use of the StopIteration feature (like the AI Proxy plugin). Having these plugins locked to Higress is not our intention; Higress's focus is on extending Envoy based on Wasm, and we hope for more non-Higress Envoy vendors to join in building these plugins.

@johnlanni, off the wasm topic: I noticed Alibaba Higress is using Envoy, which is great! Does Alibaba plan to be on the security distributor list to receive and report early CVE notifications under embargo?

johnlanni commented 1 month ago

@botengyao Absolutely, and it appears we meet the requirements to join the distributor list. I have already sent an email.

PiotrSikora commented 1 month ago

The composite filter actually makes things more complex and harder to use for users.

Agreed, but when we were adding Wasm to Envoy, we were told not to handle this ourselves, and instead use the (still under development at the time) composite filters, which were supposed to address this problem for all Envoy extensions.

In any case, this should be already solved in Proxy-Wasm (but perhaps it needs to be glued together with the per-route configuration in Envoy), since you can have multiple plugin instances with different configurations, so route configuration A and route configuration B can be instantiated as plugin with configuration A and plugin with configuration B... unless I'm missing something?

Considering the overhead of a WasmVM, it's hard to say that running lots of different VM instances is a good solution.

But anyway, yeah, maybe we also need to rethink this problem on the host side. Lots of problems are hard to resolve from a single side.

Proxy-Wasm already supports running multiple instances of the same plugin with different configurations inside the same WasmVM (i.e. many-to-one), so there is no extra overhead here.

This is how the configuration reloads are handled in Envoy, and how the same plugin is used with different configurations in different filter chains (assuming the same vm_id).
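
The many-to-one relationship can be modeled with a small sketch. This is not Envoy's implementation: the `VmRegistry` and `WasmVm` names are hypothetical, and keying a VM by `(vm_id, code_hash)` is a simplifying assumption for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class WasmVm:
    vm_id: str
    code_hash: str
    # One entry per plugin instance living inside this VM: plugin name -> configuration.
    plugins: Dict[str, dict] = field(default_factory=dict)


class VmRegistry:
    """Hypothetical host-side registry that reuses a VM for matching (vm_id, code_hash)."""

    def __init__(self):
        self._vms: Dict[Tuple[str, str], WasmVm] = {}

    def get_or_create_plugin(self, vm_id: str, code_hash: str,
                             plugin_name: str, config: dict) -> WasmVm:
        key = (vm_id, code_hash)
        if key not in self._vms:
            self._vms[key] = WasmVm(vm_id, code_hash)  # the only expensive step
        vm = self._vms[key]
        vm.plugins[plugin_name] = config  # cheap: a new plugin context in an existing VM
        return vm

    def vm_count(self) -> int:
        return len(self._vms)


registry = VmRegistry()
# The same module is used for route A and route B with different configurations.
vm_a = registry.get_or_create_plugin("shared", "abc123", "route_a", {"limit": 10})
vm_b = registry.get_or_create_plugin("shared", "abc123", "route_b", {"limit": 50})
print(registry.vm_count(), sorted(vm_a.plugins))
```

In this model the second configuration adds a plugin instance to the existing VM rather than instantiating a new one, which is why per-configuration instances need not multiply the VM overhead.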

I don't get it. I think the SDK maintained by proxy-wasm should also be proxy-independent? Otherwise it just confuses end developers: they use the SDK to develop an extension on their platform, then find that lots of interfaces actually don't work?

Yes and no. I want to be as conservative as possible, but at the same time we should prevent unnecessary fragmentation of the Proxy-Wasm ecosystem and avoid splitting the already limited engineering resources.

Based on the discussion in this thread, if we ignore the generic features that will be added in the upcoming ABI update (e.g. complete request buffering) and things that should be handled in Envoy (route cache control and per-route configuration), then it seems that there are very few Envoy-specific features (filter state & dynamic metadata).

As such, it's probably more productive to add clearly named wrappers to the existing SDKs for those 2 or 4 (for both getters and setters) custom hostcalls, than to fork away, which might prevent plugins written using those alternative SDKs from running on non-Envoy Proxy-Wasm hosts, even when they don't require any Envoy-specific features.

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.