envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Proxy-WASM: Controlling contexts in other threads #17576

Open rahulanand16nov opened 3 years ago

rahulanand16nov commented 3 years ago

Hi there, maintainers/authors!

I'm working on building a WASM filter for my GSoC'21 project under the mentorship of @unleashed, which uses shared data as a cache. Whenever a new request arrives whose info is not present in the cache, the proxy makes a callout to fetch the information from an external service. Given the environment in which this filter will be used, there can be multiple concurrent identical requests (i.e. targeting a single key in the shared data). Right now, there is no way to avoid multiple callouts from different threads at the same time. We thought of adding an identifier in the cache to signal identical requests on other threads about the outbound callout, but there is no ability to pause and then resume after the response arrives (a rough sketch of this flow is included at the end of this comment).

Is there a way to achieve this behavior with the current state of proxy-wasm, or does it require changes in the host?

Pictorial presentation:

                                 callout
                                    ▲           response
                      Cache miss    │              │
Thread(T1) R1(Key=1)────────────────┘              ▼ ──── on_http_call_response called

Thread(T2) R2(Key=1)─────────── (paused).........   ───── **No Event Present**

Thread(T3) R3(Key=1)─────────── (paused).........   ───── **No Event Present**

SDK: Rust v0.1.4

cc @PiotrSikora @mathetake @NomadXD
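
To make the setup concrete, here is a minimal sketch of the flow above built on the proxy-wasm Rust SDK (0.1.x-style signatures, which may differ in newer releases). The header name `x-cache-key`, the upstream cluster `auth_service`, and the `/lookup` path are placeholders, not part of the actual filter:

```rust
use std::time::Duration;

use proxy_wasm::traits::{Context, HttpContext};
use proxy_wasm::types::{Action, LogLevel};

#[no_mangle]
pub fn _start() {
    proxy_wasm::set_log_level(LogLevel::Info);
    proxy_wasm::set_http_context(|_, _| -> Box<dyn HttpContext> {
        Box::new(CacheFilter { key: String::new() })
    });
}

struct CacheFilter {
    key: String,
}

impl HttpContext for CacheFilter {
    fn on_http_request_headers(&mut self, _num_headers: usize) -> Action {
        // "x-cache-key" stands in for however the real filter derives its key.
        self.key = self
            .get_http_request_header("x-cache-key")
            .unwrap_or_default();

        // Shared data is visible to every worker thread, so a hit is easy.
        if let (Some(_entry), _cas) = self.get_shared_data(&self.key) {
            return Action::Continue;
        }

        // Cache miss: every worker thread that reaches this point issues its own
        // callout for the same key, because there is no way to wait for a callout
        // that was started on another thread.
        self.dispatch_http_call(
            "auth_service", // hypothetical upstream cluster
            vec![(":method", "GET"), (":path", "/lookup"), (":authority", "auth")],
            None,
            vec![],
            Duration::from_secs(1),
        )
        .unwrap();
        Action::Pause
    }
}

impl Context for CacheFilter {
    fn on_http_call_response(&mut self, _token: u32, _headers: usize, body_size: usize, _trailers: usize) {
        // Store the fetched entry for everyone, then resume *this* request only;
        // identical requests paused on other worker threads receive no event.
        if let Some(body) = self.get_http_call_response_body(0, body_size) {
            let _ = self.set_shared_data(&self.key, Some(body.as_slice()), None);
        }
        self.resume_http_request();
    }
}
```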

mathetake commented 3 years ago

I don't think we can currently control other threads' contexts asynchronously. The solution might be just running a busy wait inside the VM (in the example above, T2 and T3 run busy loops waiting for the cache entry to be stored by T1).

Right now, there is no way to avoid multiple callouts from different threads at the same time.

Why do you need to avoid multiple callouts? In any case, if you have multiple instances of Envoy, the backend will get multiple requests anyway.

PiotrSikora commented 3 years ago

I don't think we can currently control other threads' contexts asynchronously. The solution might be just running a busy wait inside the VM (in the example above, T2 and T3 run busy loops waiting for the cache entry to be stored by T1).

...but then T2 and T3 are completely blocked and cannot process other requests for however long it takes to fetch a response.

@rahulanand16nov it's not perfect, but I think you could use timers and re-check the cache every few milliseconds. Note that you'll need to keep track of which context(s) to resume for each cache key in your code.
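
For illustration, a rough sketch of that timer-based workaround, assuming the 0.1.x Rust SDK and that a paused context can be resumed from the root context via the SDK's `set_effective_context` and `resume_http_request` hostcalls. Each worker thread keeps its own list of paused context ids per cache key (a `thread_local` works because each Wasm VM is single-threaded), and the root context polls shared data on every tick. The 5 ms period and the `x-cache-key` header are arbitrary choices:

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::time::Duration;

use proxy_wasm::hostcalls;
use proxy_wasm::traits::{Context, HttpContext, RootContext};
use proxy_wasm::types::{Action, LogLevel};

thread_local! {
    // cache key -> HTTP context ids paused on this worker thread
    static WAITERS: RefCell<HashMap<String, Vec<u32>>> = RefCell::new(HashMap::new());
}

#[no_mangle]
pub fn _start() {
    proxy_wasm::set_log_level(LogLevel::Info);
    proxy_wasm::set_root_context(|_| -> Box<dyn RootContext> { Box::new(PollerRoot) });
    proxy_wasm::set_http_context(|context_id, _| -> Box<dyn HttpContext> {
        Box::new(CacheFilter { context_id })
    });
}

struct PollerRoot;

impl RootContext for PollerRoot {
    fn on_vm_start(&mut self, _vm_configuration_size: usize) -> bool {
        // Re-check the cache every 5 ms (an arbitrary period).
        hostcalls::set_tick_period(Duration::from_millis(5)).unwrap();
        true
    }

    fn on_tick(&mut self) {
        WAITERS.with(|w| {
            w.borrow_mut().retain(|key, contexts| {
                match hostcalls::get_shared_data(key) {
                    // The entry arrived: wake every context paused on this key.
                    Ok((Some(_value), _cas)) => {
                        for &ctx in contexts.iter() {
                            let _ = hostcalls::set_effective_context(ctx);
                            let _ = hostcalls::resume_http_request();
                        }
                        false // drop this key from the wait list
                    }
                    // Still missing (or an error): keep waiting.
                    _ => true,
                }
            });
        });
    }
}

impl Context for PollerRoot {}

struct CacheFilter {
    context_id: u32,
}

impl HttpContext for CacheFilter {
    fn on_http_request_headers(&mut self, _num_headers: usize) -> Action {
        let key = self
            .get_http_request_header("x-cache-key")
            .unwrap_or_default();
        if let (Some(_hit), _cas) = self.get_shared_data(&key) {
            return Action::Continue;
        }
        // Miss: record ourselves as a waiter for this key and pause. Whichever
        // request performs the actual callout (not shown) stores the result in
        // shared data, and the root context's timer resumes us.
        let id = self.context_id;
        WAITERS.with(|w| w.borrow_mut().entry(key).or_default().push(id));
        Action::Pause
    }
}

impl Context for CacheFilter {}
```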

unleashed commented 3 years ago

@PiotrSikora @mathetake is there any plan or interest in exposing APIs with synchronization semantics to WASM modules, so that these use cases can be implemented cleanly in modules and efficiently in the host, without busy-waiting?

rahulanand16nov commented 3 years ago

@mathetake That's one solution, but it will render those threads unable to service other requests, as mentioned by @PiotrSikora.

@rahulanand16nov it's not perfect, but I think you could use timers and re-check the cache every few milliseconds. Note that you'll need to keep track of which context(s) to resume for each cache key in your code.

That's the only viable option available to us that lets the waiting threads do some useful work. Yeah, I agree it's not perfect, but it's a workaround for now.

In case someone is looking for an implementation in the future: https://github.com/3scale-labs/rahul-workspace/tree/main/unique-requests-poc

Also, plus one for what @unleashed asked.

PiotrSikora commented 3 years ago

@unleashed @rahulanand16nov that depends on what exactly you have in mind.

Direct control over contexts in other worker threads is definitely something that we don't want to do.

However, signaling via shared data is already supported using Shared Queues, with the caveat that only the last worker thread that called proxy_register_shared_queue is notified about new items via the proxy_on_queue_ready callback. That isn't great, since all the items are sent to a single worker thread. I was planning on adding explicit round-robin and broadcast options in future versions of the ABI, but seeing that broadcast shouldn't break any existing plugins (at worst, we get a thundering herd problem, but only one worker can dequeue a given item, so it's a benign side effect), perhaps we could change the current implementation to that mode. Note that only worker threads that called proxy_register_shared_queue, and not proxy_resolve_shared_queue, would be notified in the broadcast mode. @mathetake any concerns?
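
For reference, a small sketch of what that shared-queue signaling looks like from the Rust SDK side (the queue name `cache-events` is arbitrary). With today's implementation, on_queue_ready fires only on the worker thread that registered last; under the proposed broadcast mode every registering worker would get it, and whichever thread dequeues a given item first consumes it:

```rust
use proxy_wasm::hostcalls;
use proxy_wasm::traits::{Context, RootContext};

#[no_mangle]
pub fn _start() {
    proxy_wasm::set_root_context(|_| -> Box<dyn RootContext> {
        Box::new(QueueRoot { queue_id: None })
    });
}

struct QueueRoot {
    queue_id: Option<u32>,
}

impl RootContext for QueueRoot {
    fn on_vm_start(&mut self, _vm_configuration_size: usize) -> bool {
        // Every worker thread registers the same named queue...
        self.queue_id = hostcalls::register_shared_queue("cache-events").ok();
        true
    }

    fn on_queue_ready(&mut self, queue_id: u32) {
        // ...but with the current implementation this callback only fires on the
        // worker thread that registered last. Under broadcast mode it would fire
        // on every registering worker; only one of them dequeues a given item.
        while let Ok(Some(event)) = hostcalls::dequeue_shared_queue(queue_id) {
            // e.g. treat `event` as a cache key that was just populated and resume
            // the contexts waiting on it (bookkeeping omitted here).
            let _ = event;
        }
    }
}

impl Context for QueueRoot {}

// Producer side: after storing the fetched entry in shared data, signal waiters.
fn notify_waiters(queue_id: u32, cache_key: &str) {
    let _ = hostcalls::enqueue_shared_queue(queue_id, Some(cache_key.as_bytes()));
}
```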

unleashed commented 3 years ago

@PiotrSikora I don't think @rahulanand16nov wants or needs direct control over other worker threads, just synchronization semantics to efficiently wait for some/arbitrary events (so perhaps it's best to change the title of this issue).

I think he'd be looking for something similar to the broadcast mechanism you laid out above: he just needs to put a number of threads to sleep on an event and then wake them up so they can pick up the new data and make progress. That said, I'm sure people will come up with use cases where they'd benefit from waking up a single thread only.

@rahulanand16nov with the current state of affairs I think you could hack something on top of the limited queue mechanism by establishing a chain of threads to wake up - might be worth a thought.

mathetake commented 3 years ago

However, signaling via shared data is already supported using Shared Queues, with the caveat that only the last worker thread that called proxy_register_shared_queue is notified about new items via the proxy_on_queue_ready callback. That isn't great, since all the items are sent to a single worker thread. I was planning on adding explicit round-robin and broadcast options in future versions of the ABI, but seeing that broadcast shouldn't break any existing plugins (at worst, we get a thundering herd problem, but only one worker can dequeue a given item, so it's a benign side effect), perhaps we could change the current implementation to that mode. Note that only worker threads that called proxy_register_shared_queue, and not proxy_resolve_shared_queue, would be notified in the broadcast mode. @mathetake any concerns?

IIUC, proxy_on_queue_ready is already called for the same vm_id in all threads with the current implementation, which means we are already broadcasting?

PiotrSikora commented 3 years ago

IIUC, proxy_on_queue_ready is already called for the same vm_id in all threads with the current implementation, which means we are already broadcasting?

No, that dispatches it only on a single thread.

https://github.com/proxy-wasm/proxy-wasm-cpp-host/blob/03185974ef574233a5f6383311eb74a380146fe2/src/shared_queue.cc#L91-L93

https://github.com/envoyproxy/envoy/blob/c98cd1320d7aed7bfa1de2a8313d1d116e68833a/source/extensions/common/wasm/wasm.cc#L200-L203

mathetake commented 3 years ago

Oh, I see, thanks! No concerns from me about changing the current implementation to the broadcast mode. That sounds reasonable, plus I see the actual use case here.

PiotrSikora commented 3 years ago

@rahulanand16nov @unleashed FYI, I don't believe anyone is working on fixing this; is this something that you could do?

unleashed commented 3 years ago

@rahulanand16nov @NomadXD can you guys give it a go?

rahulanand16nov commented 3 years ago

@PiotrSikora @unleashed We (@NomadXD and I) will give it a try and keep you guys posted here.

rahulanand16nov commented 3 years ago

@PiotrSikora @mathetake Sorry for the late update; we have been slowly ramping up on the code and brainstorming how to implement this alongside other work for the past two weeks. We were thinking of moving towards a broadcast-group implementation where any thread can register its shared queue without changing the existing queue implementation in the ABI, i.e., via a new proxy_.... function. I think adding a new function won't affect the stability of the ABI, which is the reason ABI changes are discouraged in the first place.

We used a similar feature in our GSoC project, in which these groups were entries in the shared data with a list of queue ids as the value. This allowed us to broadcast to a group of threads without requiring changes to the ABI. This is also why we are hesitant about the broadcast implementation: it increases non-useful work by requiring other threads to filter each notification to check whether it's meant for them.

To increase the degree of freedom and allow more use cases, we propose an event registration mechanism: a user can register for predefined events (callouts, queue events, etc.) and supply a callback to trigger when that event occurs. An extension of this idea is similar to streaming queries, in which a pattern of events is defined together with a trigger. Note that broadcast can be implemented here as well, when multiple threads register for the same event.
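
For context, a sketch of the shared-data "broadcast group" idea described above, using only the existing ABI via the Rust SDK hostcalls: each worker thread registers its own uniquely named queue and appends the returned queue id to a membership list kept under a shared-data key, and the thread whose callout completes enqueues a wake-up message to every member. The key name `broadcast-group`, the per-thread queue naming, and the little-endian u32 encoding are illustrative choices, not part of any spec:

```rust
use std::convert::TryInto;

use proxy_wasm::hostcalls;

const GROUP_KEY: &str = "broadcast-group"; // hypothetical shared-data key

// Called once per worker thread (e.g. from on_vm_start): register this thread's
// own queue and append its id to the membership list kept in shared data.
// `queue_name` must be unique per worker thread, e.g. derived from the root
// context id.
fn join_group(queue_name: &str) -> u32 {
    let queue_id = hostcalls::register_shared_queue(queue_name).unwrap();
    loop {
        // Use the CAS token so concurrent joins from other threads don't clobber
        // each other's updates to the membership list.
        let (existing, cas) = hostcalls::get_shared_data(GROUP_KEY).unwrap();
        let mut members = existing.unwrap_or_default();
        members.extend_from_slice(&queue_id.to_le_bytes());
        if hostcalls::set_shared_data(GROUP_KEY, Some(members.as_slice()), cas).is_ok() {
            return queue_id;
        }
        // CAS mismatch: another thread updated the list first; retry.
    }
}

// Called by the thread whose callout completed: wake every member by enqueueing
// the cache key that just became available onto each registered queue.
fn broadcast(cache_key: &str) {
    if let (Some(members), _cas) = hostcalls::get_shared_data(GROUP_KEY).unwrap() {
        for chunk in members.chunks_exact(4) {
            let queue_id = u32::from_le_bytes(chunk.try_into().unwrap());
            let _ = hostcalls::enqueue_shared_queue(queue_id, Some(cache_key.as_bytes()));
        }
    }
}
```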

mathetake commented 3 years ago

If we really want to add that kind of new ABI, then the discussion should definitely go to the proxy-wasm/spec repo and not happen here. Also, @PiotrSikora is working on the next version of the ABI, and we should keep consistency with that.

As for the change, let's just switch the current behavior of Envoy (posting to one thread) to the broadcast mode. That would not require any change outside the Envoy code base, and it is something we already agreed on here (https://github.com/envoyproxy/envoy/issues/17576#issuecomment-897521681).

rahulanand16nov commented 3 years ago

Sure, it makes sense to take the ABI-related conversation to the spec repo. For now, we'll focus on getting the broadcast implementation out ASAP.

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

unleashed commented 3 years ago

I see the referenced PR has been closed and no other link exists, while the issue has gone stale. Is this issue resolved elsewhere, or is it really stale, with no one working on it and no plans to move forward?

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] commented 2 years ago

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.