envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0
24.82k stars 4.77k forks source link

Static/dynamic analysis of data plane exceptions on worker threads #14320

Open htuch opened 3 years ago

htuch commented 3 years ago

In general, there should be no exceptions on worker threads, and they should only happen on the main thread. To validate this at runtime (and during test runs which should catch most instances), I propose we replace all:

try {
  ...
} catch (..) {..}

in Envoy with

envoy_try {
  ...
} catch (..) {..}

where envoy_try is something like:

#define envoy_try \
  ASSERT(gettid() == main_thread_tid); \
  try

This bug tracks this proposal and implementation work. There's probably a number of data plane exceptions which still happen on worker threads, which need to be fixed before the ASSERT can be merged, but we can convert to the new macro to facilitate this. We would also augment check_format to catch any raw try statements.

@envoyproxy/maintainers WDYT? CC @chaoqin-li1123 @asraa

alyssawilk commented 3 years ago

I like it, but wouldn't we need an integration test to catch the failure? Maybe swap it for ENVOY_BUG so if there are any slip-ups they're caught (without crashing) in prod?

If someone is willing to doc up the current use cases, we could maybe run a fixit, or encourage new Envoy devs to pick them up - things like fixing Utility::getResponseStatus would be good intro projects to Envoy

Do you think it would also be worth splitting out data plane code from control plane code by location or naming convention or some such? That way we could also catch it with fix format scripts, and avoid my integration test concerns above.

htuch commented 3 years ago

Why wouldn't existing integration tests catch many of the failures? The only issue is that coverage is often in unit tests vs. integration tests, but it's probably not in scope to increase integration test coverage to 100%. Agreed on the ENVOY_BUG instead of ASSERT.

alyssawilk commented 3 years ago

We only throw on unexpected/invalid behavior and those corner cases are frequently unit tested rather than integration tested. I'd be pleasantly surprised if even with ENVOY_BUG it caught even half the issues in CI, especially after a quick scan of where we throw exceptions.

htuch commented 3 years ago

The aim of wrapping the try, vs. the throw, is that we catch every site that might potentially throw, rather than wait until we actually see exceptions.

alyssawilk commented 3 years ago

oh yeah, you're right, I totally misread this.

So to do this, we'd have to remove the catch we still have in the codec dispatch. I think that's pretty dangerous unless we have plenty more safeguards making sure utilities (like the example I posted) don't throw. Alternately we could leave in that catchall, but I think it mostly defeats the purpose if we still have a catchall catch since I think most throws depend on that?

htuch commented 3 years ago

After some offline discussion with @alyssawilk the goal here is not to remove top-level try, but rather to look for localized uses of try/throw patterns, e.g. in utilities that indicate a failure path by throw.

Another suggestion here is to write a CodeQL scanner that ensures that every throw has an enclosing try, and we can mark top-level try blocks (e.g. around dispatch) as indicative of a check failure.

htuch commented 3 years ago

An example of this class of data plane exception use is https://github.com/envoyproxy/envoy/blob/8188e232a9e0b15111d30f4724cbc7bf77d3964a/source/common/network/io_socket_handle_impl.cc#L283. Ideally we can identify and fix these cases and stop them creeping back in.

We should also ENVOY_BUG on throws to catch places that are not correctly wrapped.

antoniovicente commented 3 years ago

Seems like a good direction. Can we also have a clang-tidy check for use of try?

alyssawilk commented 3 years ago

I'm not sure how we'd do that if we continue allow try to be used for control plane code (hence why I suggested splitting files out) Or maybe it would be possible to just say new data plane code should avoid try/catch as well, and blacklist new instances, but I think that'd be more controversial.

snowp commented 3 years ago

Yeah without some delineation between "this code is ok to run on the data plane" it's hard to prevent accidental usage of throwing utility functions. It's very easy to throw in something like a status code conversion function that appear innocent in code review but that was actually intended for use within a try-catch.

Splitting the code base between data/control plane seems a bit heavy handed (what about all the code that is safely shared?), but it is a fairly low tech solution. Conceptually I would imagine you could make use of static analysis and annotations would be good (e.g. void foo() DATAPLANE_SAFE; that makes it noexcept and ties into linters), which would mean having to explicitly tag all functions that can be used in the dataplane from shared code explicitly. Might be too complicated to get right though, I bet there are plenty of edge cases.

Having the first step being to try to find existing violations makes sense to me, then we can work on prevention in the future.

htuch commented 3 years ago

Control plane uses happen on the main thread (at least those that should be allowed to throw), data plane on worker threads. So, if we assert based on TID we can catch this. Admin endpoint is probably an exception that arguably shouldn't be on main thread (we don't want the admin endpoint to lock-up on large config updates for example).

I agree that new data plane code should avoid try/catch, the main issue is around utilities and nested stacks. Better structure as well would help here, but it will be a lot of work to get to the point that we could use things like build visibility rules to enforce (but arguably a good north star).

alyssawilk commented 3 years ago

" So, if we assert based on TID we can catch this."

where "this" is new try blocks being added, not new exceptions being added, right?. I don't object, but I'm convinced it'll catch enough to be worth the churn.

If we agree that new data plane code should avoid try/catch, I think check_format or clang tidy checks which disallow try/catch by default but could be overridden for legit reasons would result in more consistent future-proofing. We can skip the check for refactor PRs, but by default just disallow new additions.

htuch commented 3 years ago

If we agree that new data plane code should avoid try/catch

How do we know if a utility is safe to use on data plane or control plane? I think this is the crux of the problem.

alyssawilk commented 3 years ago

How many utilities have local try-catch blocks, and wouldn't be caught by the audit you'd need to do to land that macro in the first place? I'm guessing not many, even without restricting what we look at to not-obviously based utilities.. Based on my knowledge of what's integration tested I'm dubious there'd be value add beyond the audit and fix_format checks. I'm not going to block the PR if you find someone to do the work, I just don't think it's going to get us the most exception-saftey-bang for our proverbial buck.

htuch commented 3 years ago

and wouldn't be caught by the audit you'd need to do to land that macro in the first place?

I think 100% of them would be caught, that is the plan outlined today. The idea is to provide a regression framework for new ones.

I think it's becoming clear that a combination of things would make sense here. Worker thread ID-based ENVOY_BUG for both try/throw uses. CodeQL checks to verify that all throws are in fact caught (and not by the top-level dispatch). Annotations at a function level indicating exception safety or data/control plane compatibility.

I've updated the title to make clear the more general set of options. I think I'd defer to the person doing the actual work to prioritize amongst these.

htuch commented 3 years ago

@chaoqin-li1123 has kindly volunteered to do some initial work on this issue.

chaoqin-li1123 commented 3 years ago

Thanks! I will start with the thread id assertion macros and try to replace the raw try catch block one by one.

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

chaoqin-li1123 commented 3 years ago

The next step will be adding the try macros into the codebase and also some examples to demonstrate its use.

define envoy_try \

ASSERT(Thread::MainThread::isMainThread()); \ try But currently the implementation of main thread verification utility is tightly coupled with thread local instance. How do we make the main thread checking work with unit test? Providing a fake thread local instance can introduce much overhead in testing.

chaoqin-li1123 commented 3 years ago

We can also remove exception from parseInternetAddressAndPort(https://github.com/envoyproxy/envoy/blob/659f2ea4e404dd3227eeac02766622951013ed6a/source/common/network/utility.cc#L206), parseInternetAddress(https://github.com/envoyproxy/envoy/blob/659f2ea4e404dd3227eeac02766622951013ed6a/source/common/network/utility.cc#L140). Both interfaces take ip string as input and return shared_ptr to an instance. Currently, when the parsing of an IP address fails, throwWithMalformedIp is called to throw an EnvoyException. The exception can be removed by having the interface return nullptr upon a parsing failure and check the returned ptr inside the caller.

htuch commented 3 years ago

It doesn't make sense to verify this behavior in unit tests, I think you would want a condition that only enforces when we are actually in an environment built via a server instance. I agree on IP address parsing; this is an example of a utility that can easily start as being control plane only and then end up in use on the data plane.

chaoqin-li1123 commented 3 years ago

we can have something like ASSERT(Thread::threadingOff() || Thread::MainThread::isMainThread()); Thread::threadingOff() will return true by default before we initialize a thread local instance. The main thread id is verified only after a tls instance is initialized. Or we can use #ifdef to avoid its inclusion in unit test by preporocessor? The former solution makes more sense to me because it is hard to distinguish between unit test and integration test and even some unit tests include threading.

jmarantz commented 3 years ago

Why do you need to disable thread-id checking in unit tests? At least if you are checking that the thread you are in matches the thread ID of the main dispatcher, that would be true in unit tests.

htuch commented 3 years ago

I guess this is true; my main concern is that if we have any thread checking that needs to verify we are in worker threads, then this wouldn't make sense in unit tests, but given the logic we want to check here is that we are on the main thread, that could work.

chaoqin-li1123 commented 3 years ago

Because some unit tests doesn't have a dispatcher and a thread local instance, and main thread singleton is initialized in the constructor of tls instance. (https://github.com/envoyproxy/envoy/blob/40c44e5bffdfe4dbf86dcaa67bed353473bdf196/source/common/thread_local/thread_local_impl.h#L22) At least with the current implementation, that couldn't work.

jmarantz commented 3 years ago

Can you give a specific example of a unit-test that lacks the context needed for this test? We should be able to inject the right context somehow.

Your code example ASSERT(Thread::threadingOff() || Thread::MainThread::isMainThread()); indicates a reliance on globals/singletons/implicit state. I think you'll have an easier time thinking about this if you are not referencing global state but instead injecting what you need.

If you point out a specific case I could make suggestions on how I'd do that.

chaoqin-li1123 commented 3 years ago

For example, in a utility test, there is no main dispatcher. (https://github.com/envoyproxy/envoy/blob/40c44e5bffdfe4dbf86dcaa67bed353473bdf196/test/common/network/utility_test.cc#L40) ASSERT(Thread::MainThread::isMainThread()); will fail because it has not been initialized,

chaoqin-li1123 commented 3 years ago

A tls instance has to be constructed for the assertion to pass. I agree that we can inject something in this case, I am still thinking about it.

jmarantz commented 3 years ago

IMO you should avoid ASSERT(Thread::MainThread::isMainThread()) as a pattern, precisely because it creates non-obvious dependencies on initializing the threading system before running the tests.

Are you implying that we need to have this thread-checking in Utility::hostFromUdpUrl? Why do we need checking for threads in what looks like an algorithmic function?

If that algorithmic function is really depending on some hidden state, let's make the state be explicit and then when we construct it we can provide the threading/dispatching context needed to do the checi.

On Sun, Jan 10, 2021 at 11:06 PM chaoqin-li1123 notifications@github.com wrote:

A tls instance has to be constructed for the assertion to pass. I agree that we can inject something in this case, I am still thinking about it.

— You are receiving this because you are on a team that was mentioned. Reply to this email directly, view it on GitHub https://github.com/envoyproxy/envoy/issues/14320#issuecomment-757613273, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAO2IPLGPHFY2AS44XYRF7LSZJ2KPANCNFSM4USBM25A .

chaoqin-li1123 commented 3 years ago

This may not be a perfect example because a worker thread is not going to parse a udpaurl. I guess we may want to check the thread id when threading is on and an exception will be thrown. We want to remove exception from the data plane, that's why we want to do the checking. This is an algorithmic function, but some algorithmic functions can be involved and throw exceptions in worker thread, which is what we want to avoid.

chaoqin-li1123 commented 3 years ago

To provide some context, we want to replace try with ASSERT(isMainThread); try { So that all exceptions are thrown in main thread.

jmarantz commented 3 years ago

Actually do you need to assert that you are on the main thread at all the places exceptions are thrown? That might be hard because library functions can throw.

Instead maybe you could assert main-thread at all places excepts are caught. There are probably be fewer of them, and maybe they'll have enough context to access the dispatcher. WDYT?

chaoqin-li1123 commented 3 years ago

I see your point. Many library functions don't have enough threading context, but there may be enough context where the exceptions are caught. Actually I don't want to assert in all the throw, I plan to assert in all the try, I guess that is almost equivalent to "assert main-thread at all places excepts are caught". I will do some investigation to see whether we have the necessary context in all the places where exceptions are caught. Good night!

chaoqin-li1123 commented 3 years ago

I have a proposal somehow related to this issue. Currently, the constructor of EnvoyException https://github.com/envoyproxy/envoy/blob/1d1b708c7bf6efa02c41d9ce22cbf1e4a1aeec2c/include/envoy/common/exception.h#L12 takes a const string reference as argument, which means that when we pass in a string literal, like EnvoyException("error message"), a tempory string object will be constructed, and this string will be copied by the constructor of std::runtime_error, the super class of EnvoyException. That means we are allocating the memory twice. Do you think it reasonable to avoid redundant string creation by adding another constructor EnvoyException(const char * message) : std::runtime_error(message) {}

htuch commented 3 years ago

@chaoqin-li1123 probably not worth micro-optimizing, since this is the unhappy path. That said, I think we're trying to move to absl::string_view everywhere, and I'm guessing this probably has a cheap temporary construction, so I'd switch to that as the string reference type.

chaoqin-li1123 commented 3 years ago

Make sense. The cost of creating a tempory string is small when compared to all the stack unwinding in an exceptional scenario.

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] commented 3 years ago

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

chaoqin-li1123 commented 3 years ago

The next PR will change the error propagation of existing envoy code. For codes executed on worker thread, replace try catch with other error propagation mechanism. When all the try catch in existing code has be removed or replaced with try with main thread assertion, we can add format checking to disallow raw try.