envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0
24.71k stars 4.76k forks source link

Preventing Tag Extracted Name Collisions #2687

Open mrice32 opened 6 years ago

mrice32 commented 6 years ago

I'd like to explore ways to prevent tag extracted name collisions in Envoy since there have been multiple issues brought up wrt that (see #2683 and #2141)

Description: I want to start by defining a collision in this context. Stats in Envoy are created with a single string. Multiple strings can map to the same tag extracted name when the stats are logically similar (only differ by tag values). This is expected. However, it is also assumed that all strings that map to the same tag extracted name have had the same named tags extracted from it - not the same tag values, but the same tag names. The issues above come up when this assumption breaks down because there are currently no checks to ensure the latter isn't violated. Pulling an example from #2683:

http.<stat_prefix>.fault.aborts_injected and http.<stat_prefix>.fault.<downstream-cluster>.aborts_injected both resolve to http.fault.aborts_injected. These are logically different stats because one represents the total of some metric and the other represents some slice of that metric. The same metric with the same name (tag extracted name in Envoy) and different tag values (in a tagged system) is, in my experience, used to represent a completely disjoint slice of the metric's space with no overlap with other stats with a different tag combination. Given that setup, this sort of collision seems undesirable.

As @mattklein123 mentioned in #2683 this is somewhat difficult to check this since statsd can be created anywhere and they are "keyed" in Envoy caches by their full string name, which won't prevent collisions relating to the tag-extracted name. A rough solution could look like adding a second map (next to the existing TLS cache) that maps a particular tag-extracted name to some sort of hash of the list of tag names for that tag extracted name (or could be approximated by just the number of tags). That would, however, add some complexity and overhead to the stat allocation process. What are others' thoughts on this?

mrice32 commented 6 years ago

cc @jmarantz @htuch @mattklein123 @ofek

jmarantz commented 6 years ago

What's the practicality of this point of changing the syntax of one of those stats so they don't conflict post-tag-extraction?

alternatively can we make the tag-extraction process leave them in distinct states?

mrice32 commented 6 years ago

@jmarantz I think you're talking about how to fix this particular collision - I'll follow up on that in #2683 because that issue is tracking that bug.

jmarantz commented 6 years ago

Well, kind of. If we can change the algorithm so that collisions cannot arise, that would obviate the need to detect it.

dio commented 6 years ago

@mrice32 Pardon me if I'm wrong, but do you think this one https://github.com/envoyproxy/envoy/issues/2597 is also related?

htuch commented 6 years ago

Not sure how the second map helps if ultimately we need to output disjoint sets of tag extracted values for the distinct stats that are colliding. I think the solution that @jmarantz suggests would disambiguate at the expense of changing the output format in a non-compatible way.

Other ideas:

  1. Rule it illegal to form such collisions, same way as is done in programming languages when you do undefined behavior.
  2. Provide well defined rules for who wins in a collision, blackhole problematic stats. Avoid any inbuilt stats that could have this problem.
  3. Force stats to be allocated at config load time, detect any collisions and raise a configuration exception if they happen.
mrice32 commented 6 years ago

@dio I'm not sure I understand prometheus well enough to make a judgement. Is the error message shown there implying that there are two stats with the same name, but different types?

@htuch To quickly clarify, the map would only serve the purpose of detecting collisions so we could ASSERT that they don't occur. This would allow us to catch these issues in integration tests. This is a categorically different from @jmarantz's solution.

  1. This is certainly the lowest lift way to do it. However, I think the difficulty here is that it's relatively easy to accidentally create stats that collide since it depends on the combination of the stats you create and the default regexes. I don't think people will be inclined to manually check, and these bugs will likely continue popping up.
  2. I think this is reasonable, but it would require some sort of detection.
  3. This is probably the solution that I would most prefer, but I think there are some stats that are not allocated at config load time now (I think some histograms are just created on demand). Like 2, this would also require some sort of collision detection.
ofek commented 6 years ago

Just fyi, in #2683 I found one final collision(s).

I've manually went through every metric Envoy collects and in that issue I've listed all collisions.

mattklein123 commented 6 years ago

I think doing some type of checking and assertion at minimum is probably worth it. Having a 2nd map doesn't seem like that big of a deal to me since stat allocation itself is already assumed to be the slow path. I think doing (3) is going to be very difficult without major changes to how stats are created. I'm not sure it's wort it. I feel like with checking we would catch many/all issues.

@mrice32 one thing to point out is that if you add checking, we should also add the same checking to IsolatedStatsStore as that is what is used in many/most tests.

Another option is to only do the tag checking map on debug builds since the goal is to ASSERT anyway. Just throwing that out there.

mrice32 commented 6 years ago

@ofek Thanks for doing all the manual searching!

@mattklein123 Makes sense. I'll be sure to add it to both stores. Yeah, as long as we don't mind too many #ifdefs floating around, I think adding it to just the debug builds is the right way to go.

dio commented 6 years ago

@mrice32 yes, this part of the sample output (duplications of TYPE envoy_listener_http_downstream_rq_xx counter) makes promlint doesn't happy:

# TYPE envoy_listener_http_downstream_rq_xx counter
envoy_listener_http_downstream_rq_xx{envoy_response_code_class="2",envoy_http_conn_manager_prefix="ingress_http",envoy_listener_address="0_0_0_0_80"} 2
# TYPE envoy_listener_http_downstream_rq_xx counter
envoy_listener_http_downstream_rq_xx{envoy_response_code_class="4",envoy_http_conn_manager_prefix="ingress_http",envoy_listener_address="0_0_0_0_80"} 551
# TYPE envoy_listener_http_downstream_rq_xx counter
envoy_listener_http_downstream_rq_xx{envoy_response_code_class="3",envoy_http_conn_manager_prefix="ingress_http",envoy_listener_address="0_0_0_0_80"} 0
# TYPE envoy_listener_http_downstream_rq_xx counter
envoy_listener_http_downstream_rq_xx{envoy_response_code_class="5",envoy_http_conn_manager_prefix="ingress_http",envoy_listener_address="0_0_0_0_80"} 0

The relevant test here https://github.com/envoyproxy/envoy/blob/master/test/integration/integration_admin_test.cc#L175 I believe should catch any duplication (but as commented in #2597, proper lint check by promlint to outputs of https://github.com/envoyproxy/envoy/blob/b8f2268e6bd1dd2a7cb80bde64267dc7c8384768/source/server/http/admin.cc#L430 could probably useful).

stale[bot] commented 6 years ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.