Open mrice32 opened 6 years ago
cc @jmarantz @htuch @mattklein123 @ofek
What's the practicality of this point of changing the syntax of one of those stats so they don't conflict post-tag-extraction?
alternatively can we make the tag-extraction process leave them in distinct states?
@jmarantz I think you're talking about how to fix this particular collision - I'll follow up on that in #2683 because that issue is tracking that bug.
Well, kind of. If we can change the algorithm so that collisions cannot arise, that would obviate the need to detect it.
@mrice32 Pardon me if I'm wrong, but do you think this one https://github.com/envoyproxy/envoy/issues/2597 is also related?
Not sure how the second map helps if ultimately we need to output disjoint sets of tag extracted values for the distinct stats that are colliding. I think the solution that @jmarantz suggests would disambiguate at the expense of changing the output format in a non-compatible way.
Other ideas:
@dio I'm not sure I understand prometheus well enough to make a judgement. Is the error message shown there implying that there are two stats with the same name, but different types?
@htuch To quickly clarify, the map would only serve the purpose of detecting collisions so we could ASSERT
that they don't occur. This would allow us to catch these issues in integration tests. This is a categorically different from @jmarantz's solution.
Just fyi, in #2683 I found one final collision(s).
I've manually went through every metric Envoy collects and in that issue I've listed all collisions.
I think doing some type of checking and assertion at minimum is probably worth it. Having a 2nd map doesn't seem like that big of a deal to me since stat allocation itself is already assumed to be the slow path. I think doing (3) is going to be very difficult without major changes to how stats are created. I'm not sure it's wort it. I feel like with checking we would catch many/all issues.
@mrice32 one thing to point out is that if you add checking, we should also add the same checking to IsolatedStatsStore
as that is what is used in many/most tests.
Another option is to only do the tag checking map on debug builds since the goal is to ASSERT anyway. Just throwing that out there.
@ofek Thanks for doing all the manual searching!
@mattklein123 Makes sense. I'll be sure to add it to both stores. Yeah, as long as we don't mind too many #ifdefs floating around, I think adding it to just the debug builds is the right way to go.
@mrice32 yes, this part of the sample output (duplications of TYPE envoy_listener_http_downstream_rq_xx counter
) makes promlint
doesn't happy:
# TYPE envoy_listener_http_downstream_rq_xx counter
envoy_listener_http_downstream_rq_xx{envoy_response_code_class="2",envoy_http_conn_manager_prefix="ingress_http",envoy_listener_address="0_0_0_0_80"} 2
# TYPE envoy_listener_http_downstream_rq_xx counter
envoy_listener_http_downstream_rq_xx{envoy_response_code_class="4",envoy_http_conn_manager_prefix="ingress_http",envoy_listener_address="0_0_0_0_80"} 551
# TYPE envoy_listener_http_downstream_rq_xx counter
envoy_listener_http_downstream_rq_xx{envoy_response_code_class="3",envoy_http_conn_manager_prefix="ingress_http",envoy_listener_address="0_0_0_0_80"} 0
# TYPE envoy_listener_http_downstream_rq_xx counter
envoy_listener_http_downstream_rq_xx{envoy_response_code_class="5",envoy_http_conn_manager_prefix="ingress_http",envoy_listener_address="0_0_0_0_80"} 0
The relevant test here https://github.com/envoyproxy/envoy/blob/master/test/integration/integration_admin_test.cc#L175 I believe should catch any duplication (but as commented in #2597, proper lint check by promlint
to outputs of https://github.com/envoyproxy/envoy/blob/b8f2268e6bd1dd2a7cb80bde64267dc7c8384768/source/server/http/admin.cc#L430 could probably useful).
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.
I'd like to explore ways to prevent tag extracted name collisions in Envoy since there have been multiple issues brought up wrt that (see #2683 and #2141)
Description: I want to start by defining a collision in this context. Stats in Envoy are created with a single string. Multiple strings can map to the same tag extracted name when the stats are logically similar (only differ by tag values). This is expected. However, it is also assumed that all strings that map to the same tag extracted name have had the same named tags extracted from it - not the same tag values, but the same tag names. The issues above come up when this assumption breaks down because there are currently no checks to ensure the latter isn't violated. Pulling an example from #2683:
http.<stat_prefix>.fault.aborts_injected
andhttp.<stat_prefix>.fault.<downstream-cluster>.aborts_injected
both resolve tohttp.fault.aborts_injected
. These are logically different stats because one represents the total of some metric and the other represents some slice of that metric. The same metric with the same name (tag extracted name in Envoy) and different tag values (in a tagged system) is, in my experience, used to represent a completely disjoint slice of the metric's space with no overlap with other stats with a different tag combination. Given that setup, this sort of collision seems undesirable.As @mattklein123 mentioned in #2683 this is somewhat difficult to check this since statsd can be created anywhere and they are "keyed" in Envoy caches by their full string name, which won't prevent collisions relating to the tag-extracted name. A rough solution could look like adding a second map (next to the existing TLS cache) that maps a particular tag-extracted name to some sort of hash of the list of tag names for that tag extracted name (or could be approximated by just the number of tags). That would, however, add some complexity and overhead to the stat allocation process. What are others' thoughts on this?