envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0
24.68k stars 4.75k forks source link

Stat name collisions #2683

Open ofek opened 6 years ago

ofek commented 6 years ago

Description:

https://www.envoyproxy.io/docs/envoy/latest/configuration/http_filters/fault_filter#statistics

edit:

I found one more https://www.envoyproxy.io/docs/envoy/latest/configuration/cluster_manager/cluster_stats.html#alternate-tree-dynamic-http-statistics

Those resolve to the same as https://www.envoyproxy.io/docs/envoy/latest/configuration/cluster_manager/cluster_stats.html#dynamic-http-statistics

mrice32 commented 6 years ago

I'd like to broaden this issue a bit since a collision has happened before here. In addition to correcting the collision above, we should add ASSERTs to ensure they don't happen anywhere else. Also, because the tag system is configurable and we want to maintain its flexibility, maybe it should be possible to disable these ASSERTs as well, but I'm less sure if that makes sense. Thoughts?

cc @mattklein123 @jmarantz @htuch

mattklein123 commented 6 years ago

@ofek I'm confused how there is a collision here. I don't think there is. Can you expand?

@mrice32 sure we can do collision detection somehow, though TBH I'm not sure exactly how since the stats system is built such that multiple code sites can access the same stat. Can you open a fresh issue for discussing that?

ofek commented 6 years ago

@mattklein123 The tag-extracted metric names are the same, so there is no way to discern the meaning unless meaning is influenced by the number of tags, which is not right.

mrice32 commented 6 years ago

@mattklein123 See #2687.

mattklein123 commented 6 years ago

@ofek yup OK sorry didn't realize this was a tag extraction collision. I will defer to @mrice32 and @taiki45 on how to resolve.

mrice32 commented 6 years ago

@jmarantz I think you were asking whether we should change the names of those stats or modify the extraction process to fix this collision instance. tl;dr Without changing the names of the stats themselves, I'm not sure how to fix this instance.

I think it will be difficult to use the same sort of regex fix that we used in #2141 here because that took advantage of Envoy-defined structure in the tag value. The tags here are user defined, so we can't assume any structure for them. The current system doesn't allow tag extraction to add anything to the name - only remove, which limits our mitigation options here.

jmarantz commented 6 years ago

So I still don't really understand the bounds of the solution space.

What if the tag extraction process didn't merely remove the tag values, but replaced them with the values with the names of the extracted tags? Of course that would be a stats_impl change that would require some effort, but I don't know who is dependent on the current exact syntax of the tag-extracted names.

mrice32 commented 6 years ago

@jmarantz That's an interesting idea. I have no problem with something like that, although IMO, it makes the names a little less intuitive to read if they're being directly ingested by other stats systems. I'll pull in a few folks who have worked on integrating with other stats systems to hear their thoughts: @srvaroa @taiki45 @JonathanO @lita. Feel free to pull in others who may have opinions here.

ofek commented 6 years ago

@jmarantz @mrice32 Can you please give an example of what that might look like? Also fyi, @DataDog will (using the /stats endpoint) soon depend on tag-extracted names being unique metric names.

ofek commented 6 years ago

I found one more https://www.envoyproxy.io/docs/envoy/latest/configuration/cluster_manager/cluster_stats.html#alternate-tree-dynamic-http-statistics

Those resolve to the same as https://www.envoyproxy.io/docs/envoy/latest/configuration/cluster_manager/cluster_stats.html#dynamic-http-statistics

mrice32 commented 6 years ago

@ofek An example of @jmarantz's proposed solution is that http.my_custom_prefix.fault.delays_injected resolves to http.envoy.http_conn_manager_prefix.fault.delays_injected as its tag extracted name. This means that collisions will essentially never occur.

Also, both solutions will affect user visible stats outputs, so I'm curious what Envoy's change policy is with respect to stats? Should we go through a deprecation cycle where we have the old name and new name coexist for a version?

ofek commented 6 years ago

@mrice32 @jmarantz Using the /stats endpoint, how could one derive my_custom_prefix from envoy.http_conn_manager_prefix? If you can't, then the output would be more or less useless to consumers of the stats API.

mrice32 commented 6 years ago

@ofek http.envoy.http_conn_manager_prefix.fault.delays_injected would be the new tag extracted name (this example would resolve to http.fault.delays_injected today). The vector of tags and tag names would still accompany the tag extracted name as they do today, so envoy.http_conn_manager_prefix would be listed as a tag in that vector with a tag_value of my_custom_prefix.

ofek commented 6 years ago

Ah, now I understand. Yes that envoy prefix indication of the tag would be great.

It would be really, really useful if the json output had a format like:

{
  "name": "http.fault.delays_injected",
  "value": 0,
  "tags": {
    "stat_prefix": "my_custom_prefix"
  }
}
github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] commented 3 years ago

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

ofek commented 3 years ago

Are collisions now impossible?

htuch commented 3 years ago

Yeah, I don't think this has changed. Reopened and tagged "help wanted". @jmarantz is there any suggested resolution on your side?